ConStrains identifies microbial strains in metagenomic datasets

Luo, Chengwei; Knight, Rob; Siljander, Heli; Knip, Mikael; Xavier, Ramnik J; Gevers, Dirk

doi:10.1038/nbt.3319

Analysis
Published: 07 September 2015

Chengwei Luo, Rob Knight (ORCID: orcid.org/0000-0002-0975-9019), Heli Siljander, Mikael Knip (ORCID: orcid.org/0000-0003-0474-0033), Ramnik J Xavier & Dirk Gevers

Nature Biotechnology volume 33, pages 1045–1052 (2015)

Abstract

An important fraction of microbial diversity is harbored in strain individuality, so identification of conspecific bacterial strains is imperative for improved understanding of microbial community functions. Limitations in bioinformatics and sequencing technologies have to date precluded strain identification owing to difficulties in phasing short reads to faithfully recover the original strain-level genotypes, which have highly similar sequences. We present ConStrains, an open-source algorithm that identifies conspecific strains from metagenomic sequence data and reconstructs the phylogeny of these strains in microbial communities. The algorithm uses single-nucleotide polymorphism (SNP) patterns in a set of universal genes to infer within-species structures that represent strains. Applying ConStrains to simulated and host-derived datasets provides insights into microbial community dynamics.

Main

Understanding how individual organisms coexist within a microbial community is crucial to understanding community functions. For example, the study of microbial community dynamics is important in human health, including how to maintain or restore a healthy human microbiome. Metagenomics has revolutionized microbiology by addressing some of these issues in a culture-independent manner. However, state-of-the-art metagenomics approaches are often limited to the species level^1,2,3 or to partially assembled population consensus genomes^4,5,6. Evidence that the unit of microbial action can fall below the species level comes from multiple sources, including culturing⁷, single-cell genomics⁸, redundant sequencing of the bacterial gene encoding 16S rRNA⁹, sequencing of internal transcribed spacers¹⁰, multilocus sequence typing¹¹ and high-resolution analyses of genomic variation¹². Therefore methods that enable strain resolution from metagenomics datasets are desirable.

Most existing culture-free approaches to identify bacterial strains in communities have drawbacks, which has limited wide adoption of these approaches. For example, single-cell sequencing requires expensive and laborious efforts in cell sorting and suspension, and thus this approach is not used to analyze large communities. Similarly, Hi-C, a sequencing-based approach¹³, requires extra steps and budget for cross-linking, library construction and sequencing. Strain typing methods that leverage strain-level gene copy number variations¹⁴ or strain-level phylogenetic marker SNPs such as canSNPs¹⁵, PathoScope¹⁶ and Sigma¹⁷ rely on the availability of complete reference strain genomes and, with current limitations on these resources, face challenges in studies of the broader diversity found using metagenomic sequencing approaches. An assembly-based approach is dependent on several factors, including genome structure and intraspecies divergence. With rare exceptions, assemblers usually fail to produce individual strain assemblies, instead creating either highly fragmented contigs or contigs that only represent population consensus sequences^18,19; a recent effort in using variation-aware contig graphs for strain identification²⁰ relies on manual inspection, and hence its accuracy is subject to users' experience. In all of these approaches, only a relatively small fraction of strain genomes have been successfully analyzed, and their distribution is usually biased²¹. On the other hand, methods based on single marker genes such as the gene encoding 16S rRNA often lack the resolution to reliably capture intraspecific genomic differences²².

To overcome this difficulty and increase the utility of metagenomic datasets, we developed Conspecific Strains (ConStrains), an algorithm that exploits the polymorphism patterns in a set of universal bacterial and archaeal genes to infer strain-level structures in species populations. Using both in silico and previously published host-derived datasets, we found that ConStrains recovers intraspecific strain profiles and phylogeny with high accuracy, and captures important features of community dynamics including dominant strain switches and rare strains. The simulated datasets address performance in the context of different within-population diversities, different numbers of strains, the interference from other species within the same community, as well as the scalability of the method using a large in silico cohort with 322 samples. Predicted within-species structures as well as the strain genotypes were highly accurate across these simulated datasets. Applying this method to an infant gut development metagenomic dataset revealed new insights into strain dynamics with functional relevance. ConStrains is implemented in Python; the source code is provided as Supplementary Code and is freely available, together with full documentation, at https://bitbucket.org/luo-chengwei/constrains.

Results

The ConStrains algorithm

Guided by reference species, the ConStrains algorithm compares raw metagenomic reads to reference genomes and identifies patterns in SNPs as the basis for differentiation and quantification of conspecific strains. This approach is fundamentally different from other reference-dependent methods such as Sigma and PathoScope^16,17 that rely on availability of a comprehensive reference strain collection because, unlike those methods, ConStrains can provide reliable predictions for species with only one genome (complete or draft). For confident SNP calling, a species requires a minimum of tenfold coverage (Supplementary Fig. 1) within or across all samples considered, which is obtained for all species with a relative abundance of >1% at typical sequencing depths of 5 Gbp. When applied to multiple samples, for example, a longitudinal time series or otherwise related samples, strain identities can be traced across the different samples. The algorithm achieves this in two operations: (i) identifying species for which SNPs are detected and quantified, and (ii) transforming individual SNPs into SNP profiles that represent individual strains.
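The coverage requirement above can be checked with simple arithmetic. The following is a minimal sketch, assuming that relative abundance is the share of sequenced bases belonging to a species and assuming a hypothetical 5 Mbp genome; the function name and the genome size are illustrative and do not come from ConStrains.

```python
def expected_coverage(total_bases, relative_abundance, genome_size):
    """Approximate fold coverage of one species' genome in a metagenome.

    Assumes relative abundance is the fraction of sequenced bases that
    originate from the species (an assumption made for illustration).
    """
    return total_bases * relative_abundance / genome_size


# 5 Gbp of sequence, 1% relative abundance, hypothetical 5 Mbp genome
cov = expected_coverage(5_000_000_000, 0.01, 5_000_000)
print(f"{cov:.1f}x coverage")
```

Under these assumptions a species at 1% relative abundance sits right at the tenfold threshold; a larger genome or a shallower run would fall below it.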

The first operation is a two-step process. Because the algorithm identifies strains only for those species with sufficient sequencing depth (≥10-fold coverage in at least one sample; Supplementary Fig. 1), the first step uses MetaPhlAn¹ for rapid species composition profiling. For those species with sufficient sequencing depth, a custom database of marker genes is created from the comprehensive PhyloPhlAn marker set²³, against which the raw reads are mapped using Bowtie2 (ref. 24). This targeted approach allows for optimized time and computational efficiency. Resulting marker gene alignments are processed with SAMtools²⁵ to generate a table of coverage by base position from which SNPs are identified. It is important to note that in this process the reference sequences are removed, and SNPs are identified de novo to minimize reference dependency (Fig. 1a–d and Online Methods). We verified that such a SNP selection procedure is sufficiently accurate and uniquely sensitive to disentangle intraspecific diversity (Supplementary Note 1 and Supplementary Fig. 2).

**Figure 1: Overview of the ConStrains algorithm: from raw metagenomic data to strain profiles and uniGcodes.**

In the second operation, individual SNPs are combined into unique SNP profiles from which strains are identified. Previous methods have approached the challenge of identifying individual organisms from microbial communities using SNPs (for example, oligotyping²⁶ and minimum entropy decomposition²⁷), but were limited to SNPs within the span of a sequence read length. ConStrains overcomes this read-length limitation and represents each strain by a barcode-like string of concatenated SNPs spanning hundreds of genes, referred to as the 'uniGcode'. To derive the strain's uniGcodes in a dataset, ConStrains constructs candidate models of strain combinations using a combination of SNP-flow and SNP-type clustering algorithms. The relative abundance of strains in each candidate model across the cohort is estimated sequentially using a Metropolis-Hastings Markov chain Monte-Carlo approach (Fig. 1e–g and Online Methods). Finally, to choose the optimal model with the principle of balancing model fitness and complexity, corrected Akaike information criterion (AICc) is used (Fig. 1h and Online Methods). ConStrains repeats these steps for each species with sufficient coverage, then outputs the number of strains and their respective uniGcodes and relative abundances (Fig. 1i). The uniGcode allows downstream analysis such as cross-sample comparisons and evolutionary studies.

ConStrains identifies strains in large datasets

To validate the performance of ConStrains for strain profiling, we used in silico and host-derived datasets. We generated 36 different sets of k-strain mixtures using in silico genome-based Illumina paired-end read simulation based on ten different Escherichia coli strains whose complete genomes are publicly available, representing real-life scenarios of strain admixtures (k = 2–7; Fig. 2a,b, Supplementary Fig. 3a and Supplementary Table 1). We profiled these 36 sets of reads with ConStrains using default settings, and compared the predicted results with the 'true' strain compositions using Jensen-Shannon divergence (JSD; Fig. 2b and Supplementary Fig. 3b). ConStrains successfully predicted the underlying intraspecies compositions in all 36 datasets (P < 1 × 10⁻⁵; two-group t-test against random guesses; Fig. 2b), demonstrating a substantial advantage (Supplementary Fig. 4) over reference-based approaches, with an improvement of 0.191 JSD on average (Supplementary Note 1 and Supplementary Fig. 5). In 34 of the 36 sets of reads (94.44%), the numbers of strains inferred exactly matched the ground truth (Fig. 2a), with the remaining two sets of reads having an additional chimeric strain predicted at an extremely low level (<0.1%). We therefore set the recommended detection limit at 0.1% to reduce such errors computationally. As this is a relative abundance threshold, one can still target low-abundance organisms by increasing sequencing depth. In similar simulations with up to 30 E. coli strains, ConStrains predicted the strain composition with high confidence when the strain number was less than ten (Fig. 2c), which represents the intraspecific upper bound for more than 95% of metagenomic species (Fig. 2d and Supplementary Note 1).

To assess the impact of intraspecies recombination on performance, we generated both real sequencing reads from highly recombined Burkholderia pseudomallei strains²⁸ and in silico–simulated recombinant strain–derived reads, and identified no significant adverse impact (Supplementary Note 1). We also tested the performance in a more realistic metagenomic scenario by embedding E. coli strains in communities with various levels of complexity and found that our approach remained robust (Online Methods, Supplementary Note 2 and Supplementary Table 2). We also found no significant correlation between admixture compositions' alpha diversity and prediction accuracy. These results collectively suggested good algorithm performance (Supplementary Note 1).
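The Jensen-Shannon divergence used above to compare true and predicted strain compositions can be sketched as follows. This is a minimal illustration that assumes base-2 logarithms (so that JSD ranges from 0 to 1); the text does not state which base was used, and the function names are illustrative.

```python
from math import log2


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two compositions of equal length."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Illustrative compositions (not taken from the paper)
true_profile = [0.6, 0.3, 0.1]
predicted_profile = [0.55, 0.35, 0.10]
print(jsd(true_profile, predicted_profile))
```

Identical profiles give a JSD of 0, and profiles with no strain in common give 1 under the base-2 assumption.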

**Figure 2: ConStrains correctly predicts the strain composition of *in silico*–simulated datasets in a comparison of true and predicted strain composition profiles of *in silico*–simulated multistrain mixtures.**

We then tested ConStrains using a host-derived metagenomic dataset that had previously been analyzed using a manually curated strain-identification approach. Using manual strain curation, the authors had for the first time described the changes in an infant gut microbiome during the first 24 d of life⁴. All three manually curated Staphylococcus epidermidis strains reported in this study were successfully predicted by ConStrains in a fully automated manner, with the predicted relative abundances of each strain over time having highly similar values to the original compositions quantified from the scaffold coverage (JSD average = 0.024, s.d. = 0.021; Supplementary Fig. 6). Because the performance of ConStrains' fully automated approach matched well with the manually curated, semiautomated approach described previously⁴, but required far less machine and manual resources (ConStrains completed the infant gut dataset in 8.5 CPU hours with RAM peak footprint of <2 GB on a Linux server with Xeon 2.6 GHz processors, in contrast to days to weeks of manual curation after assembly), we next applied ConStrains to a very large dataset for which a manual effort would not be feasible (for detailed resource usage, see Supplementary Note 5 and Supplementary Table 3).

In the absence of such a large dataset (especially one where true results were known), we used a simulated shotgun dataset with intraspecific structure mimicking the natural relative abundance of taxa informed by a recent gut microbiome collection effort for which samples were collected daily over the course of one year²⁹ (Fig. 3a, Online Methods and Supplementary Note 3). ConStrains analyzed 91 species with sufficient depth in the 322 in silico samples. In total, ConStrains analyzed 3.2 terabases of paired-end reads containing 1,361 strains from 320 species, with minimal runtime and infrastructure requirements (Supplementary Note 3). ConStrains achieved high accuracy for individual samples, and also captured crucial information such as dominant strain type changes, for example in Bacteroides fragilis (Fig. 3a–c, Supplementary Table 4 and Supplementary Note 3). This large cohort also enabled us to test factors that might affect the performance of ConStrains, including population complexity, coverage and relatedness. We found that 10× coverage was necessary for accurate profiling and that strain relatedness could also affect performance (Supplementary Fig. 7 and Supplementary Note 3). With this thorough benchmarking, we next applied ConStrains to two previously published clinical datasets to illustrate the biological insights strain-level analyses can provide.

**Figure 3: ConStrains scales to large time series and accurately predicts strain dynamics.**

ConStrains reconstructs strain phylogeny

In a published report on the genetic variation of Burkholderia dolosa in cystic fibrosis patients, a selective culturing step had been combined with a deep population sequencing approach³⁰. We reanalyzed that dataset using our ConStrains algorithm and predicted a total of six B. dolosa strains in the population (strains abbreviated as pop-I to pop-VI; Fig. 4a) with an abundance well above 0.1%. We compared the uniGcodes from the six strains inferred by ConStrains with the isolate genome sequence by building a phylogenetic tree, and found that all of the colony strains and two population strains (pop-I and pop-II) were closely related (Fig. 4a). Moreover, the combined relative abundance of pop-I and pop-II represented the majority of the population (51.3% and 27.9% for pop-I and pop-II, respectively). This finding corroborated observations based on the colony sequencing approach. In addition, the ConStrains algorithm identified four additional, less abundant strains (pop-III to pop-VI). None of these strains could be inferred by the colony sequencing approach alone, likely because of their low abundance (11.2%, 8.1%, 1.0% and 0.5%, respectively). To validate these additional predictions, we further examined the polymorphism patterns in these four strains, and compared them against pop-I and pop-II. We found patterns that are unlikely to have resulted from chimeric mixtures of SNPs from pop-I and pop-II (P < 0.01, permutation test; Fig. 4b). This analysis demonstrated that application of ConStrains to cross-sectional datasets, used in addition to a culture-based approach, allows for a comprehensive and efficient discovery of rare strains.

**Figure 4: High-sensitivity identification of strain phylogeny using ConStrains in a reanalysis of data from a published study on the genetic variation of *B. dolosa* populations within cystic fibrosis patients³⁰.**

Uncovering strain dynamics in infant gut development

We next analyzed an infant gut development dataset containing 54 samples from 9 subjects (indexed subjects 1–9; Fig. 5) collected over the first three years of life (Online Methods and Supplementary Fig. 8) to further explore the ability of ConStrains to reveal strain dynamics. We ran a ConStrains analysis on 75 species that had sufficient sequencing depth for analysis (10×; Fig. 5). Because previously reported strain-detection algorithms had been limited to studying the population consensus sequences¹², and ConStrains has the capability to untangle intraspecies diversity, we first examined the number of strains observed in each species. Nearly all species (94.66%) had more than two strains, with an average of 4.88 strains per subject (±1.54 s.d.; Supplementary Fig. 9). By tracking the uniGcode of each strain in separate individuals, we identified several unique strain-level longitudinal patterns. For instance, the population of Fecalibacterium prausnitzii usually comprised several strains that maintained a co-dominant profile, in which the strains maintained the same order of abundance; Ruminococcus gnavus had highly variable behaviors over time, with different strains dominating the intraspecies composition at different time points; in contrast, Bacteroides ovatus contained one dominant strain over time, keeping other strains relatively rare. Bifidobacterium bifidum strains demonstrated dynamic patterns similar to those of F. prausnitzii; moreover, the strains reestablished the same intraspecific diversity even after the abundance of the species dropped below the detection limit (Fig. 5). We anticipate that the capability of generating better insights into the intraspecies dynamics of such health-related species^31,32,33 will shed light on the role of these organisms in human physiology.

**Figure 5: ConStrains analysis revealed species longitudinal dynamics and functional shifts within an infant gut development cohort.**

With this goal in mind, we pursued our findings in Bifidobacterium longum, an organism linked to human gut health and applied to prevention and treatment of several diseases³³. We first observed that the phylogeny of B. longum strains strongly correlated with their host origins (Fig. 5), which indicated strong individuality of B. longum strains. Moreover, in subjects 4 and 6 (Fig. 6a), we observed switches in dominant strain types that were highly correlated with the overall relative abundance of the B. longum species. As previous work has shown that a single operon can affect the competitiveness of different Bacteroides fragilis strains³⁴, we evaluated functional differences between different dominant strains. In both subjects, the different strains dominating during consecutive phases (period 2 in subject 4 and period 1 for subject 6; Fig. 6a) carried additional functions that might be crucial to B. longum's successful colonization of the host gut. In particular, the presence of the human milk oligosaccharide (HMO) utilization cluster has been shown to result from an adaptation to the human infant gut³⁵ (Fig. 6b). Some additional functions might underlie formation of a B. longum bloom, including the presence of the fructose and L-fucose utilization gene clusters (Fig. 6b). Together, these findings might explain why strains with these functions were associated with higher relative abundance of B. longum in the infant gut microbiome. We also observed functions specific to strains that were dominant in periods when B. longum was less abundant (periods 1 and 3 in subject 4 and period 2 in subject 6; Fig. 6a), most notably that the capsular polysaccharide biosynthesis genes were absent from dominant strains when B. longum was more abundant (Fig. 6b). Taken together, strain-level insights provided by ConStrains, combined with functional analyses, could offer candidate targets and hypotheses for future studies.

**Figure 6: Functional differences in *B. longum* strains at different time points during infant gut microbiome development.**

Discussion

We show that the ConStrains algorithm accurately predicts strain-level profiles in large cohorts of metagenomic samples, and that the inferred uniGcodes reconstruct strain phylogeny, within or across cohorts, allowing combined cohort studies. ConStrains is scalable and has minimal resource requirements. In contrast, other approaches^14,16,17 are largely dependent on prior knowledge of reference strain genomes, with subspecies resolution being directly dependent on the number of available reference strains per species. This greatly limits the application of such methods on real metagenomic data, because for most of the human microbiome species only one reference genome is available¹⁴. Current databases are quickly gaining in intraspecies genome representation, but are still far from saturating natural diversity. With just one genome per species, ConStrains can resolve natural diversity occurring within that species, and is therefore agnostic to unknown strains. Future improvements for strain-level analysis include identification of strains in the absence of any reference genomes. It is conceivable that combining ConStrains with de novo genome assembly from metagenomic data could be an appropriate candidate to overcome this hurdle.

ConStrains is particularly effective for obtaining insights that were previously overlooked using species-level findings (Supplementary Note 4 and Supplementary Figs. 10–12), and will thus enable new types of studies. As we showed with the B. longum example, combining strain-level profiles with reference genome–based gene coverage analysis can reveal candidate genes for understanding strain-specific beneficial effects and the functions that might contribute to successful colonization in the human gut. ConStrains could also identify strains or genes associated with disease and link variable genomic regions to individual strains, a major challenge in shotgun metagenomics. Strain-level profiles, together with appropriate metadata, could link reference-based or de novo–assembled genes with individual strains and help interpret unknown strain-specific functions. Our study of the infant gut development cohort captured HMO utilization cluster enrichment shifts in different development periods, which is particularly important because products of the HMO utilization cluster are essential for B. longum to exert its probiotic effects³⁶. Finally, strain phylogeny could be used across cohorts and add metagenomic means to test fundamental ecological hypotheses, including neutral theory and other adaptive and nonadaptive mechanisms for maintaining sympatric diversity among strains. Although we applied ConStrains to human microbiome datasets, it can also be applied to environmental samples to test fundamental hypotheses about the role of microbes in the environment that can only be addressed at the strain level.

Methods

ConStrains algorithm.

Extracting target species and informative SNPs. With raw reads from samples S₁, S₂, ..., S_n, ConStrains starts by profiling the input metagenomes using MetaPhlAn¹ (v1.7) with default settings, with the exception that alignment options are set to 'very-sensitive'; species with average coverage higher than a coverage cutoff (default: 10×) in at least one sample are selected for further strain analysis. For each of the selected species, the corresponding set of the universally conserved genes reported in ref. 1 is used as a database, and Bowtie2 (ref. 24) mapping with default settings is carried out to map reads against those reference genes. Only reads with proper pairing and orientation, no indels, >30 mapping quality, >90 length mapped (overhanging part at gene 5′ and 3′ ends is clipped off before calculation) and at least 95% nucleotide identity with the reference gene are used. These reads are then piled up onto their respective reference sequences using SAMtools²⁵, and the reference gene coverage is subsequently calculated on a per-base basis. To filter out genes with spurious mappings due to hypervariable regions or conserved universal motifs, sites with less than the default minimum coverage, as well as those outside of the 1.5 interquartile coverage range across the gene length, are masked. Any gene with more than 30% of its length masked is discarded from further analysis. SNPs are then counted across samples as those unmasked positions where the minor allele had at least two counts or more than 3% in relative abundance.
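The masking and SNP-calling rules in this paragraph can be sketched in a few lines. This is an illustrative reading, not the ConStrains source: the "1.5 interquartile coverage range" is interpreted here as the interval from Q1 − 1.5·IQR to Q3 + 1.5·IQR, the per-site minimum coverage of 10 is a hypothetical default, the minor allele is taken to be the second most frequent base, and all function names are made up for the example.

```python
from statistics import quantiles


def mask_sites(coverage, min_cov=10, iqr_factor=1.5):
    """Return one boolean per site, True where the site is masked.

    min_cov is a hypothetical default; the fence
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an assumed interpretation.
    """
    q1, _, q3 = quantiles(coverage, n=4)
    iqr = q3 - q1
    low, high = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    return [c < min_cov or c < low or c > high for c in coverage]


def gene_passes(mask, max_masked_fraction=0.30):
    """A gene is kept unless more than 30% of its length is masked."""
    return sum(mask) / len(mask) <= max_masked_fraction


def is_snp(base_counts, min_minor_count=2, min_minor_freq=0.03):
    """base_counts maps base -> read count at one unmasked site.

    A site is called a SNP if the minor (second most frequent) allele has
    at least two reads or more than 3% relative abundance, as in the text.
    """
    total = sum(base_counts.values())
    if total == 0:
        return False
    ranked = sorted(base_counts.values(), reverse=True)
    minor = ranked[1] if len(ranked) > 1 else 0
    return minor >= min_minor_count or minor / total > min_minor_freq


# Illustrative gene: 18 well-covered sites, one pile-up spike, one dropout
mask = mask_sites([30] * 18 + [300, 2])
print(sum(mask), "sites masked; gene kept:", gene_passes(mask))
```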

Strain typing by SNP-flow algorithm. With SNPs extracted, ConStrains first infers the strain composition and the corresponding SNP types using the 'SNP-flow' algorithm in a per-species, per-sample fashion. In this algorithm, all SNP sites are first hierarchically clustered by the Euclidean distance between the frequencies of different alleles, defined as

$$d(a, b) = \sqrt{\sum_{i=1}^{4} (a_i - b_i)^2}$$

where a and b are the frequency vectors of the four bases, sorted in descending order, of the respective SNPs. Clusters that contain fewer than 5% of all SNPs or fewer than ten SNPs are discarded. The centroid of each cluster is selected as its representative. These SNP cluster centroids (SCCs) are then ranked in descending order of weight, quantified as the number of SNPs they represent. Finally, a directed graph is constructed using these SCCs, in which nodes are the alleles in these SCCs, each assigned a 'capacity' defined by its allele frequency, and alleles from neighboring SCCs are connected by edges (Fig. 1e).
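The clustering and filtering steps just described can be sketched as follows. The linkage criterion and the distance cutoff are not given in the text, so this example assumes a naive centroid-linkage agglomeration with a hypothetical threshold of 0.1; only the Euclidean distance on sorted frequency vectors and the two cluster-size filters come from the description above.

```python
from math import dist


def sorted_freqs(counts):
    """Four base counts -> allele frequencies sorted in descending order."""
    total = sum(counts)
    return sorted((c / total for c in counts), reverse=True)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cluster_snps(vectors, threshold=0.1):
    """Naive agglomerative clustering (assumed centroid linkage).

    Repeatedly merges the two clusters with the closest centroids until
    no pair is within the (hypothetical) Euclidean threshold.
    """
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters


def snp_cluster_centroids(vectors, min_fraction=0.05, min_size=10, threshold=0.1):
    """Return (centroid, weight) pairs ranked by descending weight."""
    kept = [c for c in cluster_snps(vectors, threshold)
            if len(c) >= min_size and len(c) >= min_fraction * len(vectors)]
    kept.sort(key=len, reverse=True)
    return [(centroid(c), len(c)) for c in kept]


# Illustrative SNP sites: two well-supported frequency patterns and one stray site
sites = ([sorted_freqs([70, 30, 0, 0])] * 12
         + [sorted_freqs([50, 50, 0, 0])] * 10
         + [sorted_freqs([90, 10, 0, 0])])
sccs = snp_cluster_centroids(sites)
print([(round(c[0], 2), w) for c, w in sccs])
```

The quadratic search is kept for readability; a real implementation would use a dedicated hierarchical clustering routine.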

In the directed graph constructed in the previous step, nodes from the same SCC are denoted as a layer. With m layers in the graph, SNP-flow explores all possible combinations of paths from the first layer to the last. Every such path represents a strain genotype, and its relative abundance, c, is defined as the lowest node capacity among all nodes on the path. Once a path is visited, the capacity of every node on this path is reduced by the path's relative abundance c (Fig. 1e). This pathfinding and visiting step is repeated until all nodes' capacities are zero, and the visited paths constitute one combination. ConStrains exhausts all possible SNP-type (strain) combinations β = {β₁, β₂, ..., β_k} in each sample, with the ith sample's SNP type β_i = bⁱ₁bⁱ₂...bⁱ_h, where bⁱ_j is one of the four bases A, C, G and T, and the associated strain profile α_i = (αⁱ_1, αⁱ_2, ..., αⁱ_h) with

$$\sum_{j=1}^{h} \alpha^{i}_{j} = 1$$
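A single pass of the path-visiting step can be sketched as below. The text describes exploring all combinations of paths; this example produces only one combination, using a greedy rule (always follow the highest remaining capacity in each layer) that is an assumption made for brevity. The data layout and function name are illustrative.

```python
def greedy_snp_flow(layers, eps=1e-9):
    """Decompose layered allele frequencies into strain paths.

    layers: list of dicts mapping base -> allele frequency (the node
    'capacity'); each layer is assumed to sum to 1.
    Returns a list of (genotype string, relative abundance) pairs.
    """
    remaining = [dict(layer) for layer in layers]  # do not mutate the input
    paths = []
    while all(max(layer.values()) > eps for layer in remaining):
        # Greedy choice (assumption): take the largest remaining node per layer
        path = [max(layer, key=layer.get) for layer in remaining]
        # Path abundance is the lowest node capacity along the path
        c = min(layer[base] for layer, base in zip(remaining, path))
        # Visiting the path reduces every node on it by c
        for layer, base in zip(remaining, path):
            layer[base] -= c
        paths.append(("".join(path), c))
    return paths


# Illustrative three-layer graph consistent with a 70/30 two-strain mixture
layers = [{"A": 0.7, "G": 0.3}, {"C": 0.7, "T": 0.3}, {"G": 0.7, "A": 0.3}]
strains = greedy_snp_flow(layers)
print(strains)
```

Each iteration sets at least one node to exactly zero, so the loop ends after at most as many paths as there are nodes.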

For each sample, ConStrains picks the optimal combination that minimizes the fitting error, defined as the discrepancy between expected SNP frequencies and observed frequencies, ɛ, defined as:

where E_ij is the expected frequency of the ith base at the jth SNP locale and, similarly, O_ij is the observed frequency of the ith base at the jth SNP locale in the pileup of aligned reads from the corresponding sample. For instance, C is the second base (i = 2), and if we observe two Cs and eight As at the fifth SNP locale (j = 5) in the pileup of reads aligned against the reference, the frequency of C at that position is 0.2, that is, O₂₅ = 0.2. E_ij is inferred using α_i and β_i such that E_ij equals the summed relative abundance of all strains in the combination whose SNP type carries the ith base at the jth SNP locale.
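The observed and expected frequency tables can be illustrated with the worked example from the text (two Cs and eight As at the fifth locale). In this sketch the expected frequency of a base is taken as the summed abundance of the strains carrying it, and the fitting error is assumed to be the mean absolute difference between the two tables; the exact form of ɛ is not reproduced here, so that choice and all function names are assumptions.

```python
BASES = "ACGT"  # base order: A is i = 1, C is i = 2, and so on (0-based in code)


def observed_frequencies(pileup_counts):
    """pileup_counts: one dict (base -> read count) per SNP locale.

    Returns O with O[i][j] the frequency of base i at locale j.
    """
    table = [[0.0] * len(pileup_counts) for _ in BASES]
    for j, counts in enumerate(pileup_counts):
        total = sum(counts.values())
        for i, base in enumerate(BASES):
            table[i][j] = counts.get(base, 0) / total
    return table


def expected_frequencies(genotypes, abundances):
    """E[i][j]: summed abundance of strains carrying base i at locale j."""
    table = [[0.0] * len(genotypes[0]) for _ in BASES]
    for genotype, alpha in zip(genotypes, abundances):
        for j, base in enumerate(genotype):
            table[BASES.index(base)][j] += alpha
    return table


def fitting_error(expected, observed):
    """Assumed error form: mean absolute difference over bases and locales."""
    cells = len(expected) * len(expected[0])
    return sum(abs(e - o)
               for e_row, o_row in zip(expected, observed)
               for e, o in zip(e_row, o_row)) / cells


# Four invariant locales followed by the example locale with 8 As and 2 Cs
pileup = [{"A": 10}] * 4 + [{"A": 8, "C": 2}]
O = observed_frequencies(pileup)
E = expected_frequencies(["AAAAA", "AAAAC"], [0.8, 0.2])
print(O[1][4], fitting_error(E, O))
```

Here `O[1][4]` corresponds to O₂₅ in the text's one-based notation.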

Inferring strain compositions. To unify these optimal SNP types into cohort-wide strains, ConStrains next constructs a neighbor-joining tree of the SNP types from different samples based on sequence percentage identity, and utilizes an internal parameter, Δ_d, defined as the distance between the tree-cutting point and the leaves, to cut the tree. Rather than using a preset value, the algorithm cuts this tree using all possible Δ_d. Each internal node created by such a cut could be viewed as the representative of all the children nodes (SNP types) on the tree. In doing so, it identifies all possible k clusters defined by the structure of the tree of SNP types (Fig. 1f), which we refer to as candidate strains.

With the proposed k strains from the previous step, in each sample, we need to find a composition, α^* = (α^*₁, α^*₂, ..., α^*_k) with

$$\sum_{i=1}^{k} \alpha^{*}_{i} = 1$$

to minimize the discrepancy between expected SNP frequencies and observed frequencies, ɛ, as defined previously. This is carried out by a Metropolis-Hastings Monte Carlo method. ConStrains first generates a number of seeds (default: 1,000) uniformly at random on the (k − 1)-simplex. The top 50 seeds are then selected, and each such seed's vicinity on the (k − 1)-simplex is iteratively searched. In iteration t, a new point, α^t_ik, is picked within a radius of 0.01 of the previous point, α^{t − 1}_ik, and it is accepted as the new point with probability min(1, q(α^t_ik, α^{t − 1}_ik)), where q(α^t_ik, α^{t − 1}_ik) = ɛ(α^t_ik)/ɛ(α^{t − 1}_ik). The iteration is repeated until |1 − q(α^t_ik, α^{t − 1}_ik)| is smaller than 0.001 or the maximum number of iterations (10,000) is reached. The composition yielding the lowest ɛ is selected as the optimal α^*_ik. ConStrains repeats this step for all samples and all k, yielding solutions for each k, α^*_k = (α^*₁, α^*₂, ..., α^*_n), with corresponding error (Fig. 1g):
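The seeded search over the simplex described above can be sketched as follows. Several details are assumptions made for illustration: "top" seeds are taken to be those with the lowest error, the local move perturbs each coordinate by at most the radius and renormalizes (so it only approximates a 0.01-radius step), and the acceptance ratio is oriented as ɛ(previous)/ɛ(proposed) so that proposals with lower error are always accepted. The function names and the `error` callback are hypothetical.

```python
import random


def random_simplex_point(k, rng):
    """Uniform draw on the (k - 1)-simplex via normalized exponentials."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def propose(point, radius, rng):
    """Small random move, clipped at zero and renormalized to sum to 1."""
    moved = [max(0.0, p + rng.uniform(-radius, radius)) for p in point]
    total = sum(moved)
    return [m / total for m in moved] if total > 0 else list(point)


def mh_search(error, k, n_seeds=1000, n_top=50, radius=0.01,
              tol=0.001, max_iter=10000, seed=0):
    """Return (best composition, its error) for a user-supplied error function."""
    rng = random.Random(seed)
    seeds = sorted((random_simplex_point(k, rng) for _ in range(n_seeds)),
                   key=error)
    best, best_err = seeds[0], error(seeds[0])
    for current in seeds[:n_top]:
        cur_err = error(current)
        for _ in range(max_iter):
            cand = propose(current, radius, rng)
            cand_err = error(cand)
            if cand_err == 0:  # perfect fit, nothing left to improve
                return cand, 0.0
            # Assumed orientation: ratio > 1 when the proposal has lower error
            q = cur_err / cand_err
            if rng.random() < min(1.0, q):
                current, cur_err = cand, cand_err
                if cur_err < best_err:
                    best, best_err = current, cur_err
            if abs(1 - q) < tol:  # stopping rule from the text
                break
    return best, best_err


# Illustrative error: L1 distance to a made-up target composition
target = [0.6, 0.3, 0.1]
l1_error = lambda a: sum(abs(x - y) for x, y in zip(a, target))
best, err = mh_search(l1_error, k=3, n_seeds=200, n_top=5, max_iter=500)
print([round(x, 3) for x in best], round(err, 4))
```

The smaller `n_seeds`, `n_top` and `max_iter` values in the usage lines only keep the example fast; the defaults in the signature mirror the numbers quoted in the text.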

Selecting the optimal strain model. The corrected Akaike information criterion (AICc) is employed to select the optimal k. The AICc of each k is calculated as:

where L = 1 − ɛ_k denotes the model likelihood. ConStrains selects the k with the lowest AICc and outputs the associated SNP types and compositions as final results (Fig. 1h). As noted previously, we suggest filtering strains with less than 0.1% in relative abundance as they present a high probability of being chimeric.
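Model selection by AICc can be sketched with the textbook form of the criterion, AICc = 2p − 2 ln L + 2p(p + 1)/(n − p − 1). This is an assumption: the sketch takes the number of parameters p to be the number of strains k and leaves the sample size n as a caller-supplied value, since neither is pinned down here; only L = 1 − ɛ_k comes from the text. Function names and the error values are illustrative.

```python
from math import log


def aicc(n_params, likelihood, n_obs):
    """Textbook corrected AIC; requires n_obs > n_params + 1."""
    aic = 2 * n_params - 2 * log(likelihood)
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


def select_k(errors_by_k, n_obs):
    """errors_by_k maps k -> fitting error in [0, 1); returns (best k, scores).

    Uses L = 1 - error as the likelihood, as in the text, and assumes the
    number of free parameters equals k.
    """
    scores = {k: aicc(k, 1 - eps, n_obs)
              for k, eps in errors_by_k.items()
              if n_obs - k - 1 > 0}
    return min(scores, key=scores.get), scores


# Made-up errors: one strain fits poorly, two fit well, three barely improve
best_k, scores = select_k({1: 0.95, 2: 0.05, 3: 0.04}, n_obs=50)
print(best_k, {k: round(v, 3) for k, v in scores.items()})
```

In this toy case the drop in error from k = 1 to k = 2 outweighs the complexity penalty, while the marginal gain at k = 3 does not.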

In silico datasets.

To simulate in silico single-species datasets, 62 complete E. coli genome sequences were downloaded from an NCBI database. Ten genomes were selected, and their relatedness was shown by a maximum likelihood tree (Supplementary Fig. 3a) constructed from concatenated nucleotide sequences of core genes among the 10 strains using a method similar to ref. 19. For each number of strains (N = 2–7), 1,000 random compositions were sampled from a gamma distribution with k = 1 and θ = 0.5. In each set of these 1,000 compositions, Shannon entropy was calculated, and the compositions were ranked by it. The compositions at the 15th, 30th, ..., 90th percentiles were picked to form a gradient of intraspecific diversity for each N. ART simulator³⁷ was employed to simulate 100× coverage of 100-bp paired-end Illumina reads using these compositions with default settings for Illumina and library settings as “-m 350 -s 50” (Supplementary Fig. 3a). These samples were further grouped together to simulate single strain series samples (Supplementary Table 1).
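The composition sampling described above can be sketched as follows. The sketch assumes that each composition is obtained by drawing one gamma variate per strain (shape k = 1, scale θ = 0.5) and normalizing the draws to sum to 1, and that entropy is computed with natural logarithms; both points, along with the function names and the percentile indexing, are assumptions.

```python
import random
from math import log


def random_composition(n_strains, rng, shape=1.0, scale=0.5):
    """Gamma draws normalized to a relative-abundance vector (assumed scheme)."""
    draws = [rng.gammavariate(shape, scale) for _ in range(n_strains)]
    total = sum(draws)
    return [d / total for d in draws]


def shannon_entropy(p):
    """Shannon entropy in nats; zero-abundance entries contribute nothing."""
    return -sum(x * log(x) for x in p if x > 0)


def diversity_gradient(n_strains, n_samples=1000,
                       percentiles=range(15, 91, 15), seed=0):
    """Rank sampled compositions by entropy and pick the given percentiles."""
    rng = random.Random(seed)
    comps = sorted((random_composition(n_strains, rng) for _ in range(n_samples)),
                   key=shannon_entropy)
    return [comps[min(n_samples - 1, n_samples * p // 100)] for p in percentiles]


# One gradient per strain number N = 2..7
gradients = {n: diversity_gradient(n, seed=n) for n in range(2, 8)}
print(sum(len(g) for g in gradients.values()), "compositions in total")
```

Six percentiles for each of six strain numbers give 36 compositions, which matches the 36 mixture sets reported in Results.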

These simulated E. coli reads were then spiked into in silico–constructed metagenomes to measure the impact of other species. Three human microbiome-like metagenomes with low, medium and high complexity levels (referred to as LC, MC and HC, respectively) were simulated based on an aggregated MetaPhlAn¹ profile over all 690 Human Microbiome Project (HMP) samples³⁸. E. coli and Shigella were excluded from the profile, and the rest of the species were ranked based on their average abundance in the HMP cohort. The top 20, 50 and 100 most abundant species were selected for LC, MC and HC, respectively. The species composition in each in silico metagenome was calculated as their relative abundance in the HMP cohort, normalized by their total sum. Genomes of these species were downloaded from NCBI, and a representative strain was selected at random if multiple strains of the same species were present. A total of 100 million 100-bp paired-end Illumina reads were simulated for each set by ART simulator³⁷ with the same settings as mentioned previously. Additional datasets for testing the sensitivity and the performance on different numbers of strains and recombined strains were generated in a similar fashion using ART (Supplementary Note 1).

The year-long shotgun metagenome cohort with 322 samples was simulated based on donor A's 16S rRNA amplicon profiles reported in ref. 29. The operational taxonomic unit (OTU) table was used as a guide for community composition in human microbiomes. To allow simulation at the strain level, however, taxonomy in the OTU table was shifted down by one level. For instance, the species composition in the original OTU table was used as the strain composition. NCBI draft and complete genomes were used to match as closely as possible the phylogeny of the original OTUs. Reads were then simulated with the ART simulator as previously described. The coverage was set to 1× per 25 read counts in the 16S OTU table.
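
The coverage rule above is simple arithmetic, sketched here for concreteness. The function names are hypothetical, and the conversion from coverage to read pairs assumes 100-bp paired-end reads as in the earlier simulations.

```python
def coverage_from_otu_counts(otu_counts, reads_per_fold=25):
    """Map 16S OTU read counts to genome coverage: 1x per 25 read counts."""
    return {name: count / reads_per_fold for name, count in otu_counts.items()}


def read_pairs_needed(coverage, genome_size_bp, read_length_bp=100):
    """Number of read pairs giving the requested coverage of one genome."""
    return round(coverage * genome_size_bp / (2 * read_length_bp))


if __name__ == "__main__":
    # Hypothetical OTU counts for one sample.
    cov = coverage_from_otu_counts({"strain_a": 250, "strain_b": 50})
    print(cov)  # strain_a at 10x, strain_b at 2x
    print(read_pairs_needed(cov["strain_a"], genome_size_bp=5_000_000))
```

Under this rule, a strain needs at least 250 read counts in the OTU table to reach 10× coverage in the simulated metagenome.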

Biological datasets.

The two longitudinal metagenomic datasets of infant gut development used in this study were from a previous study⁴ and from our recent effort tracking nine subjects over a three-year period from birth. For the former set, all metagenomic samples were downloaded from NCBI SRA under accession number SRA052203, and the corresponding assembled S. epidermidis strains and phage genomes were downloaded from ggKBase as described⁴. For the latter set, 54 stool samples were collected from nine infant subjects between September 2008 and August 2010 in Finland. Samples were first collected by the subjects' parents and stored in the household freezer before being transferred on dry ice to a laboratory −80 °C freezer. Samples were then shipped to the Broad Institute for DNA extraction, for which the QIAamp DNA Stool Mini Kit (Qiagen, Inc.) was used as described previously³⁹. Library construction was carried out following the Human Microbiome Project's standard protocol (http://hmpdacc.org/resources/tools_protocols.php), and 101-bp paired-end reads were produced on an Illumina HiSeq 2000 platform. The raw sequences of these samples are available at SRA under BioProject accession number PRJNA269305, and the corresponding sample information is available in Supplementary Table 5.

Prediction accuracy measurement.

To measure how close the predicted composition, P, is to the true composition, Q, we applied the Jensen-Shannon divergence with minor modifications. As P and Q may be of different dimensions, we first padded the one with the lower dimension with zeros to match the one with the higher dimension, and then defined a composition M based on the sorted P and Q, denoted P′ and Q′, as:

\[ M = \frac{1}{2}\,(P' + Q') \]

Therefore the Jensen-Shannon divergence is:

\[ \mathrm{JSD}(P' \,\|\, Q') = \frac{1}{2}\, D(P' \,\|\, M) + \frac{1}{2}\, D(Q' \,\|\, M) \]

where D(X||Y) is the Kullback-Leibler divergence defined as:

\[ D(X \,\|\, Y) = \sum_i X(i) \log \frac{X(i)}{Y(i)} \]
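
The padded, sorted Jensen-Shannon divergence described in this section can be sketched as below. This is an illustration under assumptions rather than the published implementation: the function names are hypothetical, the natural logarithm is used, and both vectors are sorted in descending order (any consistent order gives the same value).

```python
import math


def kl_divergence(x, y):
    """Kullback-Leibler divergence D(X||Y); terms with X(i) = 0 contribute 0."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)


def modified_jsd(p, q):
    """Jensen-Shannon divergence between two compositions of any lengths.

    The shorter vector is padded with zeros, both are sorted, and the
    standard JSD is computed against their midpoint M.
    """
    size = max(len(p), len(q))
    p_sorted = sorted(list(p) + [0.0] * (size - len(p)), reverse=True)
    q_sorted = sorted(list(q) + [0.0] * (size - len(q)), reverse=True)
    m = [(a + b) / 2 for a, b in zip(p_sorted, q_sorted)]
    return 0.5 * kl_divergence(p_sorted, m) + 0.5 * kl_divergence(q_sorted, m)


if __name__ == "__main__":
    predicted = [0.6, 0.3, 0.1]   # three strains predicted
    truth = [0.25, 0.75]          # two strains truly present
    print(modified_jsd(predicted, truth))
```

Because both vectors are sorted before comparison, the measure scores how well the shape of the abundance distribution is recovered and does not depend on which predicted strain is matched to which true strain.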

We calculated the SNP typing accuracy as the distance between the inferred SNP tree of strains, T_p, and the true strain tree constructed from concatenated core genes, T_q. First, a distance similar to the symmetric difference introduced by Robinson and Foulds⁴⁰ was applied to calculate the distance, d, between these two trees. We then normalized d to the expected basal distance from a random tree with the same leaves. This expected basal distance is the mean distance between T_q and 1,000 randomly generated trees with the same leaves.
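
A normalized tree distance of this kind can be sketched as follows. Several simplifications are assumed: trees are nested tuples of leaf names over identical leaf sets, the plain unrooted Robinson-Foulds symmetric difference stands in for the "similar" distance used in the paper, and random trees are built by repeatedly joining two randomly chosen subtrees, which is only one of several possible random-tree models. All function names are hypothetical.

```python
import random
from statistics import mean


def leaf_set(tree):
    """All leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree, each stored as the
    side that does not contain a fixed anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if anchor not in below else all_leaves - below
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds symmetric difference between two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def random_tree(leaves, rng):
    """Random binary tree made by joining two random subtrees at a time."""
    nodes = list(leaves)
    while len(nodes) > 1:
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((a, b))
    return nodes[0]


def normalized_rf(t_p, t_q, n_random=1000, seed=0):
    """d(T_p, T_q) divided by the mean distance from T_q to random trees."""
    rng = random.Random(seed)
    leaves = sorted(leaf_set(t_q))
    baseline = mean(rf_distance(random_tree(leaves, rng), t_q)
                    for _ in range(n_random))
    return rf_distance(t_p, t_q) / baseline if baseline else float("nan")


if __name__ == "__main__":
    true_tree = ((("A", "B"), "C"), ("D", "E"))
    inferred = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(inferred, true_tree), normalized_rf(inferred, true_tree))
```

A normalized value near 0 means the inferred tree matches the true topology, whereas a value near 1 means it is no closer to the truth than a random tree on the same leaves.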

Accession codes

Accessions

BioProject

PRJNA269305

Sequence Read Archive

SRA052203

References

1. Segata, N. et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9, 811–814 (2012).
2. Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).
3. Darling, A.E. et al. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2, e243 (2014).
4. Sharon, I. et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120 (2013).
5. Nielsen, H.B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
6. Imelfort, M. et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2, e603 (2014).
7. Luo, C. et al. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc. Natl. Acad. Sci. USA 108, 7200–7205 (2011).
8. Kashtan, N. et al. Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344, 416–420 (2014).
9. Faith, J.J. et al. The long-term stability of the human gut microbiota. Science 341, 1237439 (2013).
10. Maslunka, C., Gifford, B., Tucci, J., Gurtler, V. & Seviour, R.J. Insertions or deletions (Indels) in the rrn 16S–23S rRNA gene internal transcribed spacer region (ITS) compromise the typing and identification of strains within the Acinetobacter calcoaceticus-baumannii (Acb) complex and closely related members. PLoS ONE 9, e105390 (2014).
11. Han, D. et al. Population structure of clinical Vibrio parahaemolyticus from 17 coastal countries, determined through multilocus sequence analysis. PLoS ONE 9, e107371 (2014).
12. Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013).
13. Beitel, C.W. et al. Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2, e415 (2014).
14. Greenblum, S., Carr, R. & Borenstein, E. Extensive strain-level copy-number variation across human gut microbiome species. Cell 160, 583–594 (2015).
15. Karlsson, E. et al. Eight new genomes and synthetic controls increase the accessibility of rapid melt-MAMA SNP typing of Coxiella burnetii. PLoS ONE 9, e85417 (2014).
16. Hong, C. et al. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome 2, 33 (2014).
17. Ahn, T.H., Chai, J. & Pan, C. Sigma: strain-level inference of genomes from metagenomic analysis for biosurveillance. Bioinformatics 31, 170–177 (2015).
18. Miller, J.R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327 (2010).
19. Luo, C., Tsementzi, D., Kyrpides, N.C. & Konstantinidis, K.T. Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 6, 898–901 (2012).
20. Nijkamp, J.F., Pop, M., Reinders, M.J. & de Ridder, D. Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 29, 2826–2834 (2013).
21. Lasken, R.S. & McLean, J.S. Recent advances in genomic DNA sequencing of microbial species from single cells. Nat. Rev. Genet. 15, 577–584 (2014).
22. Ivanova, N. et al. Genome sequence of Bacillus cereus and comparative analysis with Bacillus anthracis. Nature 423, 87–91 (2003).
23. Segata, N., Bornigen, D., Morgan, X.C. & Huttenhower, C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat. Commun. 4, 2304 (2013).
24. Langmead, B. & Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
25. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
26. Eren, A.M. et al. Oligotyping: Differentiating between closely related microbial taxa using 16S rRNA gene data. Methods Ecol. Evol. 4, 1111–1119 (2013).
27. Eren, A.M. et al. Minimum entropy decomposition: unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. ISME J. 9, 968–979 (2014).
28. Nandi, T. et al. Burkholderia pseudomallei sequencing identifies genomic clades with distinct recombination, accessory, and epigenetic profiles. Genome Res. 25, 129–141 (2015).
29. David, L.A. et al. Host lifestyle affects human microbiota on daily timescales. Genome Biol. 15, R89 (2014).
30. Lieberman, T.D. et al. Genetic variation of a bacterial pathogen within individuals with cystic fibrosis provides a record of selective pressures. Nat. Genet. 46, 82–87 (2014).
31. Sokol, H. et al. Faecalibacterium prausnitzii is an anti-inflammatory commensal bacterium identified by gut microbiota analysis of Crohn disease patients. Proc. Natl. Acad. Sci. USA 105, 16731–16736 (2008).
32. Crost, E.H. et al. Utilisation of mucin glycans by the human gut symbiont Ruminococcus gnavus is strain-dependent. PLoS ONE 8, e76341 (2013).
33. Di Gioia, D., Aloisio, I., Mazzola, G. & Biavati, B. Bifidobacteria: their impact on gut microbiota composition and their applications as probiotics in infants. Appl. Microbiol. Biotechnol. 98, 563–577 (2014).
34. Lee, S.M. et al. Bacterial colonization factors control specificity and stability of the gut microbiota. Nature 501, 426–429 (2013).
35. Schell, M.A. et al. The genome sequence of Bifidobacterium longum reflects its adaptation to the human gastrointestinal tract. Proc. Natl. Acad. Sci. USA 99, 14422–14427 (2002).
36. Sela, D.A. et al. The genome sequence of Bifidobacterium longum subsp. infantis reveals adaptations for milk utilization within the infant microbiome. Proc. Natl. Acad. Sci. USA 105, 18964–18969 (2008).
37. Huang, W., Li, L., Myers, J.R. & Marth, G.T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
38. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
39. Morgan, X.C. et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13, R79 (2012).
40. Robinson, D.F. & Foulds, L.R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).

Acknowledgements

We thank N. Nedelsky for editorial support. This work was supported in part by the Crohn's and Colitis Foundation of America, the Leona M. and Harry B. Helmsley Charitable Trust, US National Institutes of Health grants U54 DK102557 (R.J.X.) and R01 DK092405 (R.J.X.), the Howard Hughes Medical Institute (R.K.) and the Finnish Centre of Excellence in Molecular Systems Immunology and Physiology Research 2012-17 (Academy of Finland, Decision No. 250114; M.K.).

Author information

Rob Knight & Dirk Gevers
Present addresses: Departments of Pediatrics and Computer Science and Engineering, University of California San Diego, La Jolla, California, USA (R.K.) and Janssen Human Microbiome Institute, Janssen Research and Development, Cambridge, Massachusetts, USA (D.G.).

Authors and Affiliations

Broad Institute of Massachusetts Institute of Technology (MIT) and Harvard, Cambridge, Massachusetts, USA
Chengwei Luo, Ramnik J Xavier & Dirk Gevers
Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
Chengwei Luo & Ramnik J Xavier
Center for Computational and Integrative Biology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
Chengwei Luo & Ramnik J Xavier
Department of Chemistry and Biochemistry, University of Colorado at Boulder, Boulder, Colorado, USA
Rob Knight
Howard Hughes Medical Institute, Boulder, Colorado, USA
Rob Knight
Children's Hospital, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
Heli Siljander & Mikael Knip
Research Programs Unit, Diabetes and Obesity, University of Helsinki, Helsinki, Finland
Heli Siljander & Mikael Knip
Folkhälsan Research Center, Helsinki, Finland
Mikael Knip
Department of Pediatrics, Tampere University Hospital, Tampere, Finland
Mikael Knip
Center for Microbiome Informatics and Therapeutics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Ramnik J Xavier

Contributions

C.L. and D.G. conceived the project, C.L. designed and implemented the algorithm, C.L., D.G. and R.J.X. designed the experiments, and C.L. performed the analysis. M.K., H.S., D.G. and R.J.X. collected and sequenced the samples. C.L., R.K., R.J.X. and D.G. wrote the paper.

Corresponding authors

Correspondence to Ramnik J Xavier or Dirk Gevers.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Minimum requirements for sequencing depth and species’ relative abundance for ConStrains analysis based on 10x coverage cutoff

Gray lines mark the boundaries for the associated genome size; the qualified space is to the upper right of each boundary. The boundaries are calculated as the sequencing throughput, S (in Gbp), needed for 10-fold coverage of a species with relative abundance r% and average genome size g Mbp: S = g/r (that is, 10 × g Mbp divided by r/100 gives 1,000 × g/r Mbp, or g/r Gbp). Note that sequencing depth across the whole genome is correlated with the depth of the marker genes, so the larger the genome, the more sequencing is required to reach 10x coverage of these marker genes (and of the overall genome) when other factors, including relative abundance, are held constant.

Supplementary Figure 2 ConStrains’ intra-species phylogenetic sensitivity in detecting strains.

(a) A neighbor-joining (NJ) tree is shown based on average nucleotide identity (ANI) of the 62 complete E. coli genomes from NCBI used in assessing ConStrains’ sensitivity towards genome relatedness. (b) The ANI distribution of all genome pairs among the 62 genomes (n = C(62, 2) = 1,891) covers the typical intra-species ANI range, 97%-100%. Representative 62 genome pairs were randomly selected to synthesize Illumina reads of strain mixtures as described in Online Methods. For simplicity, all in silico-synthesized Illumina read sets followed a 3:1 strain mixing ratio. (c) Although the number of informative SNP sites and the genome relatedness are strongly negatively correlated, ConStrains captured sufficient (>10) informative SNPs in 99.5% of all sets. (d) Among the 99.5% of sets that were analyzable, ConStrains robustly recovered strain profiles with high precision (low JSD error rate). (e) At each divergence level (highlighted by a red solid line along the tree shown in a), we show the three E. coli strains drawn to represent the branch (branches identified by blue circles, selected strains identified by red circles next to strain names in a) along with the corresponding predicted composition and JSD errors in comparison to the true composition.

Supplementary Figure 3 In silico test sets simulated by mixing ART-generated Illumina paired-end reads of E. coli strains at different depths with corresponding ConStrains performance.

(a) Unrooted tree on the left was constructed from concatenated nucleotide sequences of core genes shared between these ten E. coli strains; the heatmap on the right shows the coverage of each strain (rows) in each data set (columns), separated by the number of strains it contains, and ranked by its diversity measured by Shannon entropy (top panel). (b) Corresponding ConStrains performance is shown; graphs show the same data as Figure 2b with different y-axis values to increase readability. No significant differences in accuracy were observed between +LC, +MC, or +HC and pure predictions (two-tailed P = 0.536, 0.319, and 0.48 for +LC, +MC, and +HC, respectively, paired t-test; Supplementary Note 2).

Supplementary Figure 4 Comparisons between the predictions by ConStrains and Sigma over in silico E. coli data sets.

The grid represents the same strain admixtures as described in Figure 2. ConStrains’ predictions were based on a single E. coli reference genome (1-Ref.). Sigma’s predictions are shown based on one reference genome (1-Ref.), three reference genomes (3-Ref.), and all ten E. coli genomes used to simulate the admixtures (10-Ref.).

Supplementary Figure 5 Prediction of strain composition according to the number of core gene SNPs recombined between strains

Numbers next to each circle mark the exact number of SNPs recombined in silico.

Supplementary Figure 6 The predicted Staphylococcus epidermidis strain profile from time-series metagenomes reported by Sharon et al., with comparison to the original relative abundance of strains inferred from de novo-assembled strain genome scaffolds with manual curation.

N/A.

Supplementary Figure 7 Factors that affect the accuracy of ConStrains analysis in the year-long simulation.

(a) The green reference strain is significantly more divergent than the other four reference strains (one-tailed P = 3.5 × 10^-12, t-test). (b-d) Relationships between prediction accuracy measured by JSD (y-axis of panels b-c) and the complexity of samples (Shannon index, x-axis; panel b) and the coverage of samples (x-axis; panel c). In c, the 10x cutoff for coverage is marked by the vertical dashed green line. A significant (P = 0.452, Pearson’s correlation) negative correlation between JSD and sample Shannon index was observed (b and d).

Supplementary Figure 8 Overview of time-series metagenomes from the infant gut development study.

Nine subjects (horizontal lines) were chosen and the associated sampling points (solid circles) are sorted along the longitudinal axis (with time on the x-axis).

Supplementary Figure 9 Overview of strains inferred by ConStrains for the infant gut microbiome development cohort.

(a) The uncollapsed tree shows strain-level resolution of the same tree shown in Figure 5a. (b) The average number of strains per species per subject is shown.

Supplementary Figure 10 Examples of potential linkage (FDR < 0.5) between intra-specific alpha diversity and the corresponding species’ relative abundance in the community.

N/A

Supplementary Figure 11 ConStrains analysis reveals associations only observable at the strain level in the infant gut development cohort.

For each subject, significant association networks between strains were identified, whereas no significant species-level associations (FDR < 0.05) were observed. In the network diagrams, each node represents a species, and the edges connecting nodes represent strains of the corresponding species that were significantly associated. The width of each edge is proportional to the number of pairs of associated strains, and the edge color indicates the direction of the association. Below each network is the corresponding heatmap, whose upper right triangle represents the number of positively associated strains and whose lower left triangle represents the number of negatively associated strains. The blue dashed oval captures the relationship between R. torques and S. wadsworthensis detailed in Supplementary Figure 12.

Supplementary Figure 12 Example association between R. torques and S. wadsworthensis in subject 8, as highlighted in Supplementary Figure 11.

At strain level (left panels), a significant positive association was found between the two highlighted strains of the two species. In contrast, no significant associations were found at the species level (right panels).

About this article

Cite this article

Luo, C., Knight, R., Siljander, H. et al. ConStrains identifies microbial strains in metagenomic datasets. Nat Biotechnol 33, 1045–1052 (2015). https://doi.org/10.1038/nbt.3319

Received: 10 December 2014
Accepted: 16 July 2015
Published: 07 September 2015
Issue Date: October 2015
DOI: https://doi.org/10.1038/nbt.3319
