Main

Understanding how individual organisms coexist within a microbial community is crucial to understanding community functions. For example, the study of microbial community dynamics is important in human health, including how to maintain or restore a healthy human microbiome. Metagenomics has revolutionized microbiology by addressing some of these issues in a culture-independent manner. However, state-of-the-art metagenomics approaches are often limited to the species level1,2,3 or to partially assembled population consensus genomes4,5,6. Evidence that the unit of microbial action can fall below the species level comes from multiple sources, including culturing7, single-cell genomics8, redundant sequencing of the bacterial gene encoding 16S rRNA9, sequencing of internal transcribed spacers10, multilocus sequence typing11 and high-resolution analyses of genomic variation12. Therefore methods that enable strain resolution from metagenomics datasets are desirable.

Most existing culture-free approaches to identify bacterial strains in communities have drawbacks, which has limited wide adoption of these approaches. For example, single-cell sequencing requires expensive and laborious efforts in cell sorting and suspension, and thus this approach is not used to analyze large communities. Similarly, Hi-C, a sequencing-based approach13, requires extra steps and budget for cross-linking, library construction and sequencing. Strain typing methods that leverage strain-level gene copy number variations14 or strain-level phylogenetic marker SNPs such as canSNPs15, PathoScope16 and Sigma17 rely on the availability of complete reference strain genomes and, with current limitations on these resources, face challenges in studies of the broader diversity found using metagenomic sequencing approaches. An assembly-based approach is dependent on several factors, including genome structure and intraspecies divergence. With rare exceptions, assemblers usually fail to produce individual strain assemblies, instead creating either highly fragmented contigs or contigs that only represent population consensus sequences18,19; a recent effort in using variation-aware contig graphs for strain identification20 relies on manual inspection, and hence its accuracy is subject to users' experience. In all of these approaches, only a relatively small fraction of strain genomes have been successfully analyzed, and their distribution is usually biased21. On the other hand, methods based on single marker genes such as the gene encoding 16S rRNA often lack the resolution to reliably capture intraspecific genomic differences22.

To overcome this difficulty and increase the utility of metagenomic datasets, we developed Conspecific Strains (ConStrains), an algorithm that exploits the polymorphism patterns in a set of universal bacterial and archaeal genes to infer strain-level structures in species populations. Using both in silico and previously published host-derived datasets, we found that ConStrains recovers intraspecific strain profiles and phylogeny with high accuracy, and captures important features of community dynamics including dominant strain switches and rare strains. The simulated datasets address performance in the context of different within-population diversities, different numbers of strains, the interference from other species within the same community, as well as the scalability of the method using a large in silico cohort with 322 samples. Predicted within-species structures as well as the strain genotypes were highly accurate across these simulated datasets. Applying this method to an infant gut development metagenomic dataset revealed new insights into strain dynamics with functional relevance. ConStrains is implemented in Python; the source code is provided as Supplementary Code and is freely available, together with full documentation, at https://bitbucket.org/luo-chengwei/constrains.

Results

The ConStrains algorithm

Guided by reference species, the ConStrains algorithm compares raw metagenomic reads to reference genomes and identifies patterns in SNPs as the basis for differentiation and quantification of conspecific strains. This approach is fundamentally different from other reference-dependent methods such as Sigma and PathoScope16,17 that rely on availability of a comprehensive reference strain collection because, unlike those methods, ConStrains can provide reliable predictions for species with only one genome (complete or draft). For confident SNP calling, a species requires a minimum of tenfold coverage (Supplementary Fig. 1) within or across all samples considered, which is obtained for all species with a relative abundance of >1% at typical sequencing depths of 5 Gbp. When applied to multiple samples, for example, a longitudinal time series or otherwise related samples, strain identities can be traced across the different samples. The algorithm achieves this in two operations: (i) identifying species for which SNPs are detected and quantified, and (ii) transforming individual SNPs into SNP profiles that represent individual strains.
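
As a rough illustration of the tenfold-coverage requirement, the expected coverage of a species can be estimated from the sequencing depth, the species' relative abundance and its genome size. The sketch below is only a back-of-the-envelope calculation: the function names are hypothetical, the 5 Mbp genome size is an assumed illustrative value that does not come from the text, and relative abundance is treated as the fraction of sequenced bases belonging to the species.

```python
def expected_coverage(total_bases, rel_abundance, genome_size):
    """Average fold coverage of one species' genome in a metagenome.

    Assumes rel_abundance is the fraction of sequenced bases that
    originate from the species (a simplification).
    """
    return total_bases * rel_abundance / genome_size


def passes_cutoff(total_bases, rel_abundance, genome_size, min_coverage=10.0):
    """True if the species reaches the coverage cutoff (default: tenfold)."""
    return expected_coverage(total_bases, rel_abundance, genome_size) >= min_coverage


# 5 Gbp of sequence, 1% relative abundance, assumed 5 Mbp genome
coverage = expected_coverage(5e9, 0.01, 5e6)
print(f"expected coverage: {coverage:.1f}x")
```

Under these assumptions, a species at 1% relative abundance sits right at the tenfold cutoff, whereas a species at 0.1% would need roughly ten times more sequence to be profiled.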

The first operation is a two-step process. Because the algorithm identifies strains only for those species with sufficient sequencing depth (≥10-fold coverage in at least one sample; Supplementary Fig. 1), the first step uses MetaPhlAn1 for rapid species composition profiling. For those species with sufficient sequencing depth, a custom database of marker genes is created from the comprehensive PhyloPhlAn marker set23, against which the raw reads are mapped using Bowtie2 (ref. 24). This targeted approach allows for optimized time and computational efficiency. Resulting marker gene alignments are processed with SAMtools25 to generate a table of coverage by base position from which SNPs are identified. It is important to note that in this process the reference sequences are removed, and SNPs are identified de novo to minimize reference dependency (Fig. 1a–d and Online Methods). We verified that such a SNP selection procedure is sufficiently accurate and uniquely sensitive to disentangle intraspecific diversity (Supplementary Note 1 and Supplementary Fig. 2).

Figure 1: Overview of the ConStrains algorithm: from raw metagenomic data to strain profiles and uniGcodes.

(a) ConStrains requires raw metagenomic reads from a single or series of metagenomic samples as input. (b) To select species that satisfy a predefined sequencing depth cutoff, the algorithm starts by determining the species composition with MetaPhlAn1. (c) Next, Bowtie2 (ref. 24) is used to recruit all reads to a reference database of species-specific marker genes23. (d) SNPs are called on these recruited reads after quality filtering, removal of reference gene sequence and reference-free read realignment. (e) Resulting SNPs are used by an SNP-flow algorithm to infer all possible SNP types for each of the samples. (f) Such SNP types across samples are clustered using a tree structure based on their distances to represent candidate strain models; the internal distance cutoff, Δd, is varied to exhaust all possible SNP-type clusterings. (g) The Metropolis-Hastings Monte-Carlo method is then carried out to infer relative abundances per sample and per species for every candidate strain model. (h) These models are then evaluated by the corrected Akaike information criterion (AICc), and the model with the minimum AICc is selected as the optimal model. (i) Finally, the associated strains' relative abundances across samples and their uniGcodes are generated for every species.

In the second operation, individual SNPs are combined into unique SNP profiles from which strains are identified. Previous methods have approached the challenge of identifying individual organisms from microbial communities using SNPs (for example, oligotyping26 and minimum entropy decomposition27), but were limited to SNPs within the span of a sequence read length. ConStrains overcomes this read-length limitation and represents each strain by a barcode-like string of concatenated SNPs spanning hundreds of genes, referred to as the 'uniGcode'. To derive the strains' uniGcodes in a dataset, ConStrains constructs candidate models of strain combinations using a combination of SNP-flow and SNP-type clustering algorithms. The relative abundance of strains in each candidate model across the cohort is estimated sequentially using a Metropolis-Hastings Markov chain Monte-Carlo approach (Fig. 1e–g and Online Methods). Finally, to choose the optimal model by balancing model fit against complexity, the corrected Akaike information criterion (AICc) is used (Fig. 1h and Online Methods). ConStrains repeats these steps for each species with sufficient coverage, then outputs the number of strains and their respective uniGcodes and relative abundances (Fig. 1i). The uniGcode allows downstream analysis such as cross-sample comparisons and evolutionary studies.
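
The model-selection step can be illustrated with the standard definition of AICc, which adds a small-sample correction to AIC. This is a minimal sketch only: how ConStrains defines the likelihood, the number of parameters k and the number of observations n is not stated in this section, so the function names, the candidate tuples and the numbers below are hypothetical.

```python
def aicc(log_likelihood, k, n):
    """Corrected Akaike information criterion (standard textbook form).

    k: number of fitted parameters; n: number of observations.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def select_model(candidates, n):
    """Return the candidate (name, log_likelihood, k) with the lowest AICc."""
    return min(candidates, key=lambda c: aicc(c[1], c[2], n))


# Hypothetical candidate strain models: (label, log-likelihood, parameters)
candidates = [
    ("2 strains", -120.0, 2),
    ("3 strains", -100.0, 3),
    ("6 strains", -99.0, 6),
]
best = select_model(candidates, n=50)
print(best[0])
```

In this toy example the six-strain model fits slightly better than the three-strain model, but its extra parameters are penalized, so the three-strain model has the lowest AICc.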

ConStrains identifies strains in large datasets

To validate the performance of ConStrains for strain profiling, we used in silico and host-derived datasets. We generated 36 different sets of k-strain mixtures using in silico genome-based Illumina paired-end read simulation based on ten different Escherichia coli strains whose complete genomes are publicly available, representing real-life scenarios of strain admixtures (k = 2–7; Fig. 2a,b, Supplementary Fig. 3a and Supplementary Table 1). We profiled these 36 sets of reads by ConStrains using default settings, and compared the predicted results with the 'true' strain compositions using Jensen-Shannon divergence (JSD; Fig. 2b and Supplementary Fig. 3b). ConStrains successfully predicted the underlying intraspecies compositions in all 36 datasets (P < 1 × 10−5; two group t-test against random guesses; Fig. 2b), demonstrating a substantial advantage (Supplementary Fig. 4) over reference-based approaches, with an improvement of 0.191 JSD on average (Supplementary Note 1 and Supplementary Fig. 5). In 34 of the 36 sets of reads (94.44%), the numbers of strains inferred exactly matched the ground truth (Fig. 2a), with the remaining two sets of reads having an additional chimeric strain predicted at an extremely low level (<0.1%). We therefore set the recommended detection limit at 0.1% to reduce such errors computationally. As this is a relative abundance threshold, one can still target low-abundance organisms by increasing sequence depth. In similar simulations with up to 30 E. coli strains, ConStrains predicted the strain composition with high confidence when the strain number was less than ten (Fig. 2c), which represents the intraspecific upper bound for more than 95% of metagenomic species (Fig. 2d and Supplementary Note 1). 
To assess the impact of intraspecies recombination on performance, we generated both real sequencing reads from highly recombined Burkholderia pseudomallei strains28 and in silico–simulated recombinant strain–derived reads, and identified no significant adverse impact (Supplementary Note 1). We also tested the performance in a more realistic metagenomic scenario by embedding E. coli strains in communities with various levels of complexity and found that our approach remained robust (Online Methods, Supplementary Note 2 and Supplementary Table 2). We also found no significant correlation between admixture compositions' alpha diversity and prediction accuracy. These results collectively suggested good algorithm performance (Supplementary Note 1).
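
The JSD used above to compare true and predicted strain compositions can be computed as in the following sketch. It assumes that both compositions are given as relative abundances over the same ordered list of strains and uses base-2 logarithms so that the value lies between 0 and 1; the logarithm base used in the study is not stated here, so that choice is an assumption.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence (base 2); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two compositions of equal length."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


true_profile = [0.6, 0.3, 0.1]        # hypothetical true strain abundances
predicted_profile = [0.55, 0.35, 0.1]  # hypothetical predicted abundances
print(jsd(true_profile, predicted_profile))
```

Identical profiles give a JSD of 0, and profiles with no strain in common give the maximum of 1, which makes the measure convenient for comparing predictions against random guesses.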

Figure 2: ConStrains correctly predicts the strain composition of in silico–simulated datasets in a comparison of true and predicted strain composition profiles of in silico–simulated multistrain mixtures.

(a) An increasing number of multistrain mixtures (n = 2–7; rows) analyzed with ConStrains either containing only the target strains (pure) or in the context of a metagenome of low, medium and high complexity (+LC, +MC and +HC, respectively). Colors represent different strains that were mixed in six different ratios (relative abundance) with a Shannon index increasing from top to bottom. In the resulting 144 admixtures, all strains were correctly identified. (b) JSD between predicted composition and the true composition to compare the predictions in abundance for each strain. Blue dashed lines mark the expected errors from random guesses. Boxes mark the interquartile range, red bars mark the interquartile median, whiskers represent the top and the bottom 25% data range, and outliers are marked by crosses. Good performance was obtained for all compositions, with minimal difference in the accuracy of results between pure mixtures and metagenomic mixtures. (c) ConStrains' capacity to correctly infer intraspecific structure as a function of the number of strains contained in a sample. Shown is a typical case with the species' relative abundance ranging from 1% to 5% and a sequencing depth of 100 million paired-end reads. The ConStrains' prediction JSD errors (blue dashed line and boxes) were below 1% of null informative prediction errors (random guess; red dashed line) when the number of strains within a species was less than ten. (d) For comparison, three metagenomic samples were randomly chosen from seven different niches, ranging from adult gut microbiome to a marine planktonic community. More than 95% of the species from these metagenomic samples possessed fewer than ten strains (dashed horizontal line). Dashed lines and whiskers mark the interquartile range; plusses mark the outliers.

We then tested ConStrains using a host-derived metagenomic dataset that had previously been analyzed using a manually curated strain-identification approach. Using manual strain curation, the authors had for the first time described the changes in an infant gut microbiome during the first 24 d of life4. All three manually curated Staphylococcus epidermidis strains reported in this study were successfully predicted by ConStrains in a fully automated manner, with the predicted relative abundances of each strain over time having highly similar values to the original compositions quantified from the scaffold coverage (JSD average = 0.024, s.d. = 0.021; Supplementary Fig. 6). Because the performance of ConStrains' fully automated approach matched well with the manually curated, semiautomated approach described previously4, but required far less machine and manual resources (ConStrains completed the infant gut dataset in 8.5 CPU hours with RAM peak footprint of <2 GB on a Linux server with Xeon 2.6 GHz processors, in contrast to days to weeks of manual curation after assembly), we next applied ConStrains to a very large dataset for which a manual effort would not be feasible (for detailed resource usage, see Supplementary Note 5 and Supplementary Table 3).

Because no such large dataset existed (especially one for which the true results were known), we used a simulated shotgun dataset with intraspecific structure mimicking the natural relative abundance of taxa informed by a recent gut microbiome collection effort for which samples were collected daily over the course of one year29 (Fig. 3a, Online Methods and Supplementary Note 3). ConStrains analyzed 91 species with sufficient depth in the 322 in silico samples. In total, ConStrains analyzed 3.2 terabases of paired-end reads containing 1,361 strains from 320 species, with minimal runtime and infrastructure requirements (Supplementary Note 3). ConStrains achieved high accuracy for individual samples, and also captured crucial information such as dominant strain type changes, for example in Bacteroides fragilis (Fig. 3a–c, Supplementary Table 4 and Supplementary Note 3). This large cohort also enabled us to test factors that might affect the performance of ConStrains, including population complexity, coverage and relatedness. We found that 10× coverage was necessary for accurate profiling and that strain relatedness could also affect performance (Supplementary Fig. 7 and Supplementary Note 3). With this thorough benchmarking, we next applied ConStrains to two previously published clinical datasets to illustrate the biological insights strain-level analyses can provide.

Figure 3: ConStrains scales to large time series and accurately predicts strain dynamics.

(a–c) In the absence of existing large time series metagenomic datasets, a simulated set with 322 samples was created. Shown are the strain predictions within the Bacteroides fragilis species. The true (a) and ConStrains-predicted (b) relative abundance of B. fragilis strains (stream ribbon width, with different colors representing different strains) in different samples sorted in longitudinal order (sample index) are illustrated. Insets 1–3 in a indicate periods with different dominant strains. Prediction errors (red line) in each sample, measured between the true and predicted profiles using JSD, are shown in c. For comparison, random guess error (blue line) is shown to indicate a lower performance boundary. Spikes in error rates above 0.1 JSD are mostly related to time points in which the species average coverage drops below 10×, preventing reliable SNP profiling (Supplementary Fig. 7b).

ConStrains reconstructs strain phylogeny

In a published report on the genetic variation of Burkholderia dolosa in cystic fibrosis patients, a selective culturing step had been combined with a deep population sequencing approach30. We reanalyzed that dataset using our ConStrains algorithm and predicted a total of six B. dolosa strains in the population (strains abbreviated as pop-I to pop-VI; Fig. 4a) with an abundance well above 0.1%. We compared the uniGcodes from the six strains inferred by ConStrains with the isolate genome sequence by building a phylogenetic tree, and found that all of the colony strains and two population strains (pop-I and pop-II) were closely related (Fig. 4a). Moreover, the combined relative abundance of pop-I and pop-II represented the majority of the population (51.3% and 27.9% for pop-I and pop-II, respectively). This finding corroborated observations based on the colony sequencing approach. In addition, the ConStrains algorithm identified four additional, less abundant strains (pop-III to pop-VI). None of these strains could be inferred by the colony sequencing approach alone, likely because of their low abundance (11.2%, 8.1%, 1.0% and 0.5%, respectively). To validate these additional predictions, we further examined the polymorphism patterns in these four strains, and compared them against pop-I and pop-II. We found patterns that are unlikely to have resulted from chimeric mixtures of SNPs from pop-I and pop-II (P < 0.01, permutation test; Fig. 4b). This analysis demonstrated that application of ConStrains to cross-sectional datasets, used in addition to a culture-based approach, allows for a comprehensive and efficient discovery of rare strains.

Figure 4: High-sensitivity identification of strain phylogeny using ConStrains in a reanalysis of data from a published study on the genetic variation of B. dolosa populations within cystic fibrosis patients30.

(a) Six B. dolosa strains (pop-I to pop-VI) were predicted with an abundance of >0.1% of the species (diameter of green circles proportional to relative abundance). Shown is an unrooted neighbor-joining tree built on the alignments of the unweighted concatenated SNP profiles for the predicted strains (green circles) and the corresponding genomic data for the 29 cultured isolates (red circles; gray bar indicates the tree distance scale). These results show that the original study retrieved numerous isolates for the two most dominant strains within the population, but could not isolate the lower-abundance strains. Distances between predicted strains and isolates fall within the prediction sensitivity of the ConStrains algorithm (individuals of the same strain differ at no more than 5% of all SNPs). (b) To demonstrate the sensitivity of the algorithm for differentiating strains, the color-coded allelic difference for each of the predicted strains is shown in reference to the most dominant strain, pop-I. Sites with the same allele as reference (pop-I) were not marked.

Uncovering strain dynamics in infant gut development

We next analyzed an infant gut development dataset containing 54 samples from 9 subjects (indexed subjects 1–9; Fig. 5) collected over the first three years of life (Online Methods and Supplementary Fig. 8) to further explore the ability of ConStrains to reveal strain dynamics. We ran a ConStrains analysis on 75 species that had sufficient sequencing depth for analysis (10×; Fig. 5). Because previously reported strain-detection algorithms had been limited to studying the population consensus sequences12, and ConStrains has the capability to untangle intraspecies diversity, we first examined the number of strains observed in each species. Nearly all species (94.66%) had more than two strains, with an average of 4.88 strains per subject (±1.54 s.d.; Supplementary Fig. 9). By tracking the uniGcode of each strain in separate individuals, we identified several unique strain-level longitudinal patterns. For instance, the population of Fecalibacterium prausnitzii usually comprised several strains that maintained a co-dominant profile, in which the strains maintained the same order of abundance; Ruminococcus gnavus had highly variable behaviors over time, with different strains dominating the intraspecies composition at different time points; in contrast, Bacteroides ovatus contained one dominant strain over time, keeping other strains relatively rare. Bifidobacterium bifidum strains demonstrated dynamic patterns similar to those of F. prausnitzii; moreover, the strains reestablished the same intraspecific diversity even after the abundance of the species dropped below the detection limit (Fig. 5). We anticipate that the capability of generating better insights into the intraspecies dynamics of such health-related species31,32,33 will shed light on the role of these organisms in human physiology.

Figure 5: ConStrains analysis revealed species longitudinal dynamics and functional shifts within an infant gut development cohort.

A cohort of nine infants that were sampled throughout the first three years of life, and for which metagenomic data were available for up to nine time points, was analyzed with ConStrains. For 75 species, the depth was sufficient to interpret the underlying strains. The circular tree was constructed using a representative sequence for each species, with the colored outer rings indicating the number of strains observed for each of the nine subjects. Open boxes show the longitudinal dynamics of strains in four selected species; the phylogeny tree inset shows all strains including the available reference genome of B. longum.

With this goal in mind, we pursued our findings in Bifidobacterium longum, an organism linked to human gut health and applied to prevention and treatment of several diseases33. We first observed that the phylogeny of B. longum strains strongly correlated with their host origins (Fig. 5), which indicated strong individuality of B. longum strains. Moreover, in subjects 4 and 6 (Fig. 6a), we observed switches in dominant strain types that were highly correlated with the overall relative abundance of the B. longum species. As previous work has shown that a single operon can affect the competitiveness of different Bacteroides fragilis strains34, we evaluated functional differences between different dominant strains. In both subjects, the different strains dominating during consecutive phases (period 2 in subject 4 and period 1 for subject 6; Fig. 6a) carried additional functions that might be crucial to B. longum's successful colonization of the host gut. In particular, the presence of the human milk oligosaccharide (HMO) utilization cluster has been shown to result from an adaptation to the human infant gut35 (Fig. 6b). Some additional functions might underlie formation of a B. longum bloom, including the presence of the fructose and L-fucose utilization gene clusters (Fig. 6b). Together, these findings might explain why strains with these functions were associated with higher relative abundance of B. longum in the infant gut microbiome. We also observed functions specific to strains that were dominant in periods when B. longum was less abundant (periods 1 and 3 in subject 4 and period 2 in subject 6; Fig. 6a), most notably that the capsular polysaccharide biosynthesis genes were absent from dominant strains when B. longum was more abundant (Fig. 6b). Taken together, strain-level insights provided by ConStrains, combined with functional analyses, could offer candidate targets and hypotheses for future studies.

Figure 6: Functional differences in B. longum strains at different time points during infant gut microbiome development.

(a) Two subjects experienced dominant strain switches within the species B. longum (flanking panels, periods marked by numbered gray shadows). Each track in the middle shows the corresponding sample's coverage over the B. longum reference genome. Time points (days after birth) are marked by red triangles. Windows I–IV capture gene content differences before and after dominant strain switches, reflected by the reference genome. (b) The four highlighted regions (I–IV in a) indicate strain-specific functional cohesion that is also strongly associated with B. longum relative abundance in gut microbiome development.

Discussion

We show that the ConStrains algorithm accurately predicts strain-level profiles in large cohorts of metagenomic samples, and that the inferred uniGcodes reconstruct strain phylogeny, within or across cohorts, allowing combined cohort studies. ConStrains is scalable and has minimal resource requirements. In contrast, other approaches14,16,17 are largely dependent on prior knowledge of reference strain genomes, with subspecies resolution being directly dependent on the number of available reference strains per species. This greatly limits the application of such methods on real metagenomic data, because for most of the human microbiome species only one reference genome is available14. Current databases are quickly gaining in intraspecies genome representation, but are still far from saturating natural diversity. With just one genome per species, ConStrains can resolve natural diversity occurring within that species, and is therefore agnostic to unknown strains. Future improvements for strain-level analysis include identification of strains in the absence of any reference genomes. It is conceivable that combining ConStrains with de novo genome assembly from metagenomic data could be an appropriate candidate to overcome this hurdle.

ConStrains is particularly effective for obtaining insights that were previously overlooked using species-level findings (Supplementary Note 4 and Supplementary Figs. 10–12), and will thus enable new types of studies. As we showed with the B. longum example, combining strain-level profiles with reference genome–based gene coverage analysis can reveal candidate genes for understanding strain-specific beneficial effects and the functions that might contribute to successful colonization in the human gut. ConStrains could also identify strains or genes associated with disease and link variable genomic regions to individual strains, a major challenge in shotgun metagenomics. Strain-level profiles, together with appropriate metadata, could link reference-based or de novo–assembled genes with individual strains and help interpret unknown strain-specific functions. Our study of the infant gut development cohort captured HMO utilization cluster enrichment shifts in different development periods, which is particularly important because products of the HMO utilization cluster are essential for B. longum to exert its probiotic effects36. Finally, strain phylogeny could be used across cohorts and add metagenomic means to test fundamental ecological hypotheses, including neutral theory and other adaptive and nonadaptive mechanisms for maintaining sympatric diversity among strains. Although we applied ConStrains to human microbiome datasets, it can also be applied to environmental samples to test fundamental hypotheses about the role of microbes in the environment that can only be addressed at the strain level.

Methods

ConStrains algorithm.

Extracting target species and informative SNPs. With raw reads from samples S1, S2, ..., Sn, ConStrains starts by profiling the input metagenomes using MetaPhlAn1 (v1.7) with default settings, with the exception that alignment options are set to 'very-sensitive'; species with average coverage higher than a coverage cutoff (default: 10×) in at least one sample are selected for further strain analysis. For each of the selected species, the corresponding set of universally conserved genes reported in ref. 1 is used as a database, and Bowtie2 (ref. 24) mapping with default settings is carried out to map reads against those reference genes. Only reads with proper pairing and orientation, no indels, >30 mapping quality, >90 length mapped (overhanging part at gene 5′ and 3′ ends is clipped off before calculation) and at least 95% nucleotide identity with the reference gene are used. These reads are then piled up onto their respective reference sequences using SAMtools25, and the reference gene coverage is subsequently calculated on a per-base basis. To filter out genes with spurious mappings due to hypervariable regions or conserved universal motifs, sites with less than the default minimum coverage, as well as those outside of the 1.5 interquartile coverage range across the gene length, are masked. Any gene with more than 30% of its length masked is discarded from further analysis. SNPs are then counted across samples as those unmasked positions where the minor allele has at least two counts or more than 3% in relative abundance.
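
The site-masking and SNP-counting rules above can be sketched as follows. This is an illustrative reading rather than the ConStrains implementation: the function names are hypothetical, the per-site minimum coverage of 10 is an assumed value, the "1.5 interquartile coverage range" is interpreted as the usual Tukey fences (Q1 − 1.5 × IQR to Q3 + 1.5 × IQR), and the minor allele is taken to be the second most frequent base.

```python
import statistics


def mask_sites(coverage, min_coverage=10):
    """Flag positions of one gene with low or outlying per-base coverage.

    min_coverage=10 is an assumed default; the fences are an assumed
    interpretation of the 1.5 interquartile coverage range.
    """
    q1, _, q3 = statistics.quantiles(coverage, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [c < min_coverage or c < low or c > high for c in coverage]


def keep_gene(mask, max_masked_fraction=0.30):
    """Discard a gene when more than 30% of its length is masked."""
    return sum(mask) / len(mask) <= max_masked_fraction


def is_snp(base_counts, min_count=2, min_freq=0.03):
    """SNP rule: minor allele has >= 2 counts or > 3% relative abundance."""
    counts = sorted(base_counts.values(), reverse=True)
    total = sum(counts)
    if len(counts) < 2 or total == 0:
        return False
    minor = counts[1]  # second most frequent base (assumption)
    return minor >= min_count or minor / total > min_freq


gene_coverage = [20] * 9 + [200]   # one position with a coverage spike
mask = mask_sites(gene_coverage)
print(mask, keep_gene(mask))
print(is_snp({"A": 98, "C": 2}), is_snp({"A": 100, "C": 1}))
```

Here only the spiking position is masked, so the gene is retained; the first pileup counts as a SNP because the minor allele is seen twice, and the second does not because a single supporting read at about 1% frequency satisfies neither criterion.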

Strain typing by SNP-flow algorithm. With SNPs extracted, ConStrains first infers the strain composition and their SNP types using the 'SNP-flow' algorithm in per-species, per-sample fashion. In this algorithm, all SNP sites are first hierarchically clustered by the Euclidean distance between the frequencies of different alleles, defined as

$$d(a, b) = \sqrt{\sum_{i=1}^{4} (a_i - b_i)^2}$$

where a and b are the frequency vectors of the four bases, sorted in descending order, of the respective SNPs. Clusters that contain less than 5% of the overall SNPs or fewer than ten SNPs are discarded. The centroid of each cluster is selected as representative. These SNP cluster centroids (SCCs) are then ranked in descending order based on their weight, quantified as the number of SNPs they represent. Finally, a directed graph is constructed using these SCCs, in which nodes are alleles in these SCCs and each is assigned a 'capacity' defined by the allele frequency, and these alleles from neighboring SCCs are connected by edges (Fig. 1e).
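
The distance between two SNP sites and the cluster-size filter described above can be sketched as follows. The function names are hypothetical, and the hierarchical clustering itself is omitted; only the pairwise distance on sorted frequency vectors and the rule for discarding small clusters are shown.

```python
import math


def sorted_frequencies(base_counts):
    """Frequencies of A, C, G and T at one SNP site, in descending order."""
    total = sum(base_counts.values())
    return sorted((base_counts.get(b, 0) / total for b in "ACGT"), reverse=True)


def snp_distance(a, b):
    """Euclidean distance between two sorted frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def keep_cluster(cluster_size, total_snps, min_fraction=0.05, min_size=10):
    """Keep clusters holding at least 5% of all SNPs and at least ten SNPs."""
    return cluster_size >= min_size and cluster_size / total_snps >= min_fraction


site_1 = sorted_frequencies({"A": 8, "C": 2})
site_2 = sorted_frequencies({"G": 2, "T": 8})
print(snp_distance(site_1, site_2))
```

Because the frequency vectors are sorted before comparison, two sites with the same 80:20 split fall at distance zero even though they involve different bases, which is what allows sites that separate the same strains to cluster together.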

In the directed graph constructed in the previous step, nodes from the same SCC are denoted as a layer. With m layers in the graph, SNP-flow explores all possible combinations of paths from the first layer to the last. Every such path represents a strain genotype, and its relative abundance, c, is defined as the lowest node capacity among all nodes on the path. Once a path is visited, the capacity of every node on this path is reduced by the path's relative abundance c (Fig. 1e). This pathfinding and visiting step is repeated until all nodes' capacities are zero, and the visited paths constitute one combination. ConStrains exhausts all possible SNP-type (strains) combinations β = {β1, β2, ..., βk} in each sample with the ith sample's SNP-type βi = bi1bi2...bih where bij is one of the four bases, A, C, G and T, and the associated strain profile αi = (αi1,αi2,...αih) with

$$\sum_{j=1}^{h} \alpha_{ij} = 1$$

For each sample, ConStrains picks the optimal combination that minimizes the fitting error, defined as the discrepancy between expected SNP frequencies and observed frequencies, ɛ, defined as:

where Eij is the expected frequency of the ith base at the jth SNP locale and, similarly, Oij is the observed frequency of the ith base at the jth SNP locale in the pileup of aligned reads from the corresponding sample. For instance, C is the second base (i = 2); if we observe two Cs and eight As at the fifth SNP locale (j = 5) in the pileup of reads aligned against the reference, the frequency of C at that position is 0.2, and thus O25 = 0.2. Eij is inferred using αi and βi such that

Inferring strain compositions. To unify these optimal SNP types into cohort-wide strains, ConStrains next constructs a neighbor-joining tree of the SNP types from different samples based on sequence percentage identity, and uses an internal parameter, Δd, defined as the distance between the tree-cutting point and the leaves, to cut the tree. Rather than using a preset value, the algorithm cuts this tree using all possible Δd. Each internal node created by such a cut can be viewed as the representative of all its child nodes (SNP types) on the tree. In doing so, it identifies all possible k clusters defined by the structure of the tree of SNP types (Fig. 1f), which we refer to as candidate strains.

With the proposed k strains from the previous step, in each sample, we need to find a composition, α*= (α*1, α*2, ..., α*k) with

$$\sum_{j=1}^{k} \alpha^{*}_{j} = 1$$

to minimize the discrepancy between expected SNP frequencies and observed frequencies, ɛ, as defined previously. This is carried out by a Metropolis-Hastings Monte Carlo method. ConStrains first generates a number of seeds (default: 1,000) uniformly at random on the (k − 1)-simplex. The top 50 seeds are then selected, and each seed's vicinity on the (k − 1)-simplex is iteratively searched. In iteration t, a new point, $\alpha^{t}_{ik}$, is picked within a radius of 0.01 of the previous point, $\alpha^{t-1}_{ik}$, and it is accepted as the new point with probability $\min(1, q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik}))$, where $q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik}) = \varepsilon(\alpha^{t}_{ik})/\varepsilon(\alpha^{t-1}_{ik})$. The iteration repeats until $|1 - q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik})|$ is smaller than 0.001 or the maximum number of iterations (10,000) is reached. The composition yielding the lowest ɛ is selected as the optimal α*ik. ConStrains repeats this step for all samples and all k, yielding solutions for each k, α*k = (α*1, α*2, ..., α*n), with corresponding error (Fig. 1g):

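The seeded search over the simplex can be illustrated with the sketch below, which rests on several assumptions. All function names are hypothetical. The fitting error is taken to be the root-mean-square gap between expected and observed base frequencies, with the expected frequency of a base computed as the summed abundance of the strains carrying it. The proposal jitters each coordinate by at most the radius and renormalizes, which only approximates a ball of radius 0.01. Finally, the acceptance ratio is oriented as previous error over proposed error, so that lower-error proposals are always accepted; the text writes the ratio the other way round, so this orientation is an assumption as well.

```python
import math
import random

BASES = "ACGT"


def fitting_error(alpha, snp_types, observed):
    """Assumed error form: RMS gap between expected and observed frequencies.
    snp_types[s][j] is strain s's base at SNP j; observed[j] maps base -> freq."""
    total = 0.0
    for j, obs in enumerate(observed):
        for base in BASES:
            # Expected frequency: summed abundance of strains with this base.
            expected = sum(a for a, t in zip(alpha, snp_types) if t[j] == base)
            total += (expected - obs.get(base, 0.0)) ** 2
    return math.sqrt(total / (len(BASES) * len(observed)))


def random_simplex_point(k, rng):
    """Uniform draw on the (k - 1)-simplex via normalized exponentials."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def propose(alpha, radius, rng):
    """Nearby point: per-coordinate jitter, clipped at zero, renormalized."""
    moved = [max(a + rng.uniform(-radius, radius), 0.0) for a in alpha]
    total = sum(moved)
    return [m / total for m in moved] if total > 0 else list(alpha)


def fit_composition(snp_types, observed, n_seeds=1000, n_top=50,
                    radius=0.01, tol=0.001, max_iter=10000, seed=0):
    rng = random.Random(seed)
    k = len(snp_types)

    def err(alpha):
        return fitting_error(alpha, snp_types, observed)

    # Rank random seeds by error and refine only the best n_top of them.
    seeds = sorted((random_simplex_point(k, rng) for _ in range(n_seeds)), key=err)
    best, best_err = seeds[0], err(seeds[0])
    for current in seeds[:n_top]:
        cur_err = err(current)
        for _ in range(max_iter):
            cand = propose(current, radius, rng)
            cand_err = err(cand)
            # Assumed orientation: q > 1 means the proposal fits better.
            q = math.inf if cand_err == 0.0 else cur_err / cand_err
            if rng.random() < min(1.0, q):
                current, cur_err = cand, cand_err
                if cur_err < best_err:
                    best, best_err = current, cur_err
            if abs(1.0 - q) < tol:  # error ratio has stabilized
                break
    return best, best_err
```

With two candidate strains of SNP types AC and GT and observed frequencies of 0.7/0.3 at both SNPs, the returned composition should lie close to (0.7, 0.3).
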
Selecting the optimal strain model. The corrected Akaike information criterion (AICc) is employed to select the optimal k. The AICc of each k is calculated as:

where L = 1 − ɛk denotes the model likelihood. ConStrains selects the k with the lowest AICc and outputs the associated SNP types and compositions as final results (Fig. 1h). As noted previously, we suggest filtering out strains with a relative abundance below 0.1%, as they have a high probability of being chimeric.
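
As an illustration of the selection step, the sketch below scores each k with the standard textbook form of AICc, AICc = 2k − 2 ln L + 2k(k + 1)/(n − k − 1), using L = 1 − ɛk. Treating k as the number of model parameters and n as the number of observations are assumptions made here, as are the function names; the exact form used by ConStrains may differ.

```python
import math


def aicc(k, error, n):
    """Standard AICc with likelihood L = 1 - error.
    k: number of strains, taken here as the parameter count (assumption).
    n: number of observations (assumption); requires n > k + 1."""
    likelihood = 1.0 - error
    aic = 2 * k - 2 * math.log(likelihood)
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def select_k(errors_by_k, n):
    """Return the k with the lowest AICc, given a dict k -> fitting error."""
    return min(errors_by_k, key=lambda k: aicc(k, errors_by_k[k], n))
```

The correction term grows with k, so a larger model is preferred only when it reduces the fitting error enough to offset the penalty.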

In silico datasets.

To simulate in silico single-species datasets, 62 complete E. coli genome sequences were downloaded from an NCBI database. Ten genomes were selected, and their relatedness is shown by a maximum likelihood tree (Supplementary Fig. 3a) constructed from concatenated nucleotide sequences of core genes among the 10 strains using a method similar to ref. 19. For each number of strains (N = 2–7), 1,000 random compositions were sampled from a gamma distribution with shape k = 1 and scale θ = 0.5. Within each set of 1,000 compositions, the Shannon entropy of every composition was calculated and used to rank the compositions. The compositions at the 15th, 30th, ..., 90th percentiles were picked to form a gradient of intraspecific diversity for each N. The ART simulator37 was employed to simulate 100× coverage of 100-bp paired-end Illumina reads from these compositions with default Illumina settings and library settings of “-m 350 -s 50” (Supplementary Fig. 3a). These samples were further grouped together to simulate single strain series samples (Supplementary Table 1).
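
The composition sampling can be sketched as follows. This is a hedged illustration with hypothetical function names: it assumes that each composition is obtained by drawing one gamma variate per strain and normalizing the draws to sum to one, that entropy is computed with the natural logarithm, and that percentiles are taken by simple index rounding.

```python
import math
import random


def shannon_entropy(p):
    """Shannon entropy (natural log) of a composition."""
    return -sum(x * math.log(x) for x in p if x > 0)


def sample_compositions(n_strains, n_samples=1000, shape=1.0, scale=0.5, seed=0):
    """Draw gamma(shape, scale) variates per strain and normalize to sum to 1
    (the normalization is an assumption)."""
    rng = random.Random(seed)
    comps = []
    for _ in range(n_samples):
        draws = [rng.gammavariate(shape, scale) for _ in range(n_strains)]
        total = sum(draws)
        comps.append([d / total for d in draws])
    return comps


def diversity_gradient(comps, percentiles=(15, 30, 45, 60, 75, 90)):
    """Rank compositions by entropy and pick those at the given percentiles."""
    ranked = sorted(comps, key=shannon_entropy)
    picks = []
    for pct in percentiles:
        idx = min(len(ranked) - 1, round(pct / 100 * len(ranked)))
        picks.append(ranked[idx])
    return picks
```

Calling `diversity_gradient(sample_compositions(4))` returns six four-strain compositions ordered from the most uneven to the most even.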

These simulated E. coli reads were then spiked into in silico–constructed metagenomes to measure the impact of other species. Three human microbiome-like metagenomes with low, medium, and high complexity levels (referred to as LC, MC and HC, respectively) were simulated based on an aggregated MetaPhlAn1 profile over all 690 Human Microbiome Project (HMP) samples38. E. coli and Shigella were excluded from the profile, and the rest of the species were ranked based on their average abundance in the HMP cohort. The top 20, 50, and 100 most abundant species were selected for LC, MC and HC, respectively. The species composition in each in silico metagenome was calculated as their relative abundance in the HMP cohort, normalized by their total sum. Genomes of these species were downloaded from NCBI, and a representative strain was selected at random if multiple strains of the same species were present. A total of 100 million 100-bp paired-end Illumina reads were simulated for each set with the ART simulator37 using the same settings as mentioned previously. Additional datasets for testing the sensitivity and the performance on different numbers of strains and recombined strains were generated in a similar fashion using ART (Supplementary Note 1).
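
The construction of the background community composition amounts to a filter, a ranking and a renormalization, as in the sketch below. The function name and the dictionary-based input format are assumptions made for illustration.

```python
def metagenome_composition(mean_abundance, top_n, excluded=frozenset()):
    """Build a background community from cohort-wide average abundances.

    mean_abundance: dict mapping species name -> average relative abundance.
    top_n: number of most abundant species to keep (20, 50 or 100 in the text).
    excluded: species names to drop before ranking.
    """
    kept = {s: a for s, a in mean_abundance.items() if s not in excluded}
    top = sorted(kept, key=kept.get, reverse=True)[:top_n]
    total = sum(kept[s] for s in top)
    # Normalize the selected abundances by their total sum.
    return {s: kept[s] / total for s in top}
```

With made-up abundances {sp_a: 0.4, sp_b: 0.1, sp_c: 0.3, sp_d: 0.2}, excluding sp_a and keeping the top two species gives sp_c at 0.6 and sp_d at 0.4.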

The year-long shotgun metagenome cohort with 322 samples was simulated based on donor A's 16S rRNA amplicon profiles reported in ref. 29. The operational taxonomic unit (OTU) table was used as a guide for community composition in human microbiomes. To allow simulation at the strain level, however, taxonomy in the OTU table was shifted down by one level. For instance, species composition in the original OTU table was shifted to be the strain composition. NCBI draft and complete genomes were used to match as closely as possible the phylogeny of the original OTUs. Reads were then simulated with the ART simulator as previously described. The coverage was set to be 1× per 25 read counts in the 16S OTU table.

Biological datasets.

The two infant gut development longitudinal metagenomic datasets used in this study were from a previous study4 and from our recent effort in tracking nine subjects in a three-year period since birth. For the former set, all metagenomic samples were downloaded from NCBI SRA under accession number SRA052203, and the corresponding assembled S. epidermidis strains and phage genomes were downloaded from ggKBase as described4. For the latter set, 54 stool samples were collected from nine infant subjects between September 2008 and August 2010 in Finland. Samples were first collected by the subjects' parents and stored in the household freezer before being transferred on dry ice to a laboratory −80 °C freezer. Samples were then shipped to the Broad Institute for DNA extraction, in which QIAamp DNA Stool Mini Kit (Qiagen, Inc.) was used as described previously39. Library construction was carried out following Human Microbiome Project's standard protocol (http://hmpdacc.org/resources/tools_protocols.php), and 101-bp paired-end reads were produced on an Illumina HiSeq 2000 platform. The raw sequences of these samples are available at SRA under BioProject accession number PRJNA269305, and the corresponding sample information is available in Supplementary Table 5.

Prediction accuracy measurement.

To measure how close the predicted composition, P, is to the true composition, Q, we applied Jensen-Shannon divergence with minor modifications. As P and Q can have different dimensions, we first padded the one with the lower dimension with zeros to match the one with the higher dimension, and then defined a composition M based on the sorted P and Q, P′ and Q′, as:

$$M = \frac{1}{2}(P' + Q')$$

Therefore the Jensen-Shannon divergence is:

$$\mathrm{JSD}(P' \,\|\, Q') = \frac{1}{2} D(P' \,\|\, M) + \frac{1}{2} D(Q' \,\|\, M)$$

where D(X||Y) is the Kullback-Leibler divergence defined as:

$$D(X \,\|\, Y) = \sum_{i} X(i) \log \frac{X(i)}{Y(i)}$$
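
The modified divergence can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, compositions are sorted in descending order with the zero padding placed at the end, and the natural logarithm is used.

```python
import math


def kl_divergence(x, y):
    """Kullback-Leibler divergence D(X||Y); terms with X(i) = 0 contribute 0."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)


def padded_jsd(p, q):
    """Jensen-Shannon divergence between two compositions of possibly
    different length, after sorting (descending) and zero padding."""
    n = max(len(p), len(q))
    p_sorted = sorted(p, reverse=True) + [0.0] * (n - len(p))
    q_sorted = sorted(q, reverse=True) + [0.0] * (n - len(q))
    # M is the element-wise mean of the two sorted, padded compositions.
    m = [(a + b) / 2 for a, b in zip(p_sorted, q_sorted)]
    return 0.5 * kl_divergence(p_sorted, m) + 0.5 * kl_divergence(q_sorted, m)
```

Because both compositions are sorted first, the measure compares abundance profiles irrespective of which strain carries which abundance; a prediction of (0.3, 0.7) against a truth of (0.7, 0.3) therefore scores zero.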

We calculated the SNP typing accuracy as the distance between the inferred SNP tree of strains, kTp, and the true strain tree constructed from concatenated core genes, Tq. First, a distance similar to the symmetric difference introduced by Robinson and Foulds40 was applied to calculate the distance, d, between these two trees. We then normalized d to the expected basal distance from a random tree with the same leaves. The expected basal distance, d, is the mean distance between Tq and 1,000 randomly generated trees with the same leaves.