Strain profiling and epidemiology of bacterial species from metagenomic sequencing

Microbial communities are often composed by complex mixtures of multiple strains of the same species, characterized by a wide genomic and phenotypic variability. Computational methods able to identify, quantify and classify the different strains present in a sample are essential to fully exploit the potential of metagenomic sequencing in microbial ecology, with applications that range from the epidemiology of infectious diseases to the characterization of the dynamics of microbial colonization. Here we present a computational approach that uses the available genomic data to reconstruct complex strain profiles from metagenomic sequencing, quantifying the abundances of the different strains and cataloging them according to the population structure of the species. We validate the method on synthetic data sets and apply it to the characterization of the strain distribution of several important bacterial species in real samples, showing how its application provides novel insights on the structure and complexity of the microbiota.

. sytheticII dataset -B. longum. Matthew Correlation Coefficient of the strains predicted by StrainEst for the 12 different samples with relative abundances 90%-10% (left column), 70%-30% (center column) and 50%-50% (right column) and coverage 10X (top row), 20X (second top row), 50X (third row), and 100X (bottom row). Strains are considered predicted positive if their predicted relative abundance exceeds a given threshold. The plotted data are for values of the threshold between 0.01 and 0.2. Boxes indicate the 25% and 75% percentiles while whiskers extend to the highest (lowest) value that is within 1.5 times the inter-quartile range. Outliers are shown as grey dots.
Supplementary Fig. 7. LOOEcoli dataset. Performances of StrainEst in the analysis of a metagenomic samples containing one strain that is absent from the reference database. Median mash distance between the predicted dominant E. coli strain and the actual (a), median estimated relative abundance of the dominant strain (b) and percentage of unclassified metagenomes (c) for three different SNV profile identity thresholds (parameterd/--max-ident-thr). Error bars indicate the first and the third quartile. In all cases, StrainEst identified a dominant strain that was closely related to the actual. However, using the default value of the compatibility threshold StrainEst overestimated the sample complexity in an attempt to compensate for the missing strain. As the threshold increased, the accuracy of the prediction increased, but the number of predictable metagenomes decreased.
Supplementary Fig. 8. P. acnes Neighbor Joining tree using Mash distances. Large dots depict the representative strains after the SNV clustering steps. Colors indicates cluster membership.
Supplementary Fig. 9. Frequency distribution of the allelic variants of P. acnes Subject HV01, Hp site for three different timepoints (T1,T2,T3). Transition from the low diversity (T1) to the high diversity (T2, T3) phenotype. In this example, the Phylogenetic Diversity increases from 0.02 (T1) to 0.043 (T2,T3). While the bi-modal frequency distribution of the allelic variants is indicative of the presence of a single strain at T1, multiple peaks appear at T2 and T3, supporting the presence of a more complex population. For clarity, the y-axis range is truncated at 3%.
Supplementary Fig. 10. HMP oral dataset. Principal Coordinate Analysis (PCoA) using the Weighted UniFrac distances computed on the predicted relative abundances of species within the Neisseria genus and the phylogenetic tree estimated with the neighbor-joining method on the Mash distances. Samples with a reconstruction Pearson coefficient R<0.8 were removed from the analysis. Supplementary Table 6. Selection of the representative genomes for SNV profiling. The Mash distance threshold from the species representative is the threshold used for the preliminary clustering from the pairwise Mash distance matrix (see Fig. 1a, main text). This clustering yields a set of representative genomes that are aligned against the species representative using NUCmer to identify the core genome and the set of SNVs. Reference SNV profiles are finally clustered obtaining the SNV matrix used in the modeling step (see Fig. 1b and 1c, main text).

Comparison to existing tools
To compare the performances of StrainEst to existing tools, we run ConStrains, PanPhlAn, PathoScope, Sigma, and Bowtie 2 on the 50 independent samples of the syntheticEcoli dataset.