Abstract
Sequencing-based approaches for the analysis of microbial communities are susceptible to contamination, which could mask biological signals or generate artifactual ones. Methods for in silico decontamination using controls are routinely used, but do not make optimal use of information shared across samples and cannot handle taxa that only partially originate in contamination or leakage of biological material into controls. Here we present Source tracking for Contamination Removal in microBiomes (SCRuB), a probabilistic in silico decontamination method that incorporates shared information across multiple samples and controls to precisely identify and remove contamination. We validate the accuracy of SCRuB in multiple data-driven simulations and experiments, including induced contamination, and demonstrate that it outperforms state-of-the-art methods by an average of 15–20 times. We showcase the robustness of SCRuB across multiple ecosystems, data types and sequencing depths. Demonstrating its applicability to microbiome research, SCRuB facilitates improved predictions of host phenotypes, most notably the prediction of treatment response in melanoma patients using decontaminated tumor microbiome data.
Data availability
Sequencing data from our experiments, along with all relevant metadata, was uploaded to SRA, accession PRJNA905430 (ref. 55). All other datasets analyzed in this study are publicly available. The college dormitory dataset25 used in Fig. 1 and Extended Data Figs. 3–5 is available from the European Nucleotide Archive (ENA), accession ERP115809, and Qiita41, study ID 12470. The marine sediments dataset, used in Extended Data Fig. 3a,b, is available from Qiita41, study ID 11922. The fish microbiome dataset42, used in Extended Data Fig. 3c,d, is available from ENA, accession PRJEB54736, and Qiita41, study ID 13414. The Earth Microbiome Project soil dataset43, used in Extended Data Fig. 3e,f, is available from ENA, accession PRJEB42019, and Qiita41, study ID 13114. The office dataset44, used in Extended Data Fig. 3g,h, is available from ENA, accession PRJEB13115, and Qiita41, study ID 10423. The Central Park soil dataset45, used in Extended Data Fig. 3i,j, is available from ENA, accession PRJEB6614, and Qiita41, study ID 2104. The gut metagenomic dataset46, used in Extended Data Fig. 3k,l, is available from ENA, accession PRJEB50408, and Qiita41, study ID 13692. The negative controls dataset used in Fig. 1 and Extended Data Figs. 3a–f, 4 and 5 is available from Qiita41, study ID 12019; the one used in Extended Data Fig. 3g,h,k,l is available from ENA, accession PRJEB40903, and Qiita41, study ID 12201; and the one used in Extended Data Fig. 3i,j is available from ENA, accession PRJEB25617, and Qiita41, study ID 10333. The well-to-well leakage dataset32 is available from ENA, accession ERP115213. The plasma cfDNA data20 is available from ENA, accessions ERP119598, ERP119596 and ERP119597; and Qiita41, study IDs 12667, 12691 and 12692. The tumor microbiome dataset18 is available from SRA, accession PRJNA624822. The processed data was obtained from Supplementary Table 2 in ref. 18.
Code availability
SCRuB is available at https://github.com/Shenhav-and-Korem-labs/SCRuB (ref. 56) and requires R (≥3.6.3), glmnet57 (4.1-4) and torch (1.3.1). A Code Ocean capsule replicating all analyses in this paper is available at https://codeocean.com/capsule/5737862/tree/v1 (ref. 58), with source code also available at https://github.com/Shenhav-and-Korem-labs/SCRuB_analysis. Both use tidyverse59 (0.7.2) and XGBoost60 (1.5.0). The decontamination pipeline used by Nejman et al.18 is available from Zenodo at https://doi.org/10.5281/zenodo.3740536, and the prediction pipeline used by Poore et al.20 is available at https://github.com/biocore/tcga.
References
Salter, S. J. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).
Weyrich, L. S. et al. Laboratory contamination over time during low-biomass sample analysis. Mol. Ecol. Resour. 19, 982–996 (2019).
Kim, D. et al. Optimizing methods and dodging pitfalls in microbiome research. Microbiome 5, 52 (2017).
Eisenhofer, R. et al. Contamination in low microbial biomass microbiome studies: issues and recommendations. Trends Microbiol. 27, 105–117 (2019).
Weiss, S. et al. Tracking down the sources of experimental contamination in microbiome studies. Genome Biol. 15, 564 (2014).
Aagaard, K. et al. The placenta harbors a unique microbiome. Sci. Transl. Med. 6, 237ra65 (2014).
Parnell, L. A. et al. Microbial communities in placentas from term normal pregnancy exhibit spatially variable profiles. Sci. Rep. 7, 11200 (2017).
Seferovic, M. D. et al. Visualization of microbes by 16S in situ hybridization in term and preterm placentas without intraamniotic infection. Am. J. Obstet. Gynecol. 221, 146.e1–146.e23 (2019).
de Goffau, M. C. et al. Human placenta has no microbiome but can contain potential pathogens. Nature 572, 329–334 (2019).
Leiby, J. S. et al. Lack of detection of a human placenta microbiome in samples from preterm and term deliveries. Microbiome 6, 196 (2018).
Kuperman, A. A. et al. Deep microbial analysis of multiple placentas shows no evidence for a placental microbiome. BJOG 127, 159–169 (2020).
Sinha, R., Abnet, C. C., White, O., Knight, R. & Huttenhower, C. The microbiome quality control project: baseline study design and future directions. Genome Biol. 16, 276 (2015).
Edmonds, K. & Williams, L. The role of the negative control in microbiome analyses. FASEB J. 31, 940.3 (2017).
Schierwagen, R. et al. Trust is good, control is better: technical considerations in blood microbiome analysis. Gut 69, 1362–1363 (2020).
de Goffau, M. C. et al. Recognizing the reagent microbiome. Nat. Microbiol. 3, 851–853 (2018).
van der Horst, J. et al. Sterile paper points as a bacterial DNA-contamination source in microbiome profiles of clinical samples. J. Dent. 41, 1297–1301 (2013).
Olomu, I. N. et al. Elimination of ‘kitome’ and ‘splashome’ contamination results in lack of detection of a unique placental microbiome. BMC Microbiol. 20, 157 (2020).
Nejman, D. et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria. Science 368, 973–980 (2020).
Pinto-Ribeiro, I. et al. Evaluation of the use of formalin-fixed and paraffin-embedded archive gastric tissues for microbiota characterization using next-generation sequencing. Int. J. Mol. Sci. 21, 1096 (2020).
Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574 (2020).
Wang, J. et al. Translocation of vaginal microbiota is involved in impairment and protection of uterine health. Nat. Commun. 12, 4191 (2021).
Lam, S. Y. et al. Technical challenges regarding the use of formalin-fixed paraffin embedded (FFPE) tissue specimens for the detection of bacterial alterations in colorectal cancer. BMC Microbiol. 21, 297 (2021).
Allali, I. et al. Gut microbiome compositional and functional differences between tumor and non-tumor adjacent tissues from cohorts from the US and Spain. Gut Microbes 6, 161–172 (2015).
Marotz, C. et al. SARS-CoV-2 detection status associates with bacterial community composition in patients and the hospital environment. Microbiome 9, 132 (2021).
Richardson, M., Gottel, N., Gilbert, J. A. & Lax, S. Microbial similarity between students in a common dormitory environment reveals the forensic potential of individual microbial signatures. mBio 10, e01054-19 (2019).
Chen, Q.-L. et al. Rare microbial taxa as the major drivers of ecosystem multifunctionality in long-term fertilized soils. Soil Biol. Biochem. 141, 107686 (2020).
Smirnova, E., Huzurbazar, S. & Jafari, F. PERFect: PERmutation Filtering test for microbiome data. Biostatistics 20, 615–631 (2019).
Davis, N. M., Proctor, D. M., Holmes, S. P., Relman, D. A. & Callahan, B. J. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6, 226 (2018).
McKnight, D. T. et al. microDecon: a highly accurate read‐subtraction tool for the post‐sequencing removal of contamination in metabarcoding studies. Environ. DNA 1, 14–25 (2019).
Shenhav, L. et al. FEAST: fast expectation-maximization for microbial source tracking. Nat. Methods 16, 627–632 (2019).
Knights, D. et al. Bayesian community-wide culture-independent microbial source tracking. Nat. Methods 8, 761–763 (2011).
Minich, J. J. et al. Quantifying and understanding well-to-well contamination in microbiome research. mSystems 4, e00186-19 (2019).
Lou, Y. C. et al. Using strain-resolved analysis to identify contamination in metagenomics data. Preprint at bioRxiv https://doi.org/10.1101/2022.01.16.476537 (2022).
An, U. et al. STENSL: Microbial Source Tracking with ENvironment SeLection. mSystems 7, e0099521 (2022).
Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).
Karstens, L. et al. Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments. mSystems 4, e00290-19 (2019).
Flores, R. et al. Collection media and delayed freezing effects on microbial composition of human stool. Microbiome 3, 33 (2015).
Adams, R. I., Bateman, A. C., Bik, H. M. & Meadow, J. F. Microbiota of the indoor environment: a meta-analysis. Microbiome 3, 49 (2015).
Lou, Y. C. et al. Infant gut strain persistence is associated with maternal origin, phylogeny, and traits including surface adhesion and iron acquisition. Cell Rep. Med. 2, 100393 (2021).
Hornung, B. V. H., Zwittink, R. D. & Kuijper, E. J. Issues and current standards of controls in microbiome research. FEMS Microbiol. Ecol. 95, fiz045 (2019).
Gonzalez, A. et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat. Methods 15, 796–798 (2018).
Minich, J. J. et al. Host biology, ecology and the environment influence microbial biomass and diversity in 101 marine fish species. Nat. Commun. 13, 6978 (2022).
Shaffer, J. P. et al. Standardized multi-omics of Earth’s microbiomes reveals microbial and metabolite diversity. Nat. Microbiol. 7, 2128–2150 (2022).
Chase, J. et al. Geography and location are the primary drivers of office microbiome composition. mSystems 1, e00022-16 (2016).
Ramirez, K. S. et al. Biogeographic patterns in below-ground diversity in New York City’s Central Park are similar to those observed globally. Proc. Biol. Sci. 281, 20141988 (2014).
Hanes, D. et al. The gastrointestinal and microbiome impact of a resistant starch blend from potato, banana, and apple fibers: a randomized clinical trial using smart caps. Front. Nutr. 9, 987216 (2022).
Shaffer, J. P. et al. A comparison of DNA/RNA extraction protocols for high-throughput sequencing of microbial communities. Biotechniques 70, 149–159 (2021).
Ruiz-Calderon, J. F. et al. Walls talk: microbial biogeography of homes spanning urbanization. Sci. Adv. 2, e1501061 (2016).
Robin, X. et al. pROC: an open-source package for R and S to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
Callahan, B. J. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).
Annavajhala, M. K. et al. Oral and gut microbial diversity and immune regulation in patients with HIV on antiretroviral therapy. mSphere 5, e00798-19 (2020).
Graspeuntner, S., Loeper, N., Künzel, S., Baines, J. F. & Rupp, J. Selection of validated hypervariable regions is crucial in 16S-based microbiota studies of the female genital tract. Sci. Rep. 8, 9678 (2018).
Herlemann, D. P. et al. Transitions in bacterial communities along the 2000 km salinity gradient of the Baltic Sea. ISME J. 5, 1571–1579 (2011).
Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
Austin, G. I. et al. Contamination benchmark using human-derived samples. NCBI https://www.ncbi.nlm.nih.gov/bioproject/PRJNA905430 (2022).
Austin, G. I., Shenhav, L. & Korem, T. SCRuB. GitHub https://github.com/Shenhav-and-Korem-labs/SCRuB (2023).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Shenhav, L., Korem, T. & Austin, G. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Code Ocean https://doi.org/10.24433/CO.2307706.v1 (2023).
Wickham, H. et al. Welcome to the tidyverse. J. Open Source Softw. 4, 1686 (2019).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).
Acknowledgements
We thank members of the Korem group for useful discussions. We are grateful to G. D. Poore, C. Martino, R. Knight, R. Straussman and I. Livyatan for assistance with analyzing and interpreting data from their studies, and to R. Straussman and I. Livyatan for helpful comments on the paper. More broadly, we thank all authors and participants involved in the generation of the data used in this study. The study was supported by the Center for Studies in Physics and Biology at Rockefeller University (L.S.), the Program for Mathematical Genomics at Columbia University (T.K.), the CIFAR Azrieli Global Scholarship in the Humans & the Microbiome Program (T.K.), R01HD106017 (T.K.) and R01CA245894 (A.-C.U.).
Author information
Contributions
G.I.A. wrote SCRuB, and designed and conducted all computational analyses. H.K. designed and conducted all experiments. Y.M. assisted with analyses. D.S. contributed to experiments. T.S. collected samples. A.M.C. supervised sample collection. A.-C.U. supervised all experiments. Y.C.L., B.F., M.M. and J.F.B. assisted in obtaining, analyzing and interpreting data from their study. L.S. and T.K. conceived and designed the study, designed analysis, jointly supervised the study and contributed equally to this work. G.I.A., I.P., L.S. and T.K. interpreted the results and wrote the paper.
Ethics declarations
Competing interests
A.-C.U. has received research funding from Merck that is unrelated to this study. The other authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Empirical validation of the source-tracking assumption in data from Nejman et al.18.
The source-tracking assumption30,31,34 in the context of contamination stipulates that taxa present together in a contamination source will be introduced together to other samples, and in similar proportions as in the contamination source. We demonstrate this empirically using data from Nejman et al.18. a, The average relative abundance of each ASV (y-axis) across samples from the Netherlands Cancer Institute, plotted against the abundance of the same ASV across negative controls from the same batch (x-axis; ‘No Template Controls’ in Nejman et al.18), separated into ‘high’ and ‘low’ contamination based on SCRuB’s prediction (contamination parameter p > 0.5 and p ≤ 0.5, respectively). Consistent with the source-tracking assumption, taxa present together in a contamination source are introduced together to the samples, and in similar proportions, resulting in a clear positive correlation between the relative abundance of the taxa that are shared between samples and controls (Pearson R = 0.99, P < 10^−20 and R = 0.082, P = 0.037 for high and low contamination, respectively). As expected, this correlation varies with respect to SCRuB’s predicted contamination in the samples: samples predicted to have high contamination (blue) have a slope of 0.97, while those predicted to have low contamination have a slope of 0.057. b,c, Same as (a) for samples predicted to have the highest (b) and lowest (c) contamination. Pearson R is displayed for panels with >3 shared taxa. Correlation was very high for highly contaminated samples (Pearson R > 0.9, P < 10^−4 for all).
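The source-tracking assumption can be illustrated with a toy mixing model. The sketch below is not SCRuB's implementation: it simply assumes that an observed sample is a convex combination of a true composition and a contamination source, with mixing weight p. The function names and all abundance values are hypothetical. Under this assumption, taxa that occur only in the contaminant appear in the sample at p times their abundance in the control, so the slope of sample abundance against control abundance is close to p.

```python
def mix(true_comp, contaminant, p):
    """Observed composition under a simple mixing model (illustrative assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return [(1 - p) * t + p * c for t, c in zip(true_comp, contaminant)]


def slope_through_origin(x, y):
    """Least-squares slope of y on x with no intercept term."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Made-up relative abundances over five taxa; the last three occur only
# in the contamination source.
true_comp = [0.6, 0.4, 0.0, 0.0, 0.0]
contaminant = [0.0, 0.0, 0.5, 0.3, 0.2]

high = mix(true_comp, contaminant, p=0.9)   # heavily contaminated sample
low = mix(true_comp, contaminant, p=0.05)   # lightly contaminated sample

# Restrict to taxa present in the control, mirroring the 'shared taxa' in the caption.
shared = [i for i, c in enumerate(contaminant) if c > 0]
ctrl = [contaminant[i] for i in shared]
slope_high = slope_through_origin(ctrl, [high[i] for i in shared])
slope_low = slope_through_origin(ctrl, [low[i] for i in shared])
print(slope_high, slope_low)
```

In this toy setting the fitted slope recovers the mixing weight, which is the qualitative pattern the caption reports for high- and low-contamination samples.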
Extended Data Fig. 2 Description of our simulation framework.
A visualization of the simulation framework used to benchmark different decontamination methods. The simulation proceeds in three steps: a, We generate a dataset with 88–94 samples, 2, 4 or 8 controls, and a contamination source from an unrelated study, assumed to be biologically distinct from the samples of interest. All samples are then assigned locations across the plate. b, We add well-to-well leakage to the controls, and contamination from the shared source to the samples of interest (Methods). c, We run decontamination using one of several methods (Methods). The decontaminated dataset is evaluated against the ground truth noncontaminated taxonomic compositions using the Jensen-Shannon divergence.
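The evaluation step can be sketched as follows. This is a minimal, standard-library illustration of the Jensen-Shannon divergence between two taxonomic compositions, not the authors' benchmarking code; the base-2 logarithm (which bounds the divergence between 0 and 1) and the example count vectors are assumptions made for illustration.

```python
import math


def normalize(counts):
    """Convert non-negative counts to relative abundances."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two compositions."""
    p, q = normalize(p_counts), normalize(q_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical read counts over four taxa for a single sample.
ground_truth = [50, 30, 20, 0]
contaminated = [45, 27, 18, 10]    # 10% of reads come from a contaminant taxon
decontaminated = [50, 29, 21, 0]   # output of some decontamination method

print(jsd(ground_truth, contaminated))
print(jsd(ground_truth, decontaminated))
```

A lower value for the decontaminated sample than for the contaminated one indicates that decontamination moved the composition closer to the ground truth. Note that some libraries report the Jensen-Shannon distance (the square root of the divergence) instead.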
Extended Data Fig. 3 SCRuB outperforms alternative decontamination methods under in silico simulations of diverse environments and data types.
a-l, Same as Fig. 1c, d, but for simulations based on data from 16S amplicon sequencing of tropical marine sediments (Qiita41 study ID 11922; a,b); 16S amplicon sequencing of multiple body sites from southern California fish42 (c,d); 16S amplicon sequencing of soil from the Earth Microbiome Project43 (e,f); ITS sequencing of office samples44 (g,h); 18S amplicon sequencing of soil from Central Park, New York45 (i,j); and human gut metagenomic sequencing46 (k,l). N = 120 simulations per panel. Across almost all simulation scenarios and environments, SCRuB outperforms alternative decontamination approaches. Contamination levels were fixed to 5% for the simulations in panels b, d, f, h, j, and l. Box line, median; box, IQR; whiskers, 1.5*IQR; *, one-sided Wilcoxon signed-rank P < 10^−4 for comparison between SCRuB and marked method (see Supplementary Table 1 for exact P values).
Extended Data Fig. 4 SCRuB is robust to evaluation metrics and simulation parameters.
a-d, Same as Fig. 1c, d, box and swarm plot (line, median; box, IQR; whiskers, 1.5*IQR) showing the mean (a,b) and standard deviation (c,d) of the Jensen-Shannon divergence (JSD) between the ground truth of each experiment and its decontamination output. SCRuB performs similarly when evaluated using mean JSD, and displays stable standard deviation. e,f, Same as Fig. 1c, d, but with controls placed along the edge of a plate rather than randomly. Similar to Fig. 1c, d, SCRuB outperforms alternative methods under all parameters except no decontamination and microDecon with 50% well-to-well leakage levels. g, Shown are the results from Fig. 1d with well-to-well leakage levels of 5%, stratified by the number of controls (N = 10 experiments per set). SCRuB outperforms alternative decontamination methods regardless of the number of controls (one-sided Wilcoxon signed-rank P < 10^−3 for all, P = 0.0029 vs. microDecon with one control). h, Same as Fig. 1d, showing also results from SCRuB running without sample location, and thus without accounting for well-to-well leakage. While SCRuB outperforms SCRuB without sample locations in all simulations (P < 10^−4 for all), SCRuB without sample locations still outperforms alternative decontamination methods in many settings. *, one-sided Wilcoxon signed-rank P < 10^−3 (panel g) or P < 10^−4 (otherwise) for comparison between SCRuB (panels a-g) or SCRuB without sample locations (panel h) and the marked method (see Supplementary Table 1 for exact P values). * is on the bottom if the marked method has better performance.
Extended Data Fig. 5 SCRuB is robust to sequencing depth.
Shown are results from in silico simulations under our model (Methods). a, Comparison between experiments in which the read counts of all samples were set to either 1,000, 5,000, 10,000, or 25,000 reads, under contamination and well-to-well leakage levels of 5%. With the exception of the depth of 1,000 reads, SCRuB outperformed the alternative methods in all simulations (one-sided Wilcoxon signed-rank P < 10^−3 for all). At a depth of 1,000 reads, SCRuB had comparable performance to decontam (P = 0.19), and significantly outperformed the rest (P < 0.01 for all). b, For each experiment, the mean read depth was set to 10,000, the standard deviation to 2,500, and the contamination and well-to-well leakage levels to 5%. We divided the samples from each experiment into four groups, Q1-Q4, based on the within-experiment quantile to which the read depth of each sample belonged. Within all groups, SCRuB outperformed alternative decontamination methods (P < 10^−3 for all), demonstrating that SCRuB has consistent performance within an experiment with varying read depths. c, Results from experiments with a mean read depth of 10,000, standard deviation of 0, 500, 2,500 or 7,500, and contamination and well-to-well leakage levels of 5%. Across all standard deviations, SCRuB outperformed competing methods, demonstrating that it is robust to variability in read coverage across experiments. Box line, median; box, IQR; box whiskers, 1.5*IQR; *, one-sided Wilcoxon signed-rank P < 0.01 for comparison between SCRuB and marked method (see Supplementary Table 1 for exact P values).
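The stratification in panel b can be sketched as a rank-based quartile assignment within one experiment. This is an illustrative assumption about how such groups might be formed, not the authors' code: the function name, the tie handling (ties keep input order) and the sample depths are all hypothetical.

```python
def quartile_groups(depths):
    """Map each sample to 'Q1'..'Q4' by the rank of its read depth.

    depths: dict of sample name -> read depth for a single experiment.
    """
    ordered = sorted(depths, key=depths.get)  # sample names, shallowest first
    n = len(ordered)
    return {
        sample: f"Q{1 + (4 * rank) // n}"
        for rank, sample in enumerate(ordered)
    }


# Hypothetical read depths for eight samples from one simulated experiment.
depths = {
    "s1": 7200, "s2": 12900, "s3": 9800, "s4": 10400,
    "s5": 6100, "s6": 11500, "s7": 8800, "s8": 14300,
}
groups = quartile_groups(depths)
print(groups)
```

Performance metrics can then be summarized separately within each group to check whether a method behaves consistently across shallow and deep samples.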
Extended Data Fig. 6 SCRuB correctly handles unrelated controls.
a, Venn diagram illustrating the taxa removed by each decontamination method, where a removed taxon is defined as one with an aggregate sum greater than zero in the observed data and an aggregate sum of zero in the decontaminated data. When presented with unrelated controls, SCRuB removed far fewer taxa than microDecon and either version of decontam, and the majority of taxa removed by SCRuB were also removed by microDecon and decontam (LB). b, Box and swarm plots (line, median; box, IQR; whiskers, 1.5*IQR) showing the median Jensen-Shannon divergence per simulation between simulated samples before and after decontamination with an unrelated control (Methods), across 50 simulated datasets of 88 samples and 8 negative controls. SCRuB is robust to non-informative controls, producing taxonomic compositions that are very close to the original, and significantly closer than alternative methods (one-sided Wilcoxon signed-rank P = 4×10^−10, P = 8.8×10^−10 and P = 3.8×10^−10 between SCRuB and microDecon, decontam or decontam (LB), respectively).
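The definition of a removed taxon in panel a translates directly into a small check. The sketch below is an illustration under assumed data structures (dictionaries mapping a taxon name to its per-sample counts); the function name and the example tables are hypothetical and are not taken from the study.

```python
def removed_taxa(observed, decontaminated):
    """Return taxa with a positive total in `observed` and a zero total after decontamination.

    Both arguments map taxon name -> list of per-sample counts.
    A taxon missing from `decontaminated` is treated as having a total of zero.
    """
    return {
        taxon
        for taxon, counts in observed.items()
        if sum(counts) > 0 and sum(decontaminated.get(taxon, [])) == 0
    }


# Hypothetical counts for three taxa across three samples.
observed = {
    "taxon_A": [10, 0, 5],
    "taxon_B": [3, 2, 0],
    "taxon_C": [0, 0, 0],   # never observed, so it cannot count as removed
}
decontaminated = {
    "taxon_A": [9, 0, 5],   # partially reduced, not removed
    "taxon_B": [0, 0, 0],   # completely removed
    "taxon_C": [0, 0, 0],
}

removed = removed_taxa(observed, decontaminated)
print(removed)
```

Computing this set once per method gives the inputs for a Venn diagram of the kind described in the caption.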
Extended Data Fig. 7 SCRuB correctly accounts for well-to-well leakage.
a, Similar to Fig. 2f, showing the Jensen-Shannon divergence (y-axis) between the ground truth taxonomic composition, as defined by the experimental design of Minich et al.32 (Methods), and the taxonomic composition of the unprocessed dataset (‘No decontamination’), or the dataset following decontamination by various methods (x-axis), and displayed separately for the 31 distinct low-prevalence (left) and 90 high-prevalence (right) monocultures. For low-prevalence samples, SCRuB produced estimates that were significantly more similar to the ground truth compared to microDecon, decontam, decontam (LB), and to a restrictive approach (one-sided Wilcoxon P < 10^−4 in all cases). For the high-prevalence samples, SCRuB performed comparably to decontam and microDecon (P = 0.93, P = 0.12, respectively) and outperformed no decontamination, restrictive, and decontam (LB) (P = 10^−8, P = 8.7×10^−17 and P = 1.3×10^−4, respectively). b-f, A simulation of a more complicated well-to-well leakage experiment, in which each taxon was placed in two monocultures instead of one. To simulate such a scenario, we randomly chose pairs of taxa, and then reassigned all reads assigned to one taxon across the experiment to the other, ‘focal’, taxon. For example, Minich et al. placed E. coli in well C10 (c), resulting in well-to-well leakage (d). We randomly selected well C3, containing a Corynebacterium species, and reassigned all Corynebacterium reads to E. coli (e). We then ran SCRuB on this simulated data, and evaluated the relative abundance of E. coli in its original well (b, f). We performed this 100 times, and examined the relative abundance of the focal taxon in its original well (b). In all cases, SCRuB accurately handled well-to-well leakage in this more complex scenario and avoided removing the taxon belonging to the focal monoculture.
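The read-reassignment step used in panels b-f can be sketched as follows. This is a hedged illustration rather than the authors' simulation code: the plate is assumed to be a dictionary mapping a well to per-taxon read counts, the function name is hypothetical, and the read counts are invented. Only the well names and taxa echo the example in the caption.

```python
def merge_into_focal(table, source_taxon, focal_taxon):
    """Reassign every read of `source_taxon` to `focal_taxon` in all wells.

    table: dict of well -> dict of taxon -> read count.
    Returns a new table and leaves the input unchanged.
    """
    merged = {}
    for well, counts in table.items():
        new_counts = dict(counts)                 # copy so the input is not mutated
        moved = new_counts.pop(source_taxon, 0)   # reads to reassign (0 if absent)
        new_counts[focal_taxon] = new_counts.get(focal_taxon, 0) + moved
        merged[well] = new_counts
    return merged


# Invented read counts: two monoculture wells with some leakage, plus a third well.
plate = {
    "C10": {"E. coli": 950, "Corynebacterium": 50},
    "C3": {"Corynebacterium": 900, "E. coli": 100},
    "D5": {"Staphylococcus": 1000},
}

simulated = merge_into_focal(plate, "Corynebacterium", "E. coli")
print(simulated["C3"])
```

After the merge, the focal taxon appears as a monoculture in two wells, which is the harder scenario the caption describes; the total read count of each well is unchanged.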
Extended Data Fig. 8 SCRuB correctly infers well-to-well leakage into negative controls in a metagenomic study of infant and maternal microbiomes.
a, The plate design used by Lou et al.33,39, which included a negative control placed in the corner of each extraction plate. Through a strain-level analysis, Lou et al. identified well-to-well leakage into certain negative controls. b, When running SCRuB on each plate, using the MAG abundances of each sample (Methods), we identified well-to-well leakage into the negative control in two of the four plates that included a negative control. c, SCRuB’s predictions of well-to-well leakage were consistent with an assessment based on the results of Lou et al.’s strain-level analysis (Methods).
Extended Data Fig. 9 Well-to-well leakage is more prominent during DNA extraction.
a,b, Plate layout during DNA extraction (a) and library preparation (b) of experiment 2 (Fig. 3a). Ten controls were included in the DNA extraction stage (triangles), and an additional seven in the library preparation stage (hexagons); a pair of each was placed away from other samples (‘far samples’, purple). c, Box and swarm plot (line, median; box, IQR; whiskers, 1.5*IQR) showing the Jensen-Shannon divergence (y-axis) between human-derived samples adjacent to DNA extraction and library preparation controls and the various controls of each processing stage, stratified by near and far controls (far controls in purple in a,b), and calculated from ‘raw’ taxonomic compositions, without any decontamination. Samples are more similar to near than far controls, demonstrating well-to-well leakage occurring during both DNA extraction and library preparation. Samples are also more similar to near extraction controls than to near library controls, suggesting that well-to-well leakage is more prominent during DNA extraction. P, two-sided Mann-Whitney U; N, number of pairwise distances between relevant samples.
Extended Data Fig. 10 SCRuB improves prediction of melanoma and treatment response.
a-f, Receiver operating characteristic (ROC) curves evaluating the pairwise classification accuracy of gradient boosted decision trees on data from patients with lung cancer, prostate cancer, melanoma, and controls, using data from Poore et al.20 Compared to alternative decontamination methods, SCRuB offers classification accuracy that is on par or improved, and improved accuracy compared to the original analyses in all cases. See Supplementary Table 1 for P values comparing between methods. Shaded area, 95% confidence interval. g, A Venn diagram enumerating the number of taxa completely removed by each decontamination method applied to the tumor microbiome data from Nejman et al.18 SCRuB removed fewer taxa than alternative methods.
Supplementary information
Supplementary Information
Supplementary Note.
Supplementary Tables
Supplementary Table 1: Exact P values displayed in figures. Supplementary Table 2: Experimental metadata and plate layouts of experiments performed. Refers to experiments described in Fig. 3. Supplementary Table 3: V1–V2 reads in control samples. The number of reads from the V1–V2 regions found in each of the samples from the experiments with human-derived samples (Fig. 3a; Methods). Samples with NA had no reads following DADA2 processing.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Austin, G.I., Park, H., Meydan, Y. et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nat Biotechnol 41, 1820–1828 (2023). https://doi.org/10.1038/s41587-023-01696-w
DOI: https://doi.org/10.1038/s41587-023-01696-w