Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data

Austin, George I.; Park, Heekuk; Meydan, Yoli; Seeram, Dwayne; Sezin, Tanya; Lou, Yue Clare; Firek, Brian A.; Morowitz, Michael J.; Banfield, Jillian F.; Christiano, Angela M.; Pe’er, Itsik; Uhlemann, Anne-Catrin; Shenhav, Liat; Korem, Tal

doi:10.1038/s41587-023-01696-w

Article
Published: 16 March 2023

Nature Biotechnology volume 41, pages 1820–1828 (2023)

Abstract

Sequencing-based approaches for the analysis of microbial communities are susceptible to contamination, which could mask biological signals or generate artifactual ones. Methods for in silico decontamination using controls are routinely used, but do not make optimal use of information shared across samples and cannot handle taxa that only partially originate in contamination or leakage of biological material into controls. Here we present Source tracking for Contamination Removal in microBiomes (SCRuB), a probabilistic in silico decontamination method that incorporates shared information across multiple samples and controls to precisely identify and remove contamination. We validate the accuracy of SCRuB in multiple data-driven simulations and experiments, including induced contamination, and demonstrate that it outperforms state-of-the-art methods by an average of 15–20 times. We showcase the robustness of SCRuB across multiple ecosystems, data types and sequencing depths. Demonstrating its applicability to microbiome research, SCRuB facilitates improved predictions of host phenotypes, most notably the prediction of treatment response in melanoma patients using decontaminated tumor microbiome data.

**Fig. 1: SCRuB demonstrates superior decontamination in simulated benchmarks.**

**Fig. 2: SCRuB correctly accounts for well-to-well leakage.**

**Fig. 3: SCRuB outperforms alternative decontamination methods in a benchmark with human-derived samples.**

**Fig. 4: SCRuB improves the prediction of melanoma and treatment response.**

Data availability

Sequencing data from our experiments, along with all relevant metadata, was uploaded to SRA, accession PRJNA905430 (ref. ⁵⁵). All other datasets analyzed in this study are publicly available. The college dormitory dataset²⁵ used in Fig. 1 and Extended Data Figs. 3–5 is available from the European Nucleotide Archive (ENA), accession ERP115809, and Qiita⁴¹, study ID 12470. The marine sediments dataset, used in Extended Data Fig. 3a,b, is available from Qiita⁴¹, study ID 11922. The fish microbiome dataset⁴², used in Extended Data Fig. 3c,d, is available from ENA, accession PRJEB54736, and Qiita⁴¹, study ID 13414. The Earth Microbiome Project soil dataset⁴³, used in Extended Data Fig. 3e,f, is available from ENA, accession PRJEB42019, and Qiita⁴¹, study ID 13114. The office dataset⁴⁴, used in Extended Data Fig. 3g,h, is available from ENA, accession PRJEB13115, and Qiita⁴¹, study ID 10423. The Central Park soil dataset⁴⁵, used in Extended Data Fig. 3i,j, is available from ENA, accession PRJEB6614, and Qiita⁴¹, study ID 2104. The gut metagenomic dataset⁴⁶, used in Extended Data Fig. 3k,l, is available from ENA, accession PRJEB50408, and Qiita⁴¹, study ID 13692. The negative controls dataset used in Fig. 1 and Extended Data Figs. 3a–f, 4 and 5 is available from Qiita⁴¹, study ID 12019; the one used in Extended Data Fig. 3g,h,k,l is available from ENA, accession PRJEB40903, and Qiita⁴¹, study ID 12201; and the one used in Extended Data Fig. 3i,j is available from ENA, accession PRJEB25617, and Qiita⁴¹, study ID 10333. The well-to-well leakage dataset³² is available from ENA, accession ERP115213. The plasma cfDNA data²⁰ is available from ENA, accessions ERP119598, ERP119596 and ERP119597, and from Qiita⁴¹, study IDs 12667, 12691 and 12692. The tumor microbiome dataset¹⁸ is available from SRA, accession PRJNA624822. The processed data was obtained from Supplementary Table 2 in ref. ¹⁸.

Code availability

SCRuB is available at https://github.com/Shenhav-and-Korem-labs/SCRuB⁵⁶ and requires R (≥3.6.3), glmnet⁵⁷ (4.1-4) and torch (1.3.1). A Code Ocean capsule replicating all analyses in this paper is available at https://codeocean.com/capsule/5737862/tree/v1 (ref. ⁵⁸), with source code also available at https://github.com/Shenhav-and-Korem-labs/SCRuB_analysis. Both use tidyverse⁵⁹ (0.7.2) and XGBoost⁶⁰ (1.5.0). The decontamination pipeline used by Nejman et al.¹⁸ is available from Zenodo at https://doi.org/10.5281/zenodo.3740536, and the prediction pipeline used by Poore et al.²⁰ is available at https://github.com/biocore/tcga.

References

1. Salter, S. J. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).
2. Weyrich, L. S. et al. Laboratory contamination over time during low-biomass sample analysis. Mol. Ecol. Resour. 19, 982–996 (2019).
3. Kim, D. et al. Optimizing methods and dodging pitfalls in microbiome research. Microbiome 5, 52 (2017).
4. Eisenhofer, R. et al. Contamination in low microbial biomass microbiome studies: issues and recommendations. Trends Microbiol. 27, 105–117 (2019).
5. Weiss, S. et al. Tracking down the sources of experimental contamination in microbiome studies. Genome Biol. 15, 564 (2014).
6. Aagaard, K. et al. The placenta harbors a unique microbiome. Sci. Transl. Med. 6, 237ra65 (2014).
7. Parnell, L. A. et al. Microbial communities in placentas from term normal pregnancy exhibit spatially variable profiles. Sci. Rep. 7, 11200 (2017).
8. Seferovic, M. D. et al. Visualization of microbes by 16S in situ hybridization in term and preterm placentas without intraamniotic infection. Am. J. Obstet. Gynecol. 221, 146.e1–146.e23 (2019).
9. de Goffau, M. C. et al. Human placenta has no microbiome but can contain potential pathogens. Nature 572, 329–334 (2019).
10. Leiby, J. S. et al. Lack of detection of a human placenta microbiome in samples from preterm and term deliveries. Microbiome 6, 196 (2018).
11. Kuperman, A. A. et al. Deep microbial analysis of multiple placentas shows no evidence for a placental microbiome. BJOG 127, 159–169 (2020).
12. Sinha, R., Abnet, C. C., White, O., Knight, R. & Huttenhower, C. The microbiome quality control project: baseline study design and future directions. Genome Biol. 16, 276 (2015).
13. Edmonds, K. & Williams, L. The role of the negative control in microbiome analyses. FASEB J. 31, 940.3 (2017).
14. Schierwagen, R. et al. Trust is good, control is better: technical considerations in blood microbiome analysis. Gut 69, 1362–1363 (2020).
15. de Goffau, M. C. et al. Recognizing the reagent microbiome. Nat. Microbiol. 3, 851–853 (2018).
16. van der Horst, J. et al. Sterile paper points as a bacterial DNA-contamination source in microbiome profiles of clinical samples. J. Dent. 41, 1297–1301 (2013).
17. Olomu, I. N. et al. Elimination of ‘kitome’ and ‘splashome’ contamination results in lack of detection of a unique placental microbiome. BMC Microbiol. 20, 157 (2020).
18. Nejman, D. et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria. Science 368, 973–980 (2020).
19. Pinto-Ribeiro, I. et al. Evaluation of the use of formalin-fixed and paraffin-embedded archive gastric tissues for microbiota characterization using next-generation sequencing. Int. J. Mol. Sci. 21, 1096 (2020).
20. Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574 (2020).
21. Wang, J. et al. Translocation of vaginal microbiota is involved in impairment and protection of uterine health. Nat. Commun. 12, 4191 (2021).
22. Lam, S. Y. et al. Technical challenges regarding the use of formalin-fixed paraffin embedded (FFPE) tissue specimens for the detection of bacterial alterations in colorectal cancer. BMC Microbiol. 21, 297 (2021).
23. Allali, I. et al. Gut microbiome compositional and functional differences between tumor and non-tumor adjacent tissues from cohorts from the US and Spain. Gut Microbes 6, 161–172 (2015).
24. Marotz, C. et al. SARS-CoV-2 detection status associates with bacterial community composition in patients and the hospital environment. Microbiome 9, 132 (2021).
25. Richardson, M., Gottel, N., Gilbert, J. A. & Lax, S. Microbial similarity between students in a common dormitory environment reveals the forensic potential of individual microbial signatures. mBio 10, e01054-19 (2019).
26. Chen, Q.-L. et al. Rare microbial taxa as the major drivers of ecosystem multifunctionality in long-term fertilized soils. Soil Biol. Biochem. 141, 107686 (2020).
27. Smirnova, E., Huzurbazar, S. & Jafari, F. PERFect: PERmutation Filtering test for microbiome data. Biostatistics 20, 615–631 (2019).
28. Davis, N. M., Proctor, D. M., Holmes, S. P., Relman, D. A. & Callahan, B. J. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6, 226 (2018).
29. McKnight, D. T. et al. microDecon: a highly accurate read-subtraction tool for the post-sequencing removal of contamination in metabarcoding studies. Environ. DNA 1, 14–25 (2019).
30. Shenhav, L. et al. FEAST: fast expectation-maximization for microbial source tracking. Nat. Methods 16, 627–632 (2019).
31. Knights, D. et al. Bayesian community-wide culture-independent microbial source tracking. Nat. Methods 8, 761–763 (2011).
32. Minich, J. J. et al. Quantifying and understanding well-to-well contamination in microbiome research. mSystems 4, e00186-19 (2019).
33. Lou, Y. C. et al. Using strain-resolved analysis to identify contamination in metagenomics data. Preprint at bioRxiv https://doi.org/10.1101/2022.01.16.476537 (2022).
34. An, U. et al. STENSL: Microbial Source Tracking with ENvironment SeLection. mSystems 7, e0099521 (2022).
35. Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).
36. Karstens, L. et al. Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments. mSystems 4, e00290-19 (2019).
37. Flores, R. et al. Collection media and delayed freezing effects on microbial composition of human stool. Microbiome 3, 33 (2015).
38. Adams, R. I., Bateman, A. C., Bik, H. M. & Meadow, J. F. Microbiota of the indoor environment: a meta-analysis. Microbiome 3, 49 (2015).
39. Lou, Y. C. et al. Infant gut strain persistence is associated with maternal origin, phylogeny, and traits including surface adhesion and iron acquisition. Cell Rep. Med. 2, 100393 (2021).
40. Hornung, B. V. H., Zwittink, R. D. & Kuijper, E. J. Issues and current standards of controls in microbiome research. FEMS Microbiol. Ecol. 95, fiz045 (2019).
41. Gonzalez, A. et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat. Methods 15, 796–798 (2018).
42. Minich, J. J. et al. Host biology, ecology and the environment influence microbial biomass and diversity in 101 marine fish species. Nat. Commun. 13, 6978 (2022).
43. Shaffer, J. P. et al. Standardized multi-omics of Earth’s microbiomes reveals microbial and metabolite diversity. Nat. Microbiol. 7, 2128–2150 (2022).
44. Chase, J. et al. Geography and location are the primary drivers of office microbiome composition. mSystems 1, e00022-16 (2016).
45. Ramirez, K. S. et al. Biogeographic patterns in below-ground diversity in New York City’s Central Park are similar to those observed globally. Proc. Biol. Sci. 281, 20141988 (2014).
46. Hanes, D. et al. The gastrointestinal and microbiome impact of a resistant starch blend from potato, banana, and apple fibers: a randomized clinical trial using smart caps. Front. Nutr. 9, 987216 (2022).
47. Shaffer, J. P. et al. A comparison of DNA/RNA extraction protocols for high-throughput sequencing of microbial communities. Biotechniques 70, 149–159 (2021).
48. Ruiz-Calderon, J. F. et al. Walls talk: microbial biogeography of homes spanning urbanization. Sci. Adv. 2, e1501061 (2016).
49. Robin, X. et al. pROC: an open-source package for R and S to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
50. Callahan, B. J. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).
51. Annavajhala, M. K. et al. Oral and gut microbial diversity and immune regulation in patients with HIV on antiretroviral therapy. mSphere 5, e00798-19 (2020).
52. Graspeuntner, S., Loeper, N., Künzel, S., Baines, J. F. & Rupp, J. Selection of validated hypervariable regions is crucial in 16S-based microbiota studies of the female genital tract. Sci. Rep. 8, 9678 (2018).
53. Herlemann, D. P. et al. Transitions in bacterial communities along the 2000 km salinity gradient of the Baltic Sea. ISME J. 5, 1571–1579 (2011).
54. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
55. Austin, G. I. et al. Contamination benchmark using human-derived samples. NCBI https://www.ncbi.nlm.nih.gov/bioproject/PRJNA905430 (2022).
56. Austin, G. I., Shenhav, L. & Korem, T. SCRuB. GitHub https://github.com/Shenhav-and-Korem-labs/SCRuB (2023).
57. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
58. Shenhav, L., Korem, T. & Austin, G. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Code Ocean https://doi.org/10.24433/CO.2307706.v1 (2023).
59. Wickham, H. et al. Welcome to the tidyverse. J. Open Source Softw. 4, 1686 (2019).
60. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).

Acknowledgements

We thank members of the Korem group for useful discussions. We are grateful to G. D. Poore, C. Martino, R. Knight, R. Straussman and I. Livyatan for assistance with analyzing and interpreting data from their studies, and to R. Straussman and I. Livyatan for helpful comments on the paper. In general, we thank all authors and participants involved in the generation of all data used in this study. The study was supported by the Center for Studies in Physics and Biology at Rockefeller University (L.S.), the Program for Mathematical Genomics at Columbia University (T.K.), the CIFAR Azrieli Global Scholarship in the Humans & the Microbiome Program (T.K.), R01HD106017 (T.K.) and R01CA245894 (A.-C.U.).

Author information

These authors contributed equally: Liat Shenhav, Tal Korem.

Authors and Affiliations

Department of Computer Science, Columbia University, New York, NY, USA
George I. Austin & Itsik Pe’er
Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
George I. Austin, Yoli Meydan, Itsik Pe’er & Tal Korem
Division of Infectious Diseases, Columbia University Irving Medical Center, New York, NY, USA
Heekuk Park, Dwayne Seeram & Anne-Catrin Uhlemann
Department of Dermatology, Columbia University Irving Medical Center, New York, NY, USA
Tanya Sezin & Angela M. Christiano
Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
Yue Clare Lou
Department of Surgery, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
Brian A. Firek & Michael J. Morowitz
Department of Earth and Planetary Science, University of California, Berkeley, CA, USA
Jillian F. Banfield
Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA, USA
Jillian F. Banfield
Innovative Genomics Institute, University of California, Berkeley, CA, USA
Jillian F. Banfield
Chan Zuckerberg Biohub, San Francisco, CA, USA
Jillian F. Banfield
Department of Genetics and Development, Columbia University Irving Medical Center, New York, NY, USA
Angela M. Christiano
Data Science Institute, Columbia University, New York, NY, USA
Itsik Pe’er
Center for Studies in Physics and Biology, Rockefeller University, New York, NY, USA
Liat Shenhav
Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
Tal Korem
CIFAR Azrieli Global Scholars program, CIFAR, Toronto, Canada
Tal Korem

Contributions

G.I.A. wrote SCRuB, and designed and conducted all computational analyses. H.K. designed and conducted all experiments. Y.M. assisted with analyses. D.S. contributed to experiments. T.S. collected samples. A.M.C. supervised sample collection. A.-C.U. supervised all experiments. Y.C.L., B.F., M.M. and J.F.B. assisted in obtaining, analyzing and interpreting data from their study. L.S. and T.K. conceived and designed the study, designed analysis, jointly supervised the study and contributed equally to this work. G.I.A., I.P., L.S. and T.K. interpreted the results and wrote the paper.

Corresponding authors

Correspondence to Liat Shenhav or Tal Korem.

Ethics declarations

Competing interests

A.-C.U. has received research funding from Merck that is unrelated to this study. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Empirical validation of the source-tracking assumption in data from Nejman et al.¹⁸.

The source-tracking assumption³⁰,³¹,³⁴ in the context of contamination stipulates that taxa present together in a contamination source will be introduced together to other samples, and in similar proportions as in the contamination source. We demonstrate this empirically using data from Nejman et al.¹⁸. a, The average relative abundance of each ASV (y-axis) across samples from the Netherlands Cancer Institute, plotted against the abundance of the same ASV across negative controls from the same batch (x-axis; ‘No Template Controls’ in Nejman et al.¹⁸), separated into ‘high’ and ‘low’ contamination based on SCRuB’s prediction (contamination parameter p > 0.5 and p ≤ 0.5, respectively). Consistent with the source-tracking assumption, taxa present together in a contamination source are introduced together to the samples, and in similar proportions, resulting in a clear positive correlation between the relative abundance of the taxa that are shared between samples and controls (Pearson R = 0.99, P < 10⁻²⁰ and R = 0.082, P = 0.037 for high and low contamination, respectively). As expected, this correlation varies with respect to SCRuB’s predicted contamination in the samples: samples predicted to have high contamination (blue) have a slope of 0.97, while those predicted to have low contamination have a slope of 0.057. b,c, Same as (a) for samples predicted to have the highest (b) and lowest (c) contamination. Pearson R is displayed for panels with >3 shared taxa. Correlation was very high for highly contaminated samples (Pearson R > 0.9, P < 10⁻⁴ for all).
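The relationship described in this caption can be illustrated with a toy calculation. The sketch below is not SCRuB's model or code; it assumes a simple mixture in which the shared taxa are absent from the true community, so their observed abundance in a sample is just the contamination fraction `p` times their abundance in the control. The control profile and the two values of `p` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def shared_taxa_in_sample(control, p):
    """Toy assumption: shared taxa enter a sample only via contamination,
    in the same proportions as in the control, scaled by the fraction p."""
    return [p * g for g in control]

# Hypothetical relative abundances of four taxa in a negative control.
control = [0.40, 0.30, 0.20, 0.10]

high = shared_taxa_in_sample(control, 0.80)  # heavily contaminated sample
low = shared_taxa_in_sample(control, 0.05)   # lightly contaminated sample

slope_high = ols_slope(control, high)
slope_low = ols_slope(control, low)
r_high = pearson_r(control, high)
```

In this idealized setting the fitted slope equals the contamination fraction, which is one way to read why the slope in the figure differs between the high- and low-contamination groups. Real samples also carry biological signal and noise, so the correlation would not be perfect.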

Extended Data Fig. 2 Description of our simulation framework.

A visualization of the simulation framework used to benchmark different decontamination methods. We implemented our simulation in the three steps outlined: a, We generate a dataset with 88–94 samples, 2, 4 or 8 controls, and a contamination source from an unrelated study, assumed to be biologically distinct from the samples of interest. All samples are then assigned locations across the plate. b, We add well-to-well leakage to the controls, and contamination from the shared source to the samples of interest (Methods). c, We run decontamination using one of several methods (Methods). The decontaminated dataset is evaluated against the ground truth noncontaminated taxonomic compositions using the Jensen-Shannon divergence.
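The evaluation step in (c) can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: the linear mixing in `contaminate`, the base-2 logarithm, and the example profiles are all assumptions made here for concreteness.

```python
import math

def normalize(counts):
    """Convert non-negative counts into relative abundances."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an all-zero profile")
    return [c / total for c in counts]

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 1) between two
    taxonomic profiles defined over the same ordered set of taxa."""
    p, q = normalize(p), normalize(q)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def contaminate(sample, contaminant, level):
    """Assumed mixing rule: a fraction `level` of the observed profile
    comes from the contamination source, the rest from the true sample."""
    s, c = normalize(sample), normalize(contaminant)
    return [(1 - level) * a + level * b for a, b in zip(s, c)]

# Hypothetical ground truth and contamination source over four taxa.
truth = [50, 30, 20, 0]
source = [0, 0, 10, 90]

mild = contaminate(truth, source, 0.05)
heavy = contaminate(truth, source, 0.50)

jsd_mild = jensen_shannon_divergence(truth, mild)
jsd_heavy = jensen_shannon_divergence(truth, heavy)
```

A decontamination method would be scored by computing the same divergence between `truth` and its output; lower values mean the output is closer to the uncontaminated composition.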

Extended Data Fig. 3 SCRuB outperforms alternative decontamination methods under in silico simulations of diverse environments and data types.

a-l, Same as Fig. 1c, d, but for simulations based on data from 16S amplicon sequencing of tropical marine sediments (Qiita⁴¹ study ID 11922; a,b); 16S amplicon sequencing of multiple body sites from southern California fish⁴² (c,d); 16S amplicon sequencing of soil from the Earth Microbiome Project⁴³ (e,f); ITS sequencing of office samples⁴⁴ (g,h); 18S amplicon sequencing of soil from Central Park, New York⁴⁵ (i,j); and human gut metagenomic sequencing⁴⁶ (k,l). N = 120 simulations per panel. Across almost all simulation scenarios and environments SCRuB outperforms alternative decontamination approaches. Contamination levels were fixed to 5% for the simulations in panels b, d, f, h, j, and l. Box line, median; box, IQR; whiskers, 1.5*IQR; *, one-sided Wilcoxon signed-rank P < 10⁻⁴ for comparison between SCRuB and marked method (see Supplementary Table 1 for exact P values).

Extended Data Fig. 4 SCRuB is robust to evaluation metrics and simulation parameters.

a–d, Same as Fig. 1c,d, box and swarm plot (line, median; box, IQR; whiskers, 1.5*IQR) showing the mean (a,b) and standard deviation (c,d) of the Jensen-Shannon divergence (JSD) between the ground truth of each experiment and its decontamination output. SCRuB performs similarly when evaluated using mean JSD, and displays stable standard deviation. e,f, Same as Fig. 1c,d, but with controls placed along the edge of a plate rather than randomly. Similar to Fig. 1c,d, SCRuB outperforms alternative methods under all parameters except no decontamination and microDecon with 50% well-to-well leakage levels. g, Shown are the results from Fig. 1d with well-to-well leakage levels of 5%, stratified by the number of controls (N = 10 experiments per set). SCRuB outperforms alternative decontamination methods regardless of the number of controls (one-sided Wilcoxon signed-rank P < 10⁻³ for all, P = 0.0029 vs. microDecon with one control). h, Same as Fig. 1d, showing also results from SCRuB running without sample location, and thus without accounting for well-to-well leakage. While SCRuB outperforms SCRuB without sample locations in all simulations (P < 10⁻⁴ for all), SCRuB without sample locations still outperforms alternative decontamination methods in many settings. *, one-sided Wilcoxon signed-rank P < 10⁻³ (panel g) or P < 10⁻⁴ (otherwise) for comparison between SCRuB (panels a–g) or SCRuB without sample locations (panel h) and the marked method (see Supplementary Table 1 for exact P values). * is on the bottom if the marked method has better performance.

Extended Data Fig. 5 SCRuB is robust to sequencing depth.

Shown are results from in silico simulations under our model (Methods). a, Comparison between experiments in which the read counts of all samples were set to either 1,000, 5,000, 10,000, or 25,000 reads, under contamination and well-to-well leakage levels of 5%. With the exception of the depth of 1,000 reads, SCRuB outperformed the alternative methods in all simulations (one-sided Wilcoxon signed-rank P < 10⁻³ for all). At a depth of 1,000 reads, SCRuB had comparable performance to decontam (P = 0.19), and significantly outperformed the rest (P < 0.01 for all). b, For each experiment, the mean read depth was set to 10,000, the standard deviation to 2,500, and the contamination and well-to-well leakage levels to 5%. We divided the samples from each experiment into four groups, Q1–Q4, based on the within-experiment quantile to which the read depth of each sample belonged. Within all groups, SCRuB outperformed alternative decontamination methods (P < 10⁻³ for all), demonstrating that SCRuB has consistent performance within an experiment with varying read depths. c, Results from experiments with a mean read depth of 10,000, standard deviation of 0, 500, 2,500 or 7,500, and contamination and well-to-well leakage levels of 5%. Across all standard deviations, SCRuB outperformed competing methods, demonstrating that it is robust to variability in read coverage across experiments. Box line, median; box, IQR; box whiskers, 1.5*IQR; *, one-sided Wilcoxon signed-rank P < 0.01 for comparison between SCRuB and marked method (see Supplementary Table 1 for exact P values).
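The grouping used in panel b can be sketched as follows. This is an illustrative, rank-based way to assign within-experiment quartiles; the caption does not specify how quantile boundaries or ties were handled, so those details, the function name and the example depths are assumptions.

```python
def depth_quartiles(depths):
    """Assign each sample to 'Q1'..'Q4' by the rank of its read depth
    within one experiment (Q1 = shallowest quarter of samples).
    Ties are broken by input order, which is an arbitrary choice."""
    n = len(depths)
    order = sorted(range(n), key=lambda i: depths[i])
    groups = [None] * n
    for rank, i in enumerate(order):
        groups[i] = "Q" + str(rank * 4 // n + 1)
    return groups

# Hypothetical read depths for eight samples from one simulated experiment.
depths = [9000, 12000, 7000, 15000, 10000, 11000, 8000, 13000]
groups = depth_quartiles(depths)
```

Per-sample divergences from the ground truth could then be summarized separately for each group to check whether performance depends on read depth.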

Extended Data Fig. 6 SCRuB correctly handles unrelated controls.

a, Venn diagram illustrating the taxa removed by each decontamination method, where a removed taxon is defined as one with an aggregate sum greater than zero in the observed data and an aggregate sum of zero in the decontaminated data. When presented with unrelated controls, SCRuB removed far fewer taxa than microDecon and either version of decontam, and the majority of taxa removed by SCRuB were also removed by microDecon and decontam (LB). b, Box and swarm plots (line, median; box, IQR; whiskers, 1.5*IQR) showing the median Jensen-Shannon divergence per simulation between simulated samples before and after decontamination with an unrelated control (Methods), across 50 simulated datasets of 88 samples and 8 negative controls. SCRuB is robust to non-informative controls, producing taxonomic compositions that are very close to the original, and significantly closer than alternative methods (one-sided Wilcoxon signed-rank P = 4×10⁻¹⁰, P = 8.8×10⁻¹⁰ and P = 3.8×10⁻¹⁰ between SCRuB and microDecon, decontam or decontam (LB), respectively).
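The definition of a removed taxon in panel a translates directly into code. The sketch below assumes count tables stored as lists of rows (samples) by columns (taxa); the function name and the example tables are hypothetical.

```python
def removed_taxa(observed, decontaminated):
    """Return column indices of taxa whose aggregate count is greater than
    zero in the observed table and exactly zero after decontamination.
    Both tables are lists of rows (samples) with one column per taxon."""
    obs_sums = [sum(col) for col in zip(*observed)]
    dec_sums = [sum(col) for col in zip(*decontaminated)]
    return [
        j for j, (o, d) in enumerate(zip(obs_sums, dec_sums))
        if o > 0 and d == 0
    ]

# Hypothetical tables: 3 samples x 4 taxa.
observed = [
    [10, 0, 3, 0],
    [8, 2, 0, 0],
    [12, 1, 4, 0],
]
decontaminated = [
    [10, 0, 0, 0],
    [8, 1, 0, 0],
    [12, 0, 0, 0],
]

removed = removed_taxa(observed, decontaminated)
```

Taxon 1 is reduced but not removed, taxon 2 is removed entirely, and taxon 3 was never observed, so only index 2 is reported. Comparing such index sets across methods gives the overlaps shown in a Venn diagram.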

Extended Data Fig. 7 SCRuB correctly accounts for well-to-well leakage.

a, Similar to Fig. 2f, showing the Jensen-Shannon divergence (y-axis) between the ground truth taxonomic composition, as defined by the experimental design of Minich et al.³² (Methods), and the taxonomic composition of the unprocessed dataset (‘No decontamination’), or the dataset following decontamination by various methods (x-axis), and displayed separately for the 31 distinct low-prevalence (left) and 90 high-prevalence (right) monocultures. For low-prevalence samples, SCRuB produced estimates that were significantly more similar to the ground truth compared to microDecon, decontam, decontam (LB), and to a restrictive approach (one-sided Wilcoxon P < 10⁻⁴ in all cases). For the high-prevalence samples, SCRuB performed comparably to decontam and microDecon (P = 0.93, P = 0.12, respectively) and outperformed no decontamination, restrictive, and decontam (LB) (P = 10⁻⁸, P = 8.7×10⁻¹⁷ and P = 1.3×10⁻⁴, respectively). b–f, A simulation of a more complicated well-to-well leakage experiment, in which each taxon was placed in two monocultures instead of one. To simulate such a scenario, we randomly chose pairs of taxa, and then reassigned all reads assigned to one taxon across the experiment to the other, ‘focal’, taxon. For example, Minich et al. placed E. coli in well C10 (c), resulting in well-to-well leakage (d). We randomly selected well C3, containing a Corynebacterium species, and reassigned all Corynebacterium reads to E. coli (e). We then ran SCRuB on this simulated data, and evaluated the relative abundance of E. coli in its original well (b, f). We performed this 100 times, and examined the relative abundance of the focal taxon in its original well (b). In all cases, SCRuB accurately handled well-to-well leakage in this more complex scenario and avoided removing the taxon belonging to the focal monoculture.
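The read-reassignment step in panels b–f can be sketched as below. This is an illustration under assumptions, not the authors' code: the data layout (a dictionary of per-well taxon counts), the function name and all read counts are hypothetical; only the well labels and taxon names come from the caption.

```python
def reassign_reads(counts, source_taxon, focal_taxon):
    """Return a copy of `counts` in which every read assigned to
    `source_taxon`, in every well, is reassigned to `focal_taxon`.
    `counts` maps well name -> {taxon name -> read count}."""
    merged = {}
    for well, profile in counts.items():
        new_profile = dict(profile)
        moved = new_profile.pop(source_taxon, 0)
        if moved:
            new_profile[focal_taxon] = new_profile.get(focal_taxon, 0) + moved
        merged[well] = new_profile
    return merged

# Hypothetical counts: two monoculture wells with some leakage, one blank.
counts = {
    "C10": {"E. coli": 9500, "Corynebacterium": 20},
    "C3": {"Corynebacterium": 9000, "E. coli": 15},
    "blank": {"E. coli": 40, "Corynebacterium": 35},
}

simulated = reassign_reads(counts, "Corynebacterium", "E. coli")
```

After reassignment the focal taxon appears as a monoculture in two wells, and total reads per well are unchanged. The simulated table would then be passed to the decontamination method, and the abundance of the focal taxon in its original well (C10 here) checked afterwards.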

Extended Data Fig. 8 SCRuB correctly infers well-to-well leakage into negative controls in a metagenomic study of infant and maternal microbiomes.

a, The plate design used by Lou et al.³³,³⁹, which included a negative control placed in the corner of each extraction plate. Through a strain-level analysis, Lou et al. identified well-to-well leakage into certain negative controls. b, When running SCRuB on each plate, using the MAG abundances of each sample (Methods), we identified well-to-well leakage into the negative control in two of the four plates that included a negative control. c, SCRuB’s predictions of well-to-well leakage were consistent with an assessment based on the results of Lou et al.’s strain-level analysis (Methods).

Extended Data Fig. 9 Well-to-well leakage is more prominent during DNA extraction.

a,b, Plate layout during DNA extraction (a) and library preparation (b) of experiment 2 (Fig. 3a). Ten controls were included in the DNA extraction stage (triangles), and an additional seven in the library preparation stage (hexagons); a pair of each was away from other samples (‘far samples’, purple). c, Box and swarm plot (line, median; box, IQR; whiskers, 1.5*IQR) showing the Jensen-Shannon divergence (y-axis) between human-derived samples adjacent to DNA extraction and library preparation controls and the various controls of each processing stage, stratified by near and far controls (far controls in purple in a,b), and calculated from ‘raw’ taxonomic compositions, without any decontamination. Samples are more similar to near than far controls, demonstrating well-to-well leakage occurring during both DNA extraction and library preparation. Samples are also more similar to near extraction controls than to near library controls, suggesting that well-to-well leakage is more prominent during DNA extraction. P, two-sided Mann-Whitney U; N, number of pairwise distances between relevant samples.

Extended Data Fig. 10 SCRuB improves prediction of melanoma and treatment response.

a–f, Receiver operating characteristic (ROC) curves evaluating the pairwise classification accuracy of gradient boosted decision trees on data from patients with lung cancer, prostate cancer, melanoma, and controls, using data from Poore et al.²⁰. Compared to alternative decontamination methods, SCRuB offers classification accuracy that is on par or improved, and improved accuracy compared to the original analyses in all cases. See Supplementary Table 1 for P values comparing between methods. Shaded area, 95% confidence interval. g, A Venn diagram enumerating the number of taxa completely removed by each decontamination method applied to the tumor microbiome data from Nejman et al.¹⁸. SCRuB removed fewer taxa than alternative methods.

Supplementary information

Supplementary Information

Supplementary Note.

Reporting Summary

Supplementary Tables

Supplementary Table 1: Exact P values displayed in figures.
Supplementary Table 2: Experimental metadata and plate layouts of experiments performed. Refers to experiments described in Fig. 3.
Supplementary Table 3: V1–V2 reads in control samples. The number of reads from the V1–V2 regions found in each of the samples from the experiments with human-derived samples (Fig. 3a; Methods). Samples with NA had no reads following DADA2 processing.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Austin, G.I., Park, H., Meydan, Y. et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nat Biotechnol 41, 1820–1828 (2023). https://doi.org/10.1038/s41587-023-01696-w

Received: 17 May 2022
Accepted: 23 January 2023
Published: 16 March 2023
Issue Date: December 2023
DOI: https://doi.org/10.1038/s41587-023-01696-w
