Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data

Abstract

Sequencing-based approaches for the analysis of microbial communities are susceptible to contamination, which could mask biological signals or generate artifactual ones. Methods for in silico decontamination using controls are routinely used, but do not make optimal use of information shared across samples and cannot handle taxa that only partially originate in contamination or leakage of biological material into controls. Here we present Source tracking for Contamination Removal in microBiomes (SCRuB), a probabilistic in silico decontamination method that incorporates shared information across multiple samples and controls to precisely identify and remove contamination. We validate the accuracy of SCRuB in multiple data-driven simulations and experiments, including induced contamination, and demonstrate that it outperforms state-of-the-art methods by an average of 15–20 times. We showcase the robustness of SCRuB across multiple ecosystems, data types and sequencing depths. Demonstrating its applicability to microbiome research, SCRuB facilitates improved predictions of host phenotypes, most notably the prediction of treatment response in melanoma patients using decontaminated tumor microbiome data.

Fig. 1: SCRuB demonstrates superior decontamination in simulated benchmarks.
Fig. 2: SCRuB correctly accounts for well-to-well leakage.
Fig. 3: SCRuB outperforms alternative decontamination methods in a benchmark with human-derived samples.
Fig. 4: SCRuB improves the prediction of melanoma and treatment response.

Data availability

Sequencing data from our experiments, along with all relevant metadata, were uploaded to SRA, accession PRJNA905430 (ref. 55). All other datasets analyzed in this study are publicly available. The college dormitory dataset25 used in Fig. 1 and Extended Data Figs. 35 is available from the European Nucleotide Archive (ENA), accession ERP115809, and Qiita41, study ID 12470. The marine sediments dataset, used in Extended Data Fig. 3a,b, is available from Qiita41, study ID 11922. The fish microbiome dataset42, used in Extended Data Fig. 3c,d, is available from ENA, accession PRJEB54736, and Qiita41, study ID 13414. The Earth Microbiome Project soil dataset43, used in Extended Data Fig. 3e,f, is available from ENA, accession PRJEB42019, and Qiita41, study ID 13114. The office dataset44, used in Extended Data Fig. 3g,h, is available from ENA, accession PRJEB13115, and Qiita41, study ID 10423. The Central Park soil dataset45, used in Extended Data Fig. 3i,j, is available from ENA, accession PRJEB6614, and Qiita41, study ID 2104. The gut metagenomic dataset46, used in Extended Data Fig. 3k,l, is available from ENA, accession PRJEB50408, and Qiita41, study ID 13692. The negative controls dataset used in Fig. 1 and Extended Data Figs. 3a–f, 4, 5 is available from Qiita41, study ID 12019; the one used in Extended Data Fig. 3g,h,k,l is available from ENA, accession PRJEB40903, and Qiita41, study ID 12201; and the one used in Extended Data Fig. 3i,j is available from ENA, accession PRJEB25617, and Qiita41, study ID 10333. The well-to-well leakage dataset32 is available from ENA, accession ERP115213. The plasma cfDNA data20 are available from ENA, accessions ERP119598, ERP119596 and ERP119597, and Qiita41, study IDs 12667, 12691 and 12692. The tumor microbiome dataset18 is available from SRA, accession PRJNA624822. The processed data were obtained from Supplementary Table 2 in ref. 18.

Code availability

SCRuB is available at https://github.com/Shenhav-and-Korem-labs/SCRuB (ref. 56) and requires R (≥3.6.3), glmnet57 (4.1-4) and torch (1.3.1). A Code Ocean capsule replicating all analyses in this paper is available at https://codeocean.com/capsule/5737862/tree/v1 (ref. 58), with source code also available at https://github.com/Shenhav-and-Korem-labs/SCRuB_analysis. Both use tidyverse59 (0.7.2) and XGBoost60 (1.5.0). The decontamination pipeline used by Nejman et al.18 is available from Zenodo at https://doi.org/10.5281/zenodo.3740536, and the prediction pipeline used by Poore et al.20 is available at https://github.com/biocore/tcga.

References

  1. Salter, S. J. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).

  2. Weyrich, L. S. et al. Laboratory contamination over time during low-biomass sample analysis. Mol. Ecol. Resour. 19, 982–996 (2019).

  3. Kim, D. et al. Optimizing methods and dodging pitfalls in microbiome research. Microbiome 5, 52 (2017).

  4. Eisenhofer, R. et al. Contamination in low microbial biomass microbiome studies: issues and recommendations. Trends Microbiol. 27, 105–117 (2019).

  5. Weiss, S. et al. Tracking down the sources of experimental contamination in microbiome studies. Genome Biol. 15, 564 (2014).

  6. Aagaard, K. et al. The placenta harbors a unique microbiome. Sci. Transl. Med. 6, 237ra65 (2014).

  7. Parnell, L. A. et al. Microbial communities in placentas from term normal pregnancy exhibit spatially variable profiles. Sci. Rep. 7, 11200 (2017).

  8. Seferovic, M. D. et al. Visualization of microbes by 16S in situ hybridization in term and preterm placentas without intraamniotic infection. Am. J. Obstet. Gynecol. 221, 146.e1–146.e23 (2019).

  9. de Goffau, M. C. et al. Human placenta has no microbiome but can contain potential pathogens. Nature 572, 329–334 (2019).

  10. Leiby, J. S. et al. Lack of detection of a human placenta microbiome in samples from preterm and term deliveries. Microbiome 6, 196 (2018).

  11. Kuperman, A. A. et al. Deep microbial analysis of multiple placentas shows no evidence for a placental microbiome. BJOG 127, 159–169 (2020).

  12. Sinha, R., Abnet, C. C., White, O., Knight, R. & Huttenhower, C. The microbiome quality control project: baseline study design and future directions. Genome Biol. 16, 276 (2015).

  13. Edmonds, K. & Williams, L. The role of the negative control in microbiome analyses. FASEB J. 31, 940.3 (2017).

  14. Schierwagen, R. et al. Trust is good, control is better: technical considerations in blood microbiome analysis. Gut 69, 1362–1363 (2020).

  15. de Goffau, M. C. et al. Recognizing the reagent microbiome. Nat. Microbiol. 3, 851–853 (2018).

  16. van der Horst, J. et al. Sterile paper points as a bacterial DNA-contamination source in microbiome profiles of clinical samples. J. Dent. 41, 1297–1301 (2013).

  17. Olomu, I. N. et al. Elimination of ‘kitome’ and ‘splashome’ contamination results in lack of detection of a unique placental microbiome. BMC Microbiol. 20, 157 (2020).

  18. Nejman, D. et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria. Science 368, 973–980 (2020).

  19. Pinto-Ribeiro, I. et al. Evaluation of the use of formalin-fixed and paraffin-embedded archive gastric tissues for microbiota characterization using next-generation sequencing. Int. J. Mol. Sci. 21, 1096 (2020).

  20. Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574 (2020).

  21. Wang, J. et al. Translocation of vaginal microbiota is involved in impairment and protection of uterine health. Nat. Commun. 12, 4191 (2021).

  22. Lam, S. Y. et al. Technical challenges regarding the use of formalin-fixed paraffin embedded (FFPE) tissue specimens for the detection of bacterial alterations in colorectal cancer. BMC Microbiol. 21, 297 (2021).

  23. Allali, I. et al. Gut microbiome compositional and functional differences between tumor and non-tumor adjacent tissues from cohorts from the US and Spain. Gut Microbes 6, 161–172 (2015).

  24. Marotz, C. et al. SARS-CoV-2 detection status associates with bacterial community composition in patients and the hospital environment. Microbiome 9, 132 (2021).

  25. Richardson, M., Gottel, N., Gilbert, J. A. & Lax, S. Microbial similarity between students in a common dormitory environment reveals the forensic potential of individual microbial signatures. mBio 10, e01054-19 (2019).

  26. Chen, Q.-L. et al. Rare microbial taxa as the major drivers of ecosystem multifunctionality in long-term fertilized soils. Soil Biol. Biochem. 141, 107686 (2020).

  27. Smirnova, E., Huzurbazar, S. & Jafari, F. PERFect: PERmutation Filtering test for microbiome data. Biostatistics 20, 615–631 (2019).

  28. Davis, N. M., Proctor, D. M., Holmes, S. P., Relman, D. A. & Callahan, B. J. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6, 226 (2018).

  29. McKnight, D. T. et al. microDecon: a highly accurate read‐subtraction tool for the post‐sequencing removal of contamination in metabarcoding studies. Environ. DNA 1, 14–25 (2019).

  30. Shenhav, L. et al. FEAST: fast expectation-maximization for microbial source tracking. Nat. Methods 16, 627–632 (2019).

  31. Knights, D. et al. Bayesian community-wide culture-independent microbial source tracking. Nat. Methods 8, 761–763 (2011).

  32. Minich, J. J. et al. Quantifying and understanding well-to-well contamination in microbiome research. mSystems 4, e00186-19 (2019).

  33. Lou, Y. C. et al. Using strain-resolved analysis to identify contamination in metagenomics data. Preprint at bioRxiv https://doi.org/10.1101/2022.01.16.476537 (2022).

  34. An, U. et al. STENSL: Microbial Source Tracking with ENvironment SeLection. mSystems 7, e0099521 (2022).

  35. Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).

  36. Karstens, L. et al. Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments. mSystems 4, e00290-19 (2019).

  37. Flores, R. et al. Collection media and delayed freezing effects on microbial composition of human stool. Microbiome 3, 33 (2015).

  38. Adams, R. I., Bateman, A. C., Bik, H. M. & Meadow, J. F. Microbiota of the indoor environment: a meta-analysis. Microbiome 3, 49 (2015).

  39. Lou, Y. C. et al. Infant gut strain persistence is associated with maternal origin, phylogeny, and traits including surface adhesion and iron acquisition. Cell Rep. Med. 2, 100393 (2021).

  40. Hornung, B. V. H., Zwittink, R. D. & Kuijper, E. J. Issues and current standards of controls in microbiome research. FEMS Microbiol. Ecol. 95, fiz045 (2019).

  41. Gonzalez, A. et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat. Methods 15, 796–798 (2018).

  42. Minich, J. J. et al. Host biology, ecology and the environment influence microbial biomass and diversity in 101 marine fish species. Nat. Commun. 13, 6978 (2022).

  43. Shaffer, J. P. et al. Standardized multi-omics of Earth’s microbiomes reveals microbial and metabolite diversity. Nat. Microbiol. 7, 2128–2150 (2022).

  44. Chase, J. et al. Geography and location are the primary drivers of office microbiome composition. mSystems 1, e00022-16 (2016).

  45. Ramirez, K. S. et al. Biogeographic patterns in below-ground diversity in New York City’s Central Park are similar to those observed globally. Proc. Biol. Sci. 281, 20141988 (2014).

  46. Hanes, D. et al. The gastrointestinal and microbiome impact of a resistant starch blend from potato, banana, and apple fibers: a randomized clinical trial using smart caps. Front. Nutr. 9, 987216 (2022).

  47. Shaffer, J. P. et al. A comparison of DNA/RNA extraction protocols for high-throughput sequencing of microbial communities. Biotechniques 70, 149–159 (2021).

  48. Ruiz-Calderon, J. F. et al. Walls talk: microbial biogeography of homes spanning urbanization. Sci. Adv. 2, e1501061 (2016).

  49. Robin, X. et al. pROC: an open-source package for R and S to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).

  50. Callahan, B. J. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).

  51. Annavajhala, M. K. et al. Oral and gut microbial diversity and immune regulation in patients with HIV on antiretroviral therapy. mSphere 5, e00798-19 (2020).

  52. Graspeuntner, S., Loeper, N., Künzel, S., Baines, J. F. & Rupp, J. Selection of validated hypervariable regions is crucial in 16S-based microbiota studies of the female genital tract. Sci. Rep. 8, 9678 (2018).

  53. Herlemann, D. P. et al. Transitions in bacterial communities along the 2000 km salinity gradient of the Baltic Sea. ISME J. 5, 1571–1579 (2011).

  54. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).

  55. Austin, G. I. et al. Contamination benchmark using human-derived samples. NCBI https://www.ncbi.nlm.nih.gov/bioproject/PRJNA905430 (2022).

  56. Austin, G. I., Shenhav, L. & Korem, T. SCRuB. GitHub https://github.com/Shenhav-and-Korem-labs/SCRuB (2023).

  57. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).

  58. Shenhav, L., Korem, T. & Austin, G. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Code Ocean https://doi.org/10.24433/CO.2307706.v1 (2023).

  59. Wickham, H. et al. Welcome to the tidyverse. J. Open Source Softw. 4, 1686 (2019).

  60. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).

Acknowledgements

We thank members of the Korem group for useful discussions. We are grateful to G. D. Poore, C. Martino, R. Knight, R. Straussman and I. Livyatan for assistance with analyzing and interpreting data from their studies, and to R. Straussman and I. Livyatan for helpful comments on the paper. More broadly, we thank all authors and participants involved in generating the data used in this study. The study was supported by the Center for Studies in Physics and Biology at Rockefeller University (L.S.), the Program for Mathematical Genomics at Columbia University (T.K.), the CIFAR Azrieli Global Scholarship in the Humans & the Microbiome Program (T.K.), R01HD106017 (T.K.) and R01CA245894 (A.-C.U.).

Author information

Contributions

G.I.A. wrote SCRuB, and designed and conducted all computational analyses. H.K. designed and conducted all experiments. Y.M. assisted with analyses. D.S. contributed to experiments. T.S. collected samples. A.M.C. supervised sample collection. A.-C.U. supervised all experiments. Y.C.L., B.F., M.M. and J.F.B. assisted in obtaining, analyzing and interpreting data from their study. L.S. and T.K. conceived and designed the study, designed analysis, jointly supervised the study and contributed equally to this work. G.I.A., I.P., L.S. and T.K. interpreted the results and wrote the paper.

Corresponding authors

Correspondence to Liat Shenhav or Tal Korem.

Ethics declarations

Competing interests

A.-C.U. has received research funding from Merck that is unrelated to this study. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Empirical validation of the source-tracking assumption in data from Nejman et al.18.

The source-tracking assumption30,31,34 in the context of contamination stipulates that taxa present together in a contamination source will be introduced together to other samples, and in similar proportions as in the contamination source. We demonstrate this empirically using data from Nejman et al.18. a, The average relative abundance of each ASV (y-axis) across samples from the Netherlands Cancer Institute, plotted against the abundance of the same ASV across negative controls from the same batch (x-axis; ‘No Template Controls’ in Nejman et al.18), separated into ‘high’ and ‘low’ contamination based on SCRuB’s prediction (contamination parameter p > 0.5 and p ≤ 0.5, respectively). Consistent with the source-tracking assumption, taxa present together in a contamination source are introduced together to the samples, and in similar proportions, resulting in a clear positive correlation between the relative abundance of the taxa that are shared between samples and controls (Pearson R = 0.99, P < 10⁻²⁰ and R = 0.082, P = 0.037 for high and low contamination, respectively). As expected, this correlation varies with respect to SCRuB’s predicted contamination in the samples: samples predicted to have high contamination (blue) have a slope of 0.97, while those predicted to have low contamination have a slope of 0.057. b,c, Same as (a) for samples predicted to have the highest (b) and lowest (c) contamination. Pearson R is displayed for panels with >3 shared taxa. Correlation was very high for highly contaminated samples (Pearson R > 0.9, P < 10⁻⁴ for all).
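The check described in panel a can be illustrated with a minimal sketch: for taxa shared between a control profile and an averaged sample profile, compute the Pearson correlation and the least-squares slope. The function names and the toy abundances below are hypothetical and are not taken from the SCRuB code base or from the data of Nejman et al.; the sketch uses only the Python standard library.

```python
import math


def pearson_and_slope(x, y):
    """Return (Pearson r, least-squares slope of y on x) for paired values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy), sxy / sxx


def shared_taxa_fit(control_abund, sample_abund):
    """Correlate abundances over taxa present in both profiles.

    Both arguments map a taxon name to its (average) relative abundance.
    Taxa absent from either profile are ignored, mirroring the restriction
    to taxa shared between samples and controls.
    """
    shared = sorted(
        t for t, a in control_abund.items()
        if a > 0 and sample_abund.get(t, 0) > 0
    )
    x = [control_abund[t] for t in shared]  # x-axis: abundance in controls
    y = [sample_abund[t] for t in shared]   # y-axis: abundance in samples
    return pearson_and_slope(x, y)


if __name__ == "__main__":
    # Toy, made-up profiles: the samples carry the control taxa at 0.9x.
    control = {"A": 0.5, "B": 0.3, "C": 0.2}
    samples = {"A": 0.45, "B": 0.27, "C": 0.18, "D": 0.10}
    print(shared_taxa_fit(control, samples))
```

Under the source-tracking assumption, heavily contaminated samples would be expected to give a correlation and a slope close to 1 in such a fit, whereas lightly contaminated samples would give a slope near 0.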

Extended Data Fig. 2 Description of our simulation framework.

A visualization of the simulation framework used to benchmark different decontamination methods. The simulation proceeds in three steps: a, We generate a dataset with 88–94 samples, 2, 4 or 8 controls, and a contamination source from an unrelated study, assumed to be biologically distinct from the samples of interest. All samples are then assigned locations across the plate. b, We add well-to-well leakage to the controls, and contamination from the shared source to the samples of interest (Methods). c, We run decontamination using one of several methods (Methods). The decontaminated dataset is evaluated against the ground-truth noncontaminated taxonomic compositions using the Jensen-Shannon divergence.
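Steps b and c can be sketched in a few lines, under strong simplifying assumptions: contamination and leakage are modeled as a convex mixture of relative-abundance vectors, and the Jensen-Shannon divergence is computed with base-2 logarithms (so it lies between 0 and 1). The Methods section is not part of this excerpt, so the mixing model, the function names and the toy vectors are illustrative assumptions rather than the authors' implementation.

```python
import math


def mix(profile, source, fraction):
    """Convex mixture: (1 - fraction) * profile + fraction * source."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return [(1.0 - fraction) * p + fraction * s for p, s in zip(profile, source)]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two distributions."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    truth = [0.7, 0.3, 0.0]        # hypothetical ground-truth sample
    contaminant = [0.0, 0.2, 0.8]  # hypothetical contamination source
    neighbor = [0.6, 0.4, 0.0]     # hypothetical sample in an adjacent well

    observed_sample = mix(truth, contaminant, 0.05)      # 5% contamination
    observed_control = mix(contaminant, neighbor, 0.05)  # 5% well-to-well leakage

    # Score an (imaginary) decontamination output against the ground truth.
    print(jensen_shannon_divergence(truth, observed_sample))
    print(jensen_shannon_divergence(contaminant, observed_control))
```

A lower divergence between a method's output and the ground truth indicates better decontamination; here the unprocessed observation serves as the baseline.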

Extended Data Fig. 3 SCRuB outperforms alternative decontamination methods under in silico simulations of diverse environments and data types.

a–l, Same as Fig. 1c, d, but for simulations based on data from 16S amplicon sequencing of tropical marine sediments (Qiita41 study ID 11922; a,b); 16S amplicon sequencing of multiple body sites from southern California fish42 (c,d); 16S amplicon sequencing of soil from the Earth Microbiome Project43 (e,f); ITS sequencing of office samples44 (g,h); 18S amplicon sequencing of soil from Central Park, New York45 (i,j); and human gut metagenomic sequencing46 (k,l). N = 120 simulations per panel. Across almost all simulation scenarios and environments, SCRuB outperforms alternative decontamination approaches. Contamination levels were fixed to 5% for the simulations in panels b, d, f, h, j and l. Box line, median; box, IQR; whiskers, 1.5×IQR; *, one-sided Wilcoxon signed-rank P < 10⁻⁴ for comparison between SCRuB and marked method (see Supplementary Table 1 for exact P values).

Extended Data Fig. 4 SCRuB is robust to evaluation metrics and simulation parameters.

a–d, Same as Fig. 1c, d, box and swarm plot (line, median; box, IQR; whiskers, 1.5×IQR) showing the mean (a,b) and standard deviation (c,d) of the Jensen-Shannon divergence (JSD) between the ground truth of each experiment and its decontamination output. SCRuB performs similarly when evaluated using mean JSD, and displays stable standard deviation. e,f, Same as Fig. 1c, d, but with controls placed along the edge of a plate rather than randomly. Similar to Fig. 1c, d, SCRuB outperforms alternative methods under all parameters except no decontamination and microDecon with 50% well-to-well leakage levels. g, Shown are the results from Fig. 1d with well-to-well leakage levels of 5%, stratified by the number of controls (N = 10 experiments per set). SCRuB outperforms alternative decontamination methods regardless of the number of controls (one-sided Wilcoxon signed-rank P < 10⁻³ for all, P = 0.0029 vs. microDecon with one control). h, Same as Fig. 1d, also showing results from SCRuB run without sample locations, and thus without accounting for well-to-well leakage. While SCRuB outperforms SCRuB without sample locations in all simulations (P < 10⁻⁴ for all), SCRuB without sample locations still outperforms alternative decontamination methods in many settings. *, one-sided Wilcoxon signed-rank P < 10⁻³ (panel g) or P < 10⁻⁴ (otherwise) for comparison between SCRuB (panels a–g) or SCRuB without sample locations (panel h) and the marked method (see Supplementary Table 1 for exact P values). * is on the bottom if the marked method has better performance.

Extended Data Fig. 5 SCRuB is robust to sequencing depth.

Shown are results from in silico simulations under our model (Methods). a, Comparison between experiments in which the read counts of all samples were set to either 1,000, 5,000, 10,000 or 25,000 reads, under contamination and well-to-well leakage levels of 5%. With the exception of the depth of 1,000 reads, SCRuB outperformed the alternative methods in all simulations (one-sided Wilcoxon signed-rank P < 10⁻³ for all). At a depth of 1,000 reads, SCRuB had comparable performance to decontam (P = 0.19), and significantly outperformed the rest (P < 0.01 for all). b, For each experiment, the mean read depth was set to 10,000, the standard deviation to 2,500, and the contamination and well-to-well leakage levels to 5%. We divided the samples from each experiment into four groups, Q1–Q4, based on the within-experiment quantile to which the read depth of each sample belonged. Within all groups, SCRuB outperformed alternative decontamination methods (P < 10⁻³ for all), demonstrating that SCRuB has consistent performance within an experiment with varying read depths. c, Results from experiments with a mean read depth of 10,000, standard deviation of 0, 500, 2,500 or 7,500, and contamination and well-to-well leakage levels of 5%. Across all standard deviations, SCRuB outperformed competing methods, demonstrating that it is robust to variability in read coverage across experiments. Box line, median; box, IQR; box whiskers, 1.5×IQR; *, one-sided Wilcoxon signed-rank P < 0.01 for comparison between SCRuB and marked method (see Supplementary Table 1 for exact P values).
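One plausible way to fix the read depth of a sample, as in panel a, is to subsample its reads without replacement to the target depth. The caption does not state how depths were set, so this is an assumption made purely for illustration; the function name and the toy counts are hypothetical.

```python
import random
from collections import Counter


def subsample_counts(counts, depth, seed=0):
    """Draw `depth` reads without replacement from a taxon -> count table.

    Returns a new taxon -> count dict whose counts sum to `depth`.
    The table is expanded read by read, which is fine for small examples
    but memory-hungry for very deep samples.
    """
    total = sum(counts.values())
    if depth > total:
        raise ValueError("cannot subsample %d reads from %d" % (depth, total))
    rng = random.Random(seed)  # seeded generator for reproducible draws
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(reads, depth)))


if __name__ == "__main__":
    sample = {"taxon_a": 6000, "taxon_b": 3000, "taxon_c": 1000}  # toy counts
    for depth in (1000, 5000, 10000):
        print(depth, subsample_counts(sample, depth, seed=42))
```

Rare taxa are the first to drop out at shallow depths, which is one reason a depth of 1,000 reads can be a harder setting for any decontamination method.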

Extended Data Fig. 6 SCRuB correctly handles unrelated controls.

a, Venn diagram illustrating the taxa removed by each decontamination method, where a removed taxon is defined as one with an aggregate sum greater than zero in the observed data and an aggregate sum of zero in the decontaminated data. When presented with unrelated controls, SCRuB removed far fewer taxa than microDecon and either version of decontam, and the majority of taxa removed by SCRuB were also removed by microDecon and decontam (LB). b, Box and swarm plots (line, median; box, IQR; whiskers, 1.5×IQR) showing the median Jensen-Shannon divergence per simulation between simulated samples before and after decontamination with an unrelated control (Methods), across 50 simulated datasets of 88 samples and 8 negative controls. SCRuB is robust to non-informative controls, producing taxonomic compositions that are very close to the original, and significantly closer than alternative methods (one-sided Wilcoxon signed-rank P = 4×10⁻¹⁰, P = 8.8×10⁻¹⁰ and P = 3.8×10⁻¹⁰ between SCRuB and microDecon, decontam or decontam (LB), respectively).
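The definition of a removed taxon used in panel a translates directly into code. The sketch below assumes count tables stored as nested dictionaries (sample, then taxon, then count); that layout and the names are illustrative choices, not the format used by any of the tools compared.

```python
def aggregate_by_taxon(table):
    """Sum counts per taxon across all samples of a sample -> taxon -> count table."""
    totals = {}
    for counts in table.values():
        for taxon, n in counts.items():
            totals[taxon] = totals.get(taxon, 0) + n
    return totals


def removed_taxa(observed, decontaminated):
    """Taxa with aggregate sum > 0 before and aggregate sum == 0 after decontamination."""
    before = aggregate_by_taxon(observed)
    after = aggregate_by_taxon(decontaminated)
    return {t for t, n in before.items() if n > 0 and after.get(t, 0) == 0}


if __name__ == "__main__":
    # Toy tables with made-up counts.
    observed = {
        "s1": {"taxon_a": 50, "taxon_b": 5, "taxon_c": 0},
        "s2": {"taxon_a": 40, "taxon_b": 3, "taxon_c": 0},
    }
    decontaminated = {
        "s1": {"taxon_a": 48, "taxon_b": 0},
        "s2": {"taxon_a": 40, "taxon_b": 0},
    }
    print(removed_taxa(observed, decontaminated))
```

Sets returned for different methods can then be intersected with the usual set operators to obtain the regions of a Venn diagram.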

Extended Data Fig. 7 SCRuB correctly accounts for well-to-well leakage.

a, Similar to Fig. 2f, showing the Jensen-Shannon divergence (y-axis) between the ground truth taxonomic composition, as defined by the experimental design of Minich et al.32 (Methods), and the taxonomic composition of the unprocessed dataset (‘No decontamination’), or the dataset following decontamination by various methods (x-axis), and displayed separately for the 31 distinct low-prevalence (left) and 90 high-prevalence (right) monocultures. For low-prevalence samples, SCRuB produced estimates that were significantly more similar to the ground truth compared to microDecon, decontam, decontam (LB), and to a restrictive approach (one-sided Wilcoxon P < 10⁻⁴ in all cases). For the high-prevalence samples, SCRuB performed comparably to decontam and microDecon (P = 0.93, P = 0.12, respectively) and outperformed no decontamination, restrictive, and decontam (LB) (P = 10⁻⁸, P = 8.7×10⁻¹⁷ and P = 1.3×10⁻⁴, respectively). b–f, A simulation of a more complicated well-to-well leakage experiment, in which each taxon was placed in two monocultures instead of one. To simulate such a scenario, we randomly chose pairs of taxa, and then reassigned all reads assigned to one taxon across the experiment to the other, ‘focal’, taxon. For example, Minich et al. placed E. coli in well C10 (c), resulting in well-to-well leakage (d). We randomly selected well C3, containing a Corynebacterium species, and reassigned all Corynebacterium reads to E. coli (e). We then ran SCRuB on this simulated data, and evaluated the relative abundance of E. coli in its original well (b, f). We performed this 100 times, and examined the relative abundance of the focal taxon in its original well (b). In all cases, SCRuB accurately handled well-to-well leakage in this more complex scenario and avoided removing the taxon belonging to the focal monoculture.
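The read-reassignment step in panels b–f amounts to merging one taxon's counts into another's in every well. A minimal sketch follows, assuming a well -> taxon -> count dictionary; the data layout, function name and counts are hypothetical, and only the well labels and taxon names come from the example in the caption.

```python
def reassign_taxon(table, donor, focal):
    """Move all reads of `donor` to `focal` in every well.

    `table` maps well -> {taxon: count}. A new table is returned and the
    input is left unmodified; total reads per well are preserved.
    """
    merged = {}
    for well, counts in table.items():
        new_counts = dict(counts)
        moved = new_counts.pop(donor, 0)
        if moved:
            new_counts[focal] = new_counts.get(focal, 0) + moved
        merged[well] = new_counts
    return merged


if __name__ == "__main__":
    # Made-up counts: E. coli monoculture in C10, Corynebacterium in C3,
    # each with a little leakage from the other.
    plate = {
        "C10": {"E. coli": 90, "Corynebacterium": 2},
        "C3": {"Corynebacterium": 80, "E. coli": 5},
    }
    simulated = reassign_taxon(plate, donor="Corynebacterium", focal="E. coli")
    print(simulated)
```

After the merge, the focal taxon appears as the dominant member of two wells, which is the more complex leakage scenario the caption describes.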

Extended Data Fig. 8 SCRuB correctly infers well-to-well leakage into negative controls in a metagenomic study of infant and maternal microbiomes.

a, The plate design used by Lou et al.33,39, which included a negative control placed in the corner of each extraction plate. Through a strain-level analysis, Lou et al. identified well-to-well leakage into certain negative controls. b, When running SCRuB on each plate, using the MAG abundances of each sample (Methods), we identified well-to-well leakage into the negative control in two of the four plates that included a negative control. c, SCRuB’s predictions of well-to-well leakage were consistent with an assessment based on the results of Lou et al.’s strain-level analysis (Methods).

Extended Data Fig. 9 Well-to-well leakage is more prominent during DNA extraction.

a,b, Plate layout during DNA extraction (a) and library preparation (b) of experiment 2 (Fig. 3a). Ten controls were included in the DNA extraction stage (triangles), and an additional seven in the library preparation stage (hexagons); a pair of each was placed away from other samples (‘far samples’, purple). c, Box and swarm plot (line, median; box, IQR; whiskers, 1.5×IQR) showing the Jensen-Shannon divergence (y-axis) between human-derived samples adjacent to DNA extraction and library preparation controls and the various controls of each processing stage, stratified by near and far controls (far controls in purple in a,b), and calculated from ‘raw’ taxonomic compositions, without any decontamination. Samples are more similar to near than far controls, demonstrating well-to-well leakage occurring during both DNA extraction and library preparation. Samples are also more similar to near extraction controls than to near library controls, suggesting that well-to-well leakage is more prominent during DNA extraction. P, two-sided Mann-Whitney U; N, number of pairwise distances between relevant samples.

Extended Data Fig. 10 SCRuB improves prediction of melanoma and treatment response.

a–f, Receiver operating characteristic (ROC) curves evaluating the pairwise classification accuracy of gradient boosted decision trees on data from patients with lung cancer, prostate cancer, melanoma, and controls, using data from Poore et al.20. Compared to alternative decontamination methods, SCRuB offers classification accuracy that is on par or improved, and improved accuracy compared to the original analyses in all cases. See Supplementary Table 1 for P values comparing between methods. Shaded area, 95% confidence interval. g, A Venn diagram enumerating the number of taxa completely removed by each decontamination method applied to the tumor microbiome data from Nejman et al.18. SCRuB removed fewer taxa than alternative methods.
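For readers less familiar with the metric behind panels a–f, the area under a ROC curve can be computed as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below is a generic illustration with made-up labels and scores; it is not the evaluation code of this study, which the reference list suggests relied on pROC (ref. 49).

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for positives and 0 for negatives; `scores` holds the
    classifier's predicted scores. Runs in O(n_pos * n_neg) time.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical predictions for a two-class comparison.
    y_true = [1, 1, 0, 0]
    y_score = [0.9, 0.3, 0.4, 0.1]
    print(roc_auc(y_true, y_score))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.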

Supplementary information

Supplementary Information

Supplementary Note.

Reporting Summary

Supplementary Tables

Supplementary Table 1: Exact P values displayed in figures. Supplementary Table 2: Experimental metadata and plate layouts of experiments performed. Refers to experiments described in Fig. 3. Supplementary Table 3: V1–V2 reads in control samples. The number of reads from the V1–V2 regions found in each of the samples from the experiments with human-derived samples (Fig. 3a; Methods). Samples with NA had no reads following DADA2 processing.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Austin, G.I., Park, H., Meydan, Y. et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nat Biotechnol 41, 1820–1828 (2023). https://doi.org/10.1038/s41587-023-01696-w

  • DOI: https://doi.org/10.1038/s41587-023-01696-w
