Skip to main content

Thank you for visiting You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Colibactin DNA-damage signature indicates mutational impact in colorectal cancer


The mucosal epithelium is a common target of damage by chronic bacterial infections and the accompanying toxins, and most cancers originate from this tissue. We investigated whether colibactin, a potent genotoxin1 associated with certain strains of Escherichia coli2, creates a specific DNA-damage signature in infected human colorectal cells. Notably, the genomic contexts of colibactin-induced DNA double-strand breaks were enriched for an AT-rich hexameric sequence motif, associated with distinct DNA-shape characteristics. A survey of somatic mutations at colibactin target sites of several thousand cancer genomes revealed notable enrichment of this motif in colorectal cancers. Moreover, the exact double-strand-break loci corresponded with mutational hot spots in cancer genomes, reminiscent of a trinucleotide signature previously identified in healthy colorectal epithelial cells3. The present study provides evidence for the etiological role of colibactin in human cancer.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.


All prices are NET prices.

Fig. 1: Colibactin preferentially damages DNA in specific AT-rich motifs.
Fig. 2: Colibactin-specific binding motif corresponds to extreme DNA-shape parameter values and electrostatic potential.
Fig. 3: Several cancers show enrichment of mutations in CDMs.
Fig. 4: Analysis of colibactin-induced DSB break ends and mutational patterns in CDMs indicate damaged positions.

Data availability

Input FASTA files have been submitted to the Gene Expression Omnibus under accession no. GSE145594 and analysis scripts are available from All other data are available in the main text or supplementary materials.


  1. 1.

    Nougayrede, J. P. et al. Escherichia coli induces DNA double-strand breaks in eukaryotic cells. Science 313, 848–851 (2006).

    CAS  PubMed  Google Scholar 

  2. 2.

    Putze, J. et al. Genetic structure and distribution of the colibactin genomic island among members of the family Enterobacteriaceae. Infect. Immun. 77, 4696–4703 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  3. 3.

    Lee-Six, H. et al. The landscape of somatic mutation in normal colorectal epithelial cells. Nature 574, 532–537 (2019).

    CAS  PubMed  Google Scholar 

  4. 4.

    Watanabe, T., Tada, M., Nagai, H., Sasaki, S. & Nakao, M. Helicobacter pylori infection induces gastric cancer in Mongolian gerbils. Gastroenterology 115, 642–648 (1998).

    CAS  PubMed  Google Scholar 

  5. 5.

    Castellarin, M. et al. Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. Genome Res. 22, 299–306 (2012).

    CAS  PubMed  PubMed Central  Google Scholar 

  6. 6.

    Arthur, J. C. et al. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science 338, 120–123 (2012).

    CAS  PubMed  PubMed Central  Google Scholar 

  7. 7.

    Cougnoux, A. et al. Bacterial genotoxin colibactin promotes colon tumour growth by inducing a senescence-associated secretory phenotype. Gut 63, 1932–1942 (2014).

    CAS  PubMed  Google Scholar 

  8. 8.

    Bleich, R. M. & Arthur, J. C. Revealing a microbial carcinogen. Science 363, 689–690 (2019).

    CAS  PubMed  Google Scholar 

  9. 9.

    Hibbing, M. E., Fuqua, C., Parsek, M. R. & Peterson, S. B. Bacterial competition: surviving and thriving in the microbial jungle. Nat. Rev. Microbiol. 8, 15–25 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  10. 10.

    Takahashi, I. et al. Duocarmycin A, a new antitumor antibiotic from Streptomyces. J. Antibiot. (Tokyo) 41, 1915–1917 (1988).

    CAS  Google Scholar 

  11. 11.

    Igarashi, Y. et al. Yatakemycin, a novel antifungal antibiotic produced by Streptomyces sp. TP-A0356. J. Antibiot. (Tokyo) 56, 107–113 (2003).

    CAS  Google Scholar 

  12. 12.

    Arcamone, F., Penco, S., Orezzi, P., Nicolella, V. & Pirelli, A. Structure and synthesis of distamycin A. Nature 203, 1064–1065 (1964).

    CAS  PubMed  Google Scholar 

  13. 13.

    Finlay, A., Hochstein, F., Sobin, B. & Murphy, F. Netropsin, a new antibiotic produced by a Streptomyces. J. Am. Chem. Soc. 73, 341–343 (1951).

    CAS  Google Scholar 

  14. 14.

    Boger, D. L. & Johnson, D. S. CC-1065 and the duocarmycins: unraveling the keys to a new class of naturally derived DNA alkylating agents. Proc. Natl Acad. Sci. USA 92, 3642–3649 (1995).

    CAS  PubMed  Google Scholar 

  15. 15.

    Zha, L. et al. Colibactin assembly line enzymes use S-adenosylmethionine to build a cyclopropane ring. Nat. Chem. Biol. 13, 1063–1065 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  16. 16.

    Healy, A. R., Vizcaino, M. I., Crawford, J. M. & Herzon, S. B. Convergent and modular synthesis of candidate precolibactins. Structural revision of precolibactin A. J. Am. Chem. Soc. 138, 5426–5432 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  17. 17.

    Zha, L., Wilson, M. R., Brotherton, C. A. & Balskus, E. P. Characterization of polyketide synthase machinery from the pks island facilitates isolation of a candidate precolibactin. ACS Chem. Biol. 11, 1287–1295 (2016).

    CAS  PubMed  Google Scholar 

  18. 18.

    Vizcaino, M. I. & Crawford, J. M. The colibactin warhead crosslinks DNA. Nat. Chem. 7, 411–417 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  19. 19.

    Wilson, M. R. et al. The human gut bacterial genotoxin colibactin alkylates DNA. Science 363, eaar7785 (2019).

  20. 20.

    Xue, M. et al. Structure elucidation of colibactin and its DNA cross-links. Science 365, eaax2685 (2019).

  21. 21.

    Xue, M., Wernke, K. & Herzon, S. B. Depurination of colibactin-derived interstrand cross-links. Biochemistry 59, 892–900 (2020).

    CAS  PubMed  Google Scholar 

  22. 22.

    Yan, W. X. et al. BLISS is a versatile and quantitative method for genome-wide profiling of DNA double-strand breaks. Nat. Commun. 8, 15058 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  23. 23.

    Canela, A. et al. Genome organization drives chromosome fragility. Cell 170, 507–521.e518 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  24. 24.

    Yang, F., Kemp, C. J. & Henikoff, S. Anthracyclines induce double-strand DNA breaks at active gene promoters. Mutat. Res. 773, 9–15 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  25. 25.

    Sheffield, N. C. & Bock, C. LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor. Bioinformatics 32, 587–589 (2016).

    CAS  PubMed  Google Scholar 

  26. 26.

    Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 1653–1659 (2011).

    CAS  PubMed  PubMed Central  Google Scholar 

  27. 27.

    Tubbs, A. et al. Dual roles of poly(dA:dT) tracts in replication initiation and fork collapse. Cell 174, 1127–1142.e1119 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  28. 28.

    Nelson, H. C., Finch, J. T., Luisi, B. F. & Klug, A. The structure of an oligo(dA):oligo(dT) tract and its biological implications. Nature 330, 221–226 (1987).

    CAS  PubMed  Google Scholar 

  29. 29.

    Yuan, G. C. et al. Genome-scale identification of nucleosome positions in S. cerevisiae. Science 309, 626–630 (2005).

    CAS  PubMed  Google Scholar 

  30. 30.

    Tse, W. C. & Boger, D. L. Sequence-selective DNA recognition: natural products and nature’s lessons. Chem. Biol. 11, 1607–1617 (2004).

    CAS  PubMed  Google Scholar 

  31. 31.

    Drsata, T. et al. Mechanical properties of symmetric and asymmetric DNA A-tracts: implications for looping and nucleosome positioning. Nucleic Acids Res. 42, 7383–7394 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  32. 32.

    Buc, E. et al. High prevalence of mucosa-associated E. coli producing cyclomodulin and genotoxin in colon cancer. PLoS ONE 8, e56964 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  33. 33.

    Giannakis, M. et al. Genomic correlates of immune-cell Infiltrates in colorectal carcinoma. Cell Rep, 15, 857–865 (2016).

    CAS  Google Scholar 

  34. 34.

    Katainen, R. et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nat. Genet. 47, 818–821 (2015).

    CAS  PubMed  Google Scholar 

  35. 35.

    Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  36. 36.

    Iannelli, F. et al. A damaged genome’s transcriptional landscape through multilayered expression profiling around in situ-mapped DNA double-strand breaks. Nat. Commun. 8, 15656 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  37. 37.

    Iyama, T. & Wilson, D. M. 3rd DNA repair mechanisms in dividing and non-dividing cells. DNA Repair 12, 620–636 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  38. 38.

    Choi, J. Y., Lim, S., Kim, E. J., Jo, A. & Guengerich, F. P. Translesion synthesis across abasic lesions by human B-family and Y-family DNA polymerases alpha, delta, eta, iota, kappa, and REV1. J. Mol. Biol. 404, 34–44 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  39. 39.

    Martin, L. P., Hamilton, T. C. & Schilder, R. J. Platinum resistance: the role of DNA repair pathways. Clin, Cancer Res. 14, 1291–1295 (2008).

    CAS  Google Scholar 

  40. 40.

    Morin, P. J. et al. Activation of beta-catenin-Tcf signaling in colon cancer by mutations in beta-catenin or APC. Science 275, 1787–1790 (1997).

    CAS  PubMed  Google Scholar 

  41. 41.

    Pleguezuelos-Manzano, C. et al. Mutational signature in colorectal cancer caused by genotoxic pks + E. coli. Nature 580, 269–273 (2020).

    CAS  PubMed  Google Scholar 

  42. 42.

    Dziubańska-Kusibab, P. J. et al. Colibactin DNA damage signature indicates causative role in colorectal cancer. Preprint at (2019).

  43. 43.

    Gothe, H. J. et al. Spatial chromosome folding and active transcription drive DNA fragility and formation of oncogenic MLL translocations. Mol. Cell 75, 267–283.e212 (2019).

    CAS  PubMed  Google Scholar 

  44. 44.

    Zhang, F. et al. Breaks labeling in situ and sequencing (BLISS). Protocol Exchange (2017).

  45. 45.

    Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

    PubMed  PubMed Central  Google Scholar 

  46. 46.

    Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  47. 47.

    The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  48. 48.

    Speir, M. L. et al. The UCSC Genome Browser database: 2016 update. Nucleic Acids Res. 44, D717–D725 (2016).

    CAS  PubMed  Google Scholar 

  49. 49.

    Schneider, T. D. & Stephens, R. M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097–6100 (1990).

    CAS  PubMed  PubMed Central  Google Scholar 

  50. 50.

    Bembom, O. seqLogo: sequence logos for DNA sequence alignments. R package version 1.44.0 (2017).

  51. 51.

    Chiu, T. P. et al. DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding. Bioinformatics 32, 1211–1213 (2016).

    CAS  PubMed  Google Scholar 

  52. 52.

    Gelpi, J. L. et al. Classical molecular interaction potentials: improved setup procedure in molecular dynamics simulations of proteins. Proteins 45, 428–437 (2001).

    CAS  PubMed  Google Scholar 

  53. 53.

    Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 25, 1157–1174 (2004).

    CAS  PubMed  Google Scholar 

  54. 54.

    Sousa da Silva, A. W. & Vranken, W. F. ACPYPE: AnteChamber PYthon Parser interfacE. BMC Res. Notes 5, 367 (2012).

    PubMed  PubMed Central  Google Scholar 

  55. 55.

    van Zundert, G. C. P. et al. The HADDOCK2.2 web server: user-friendly integrative modeling of biomolecular complexes. J. Mol. Biol. 428, 720–725 (2016).

    Google Scholar 

  56. 56.

    Dans, P. D. et al. Long-timescale dynamics of the Drew–Dickerson dodecamer. Nucleic Acids Res. 44, 4052–4066 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  57. 57.

    Perez, A., Luque, F. J. & Orozco, M. Dynamics of B-DNA on the microsecond time scale. J. Am. Chem. Soc. 129, 14739–14745 (2007).

    CAS  PubMed  Google Scholar 

  58. 58.

    Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935 (1983).

    CAS  Google Scholar 

  59. 59.

    Berendsen, H. J., Postma, J. V., van Gunsteren, W. F., DiNola, A. & Haak, J. R. Molecular dynamics with coupling to an external bath. J. Chem. Phys. 81, 3684–3690 (1984).

    CAS  Google Scholar 

  60. 60.

    Ryckaert, J.-P., Ciccotti, G. & Berendsen, H. J. Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J. Comput. Phys. 23, 327–341 (1977).

    CAS  Google Scholar 

  61. 61.

    Ivani, I. et al. Parmbsc1: a refined force field for DNA simulations. Nat. Methods 13, 55–58 (2016).

    CAS  PubMed  Google Scholar 

  62. 62.

    Case, D. et al. AMBER 18 (University of California, 2018).

  63. 63.

    Roe, D. R. & Cheatham, T. E. 3rd PTRAJ and CPPTRAJ: software for processing and analysis of molecular dynamics trajectory data. J. Chem. Theory Comput. 9, 3084–3095 (2013).

    CAS  PubMed  Google Scholar 

  64. 64.

    Humphrey, W., Dalke, A. & Schulten, K. VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996). 27-38.

    CAS  PubMed  Google Scholar 

  65. 65.

    Ellrott, K. et al. Scalable open dcience approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 6, 271–281.e277 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  66. 66.

    Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300 (1995).

    Google Scholar 

  67. 67.

    Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  68. 68.

    Katainen, R. et al. Discovery of potential causative mutations in human coding and noncoding genome with the interactive software BasePlayer. Nat. Protoc. 13, 2580–2600 (2018).

    CAS  PubMed  Google Scholar 

  69. 69.

    Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).

  70. 70.

    Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S. & Swanton, C. DeconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biol. 17, 31 (2016).

    PubMed  PubMed Central  Google Scholar 

  71. 71.

    Tamborero, D. et al. Cancer genome interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med. 10, 25 (2018).

    PubMed  PubMed Central  Google Scholar 

  72. 72.

    Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385 e318 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  73. 73.

    Tate, J. G. et al. COSMIC: the catalogue of somatic mutations In cancer. Nucleic Acids Res. 47, D941–d947 (2019).

    CAS  PubMed  Google Scholar 

  74. 74.

    Gaffney, D. J. et al. Controls of nucleosome positioning in the human genome. PLoS Genet. 8, e1003036 (2012).

    CAS  PubMed  PubMed Central  Google Scholar 

  75. 75.

    R Core Team. A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2018).

Download references


The results shown here are in whole or part based on data generated by the TCGA Research Network: We thank S. Garnerone from N.C.’s laboratory for processing raw sBLISS data, P. D. Dans from the Biophysical Chemistry Lab, Department of Biological Science (CENUR Litoral Norte), UdelaR, Universidad de la República, for constructive discussions about the theoretical model of colibactin, U. Dobrindt from University of Münster for providing E. coli strains, and K. Lapid and R. Zietlow for editing the manuscript. We also acknowledge the computational resources provided by the CSC–IT Center for Science, Finland. P.J.D.-K. was supported by the IMPRS-IDI graduate school. B.A.M.B. was supported by a Rubicon fellowship from the Netherlands Organisation for Scientific Research. M.O. is an academic researcher of Institució Catalana de Recerca i Estudis Avancats. The present study was supported by the German Research Foundation (grant no. ME705/18-1) to T.F.M., by the Spanish Ministry of Science (grant nos. BIO2015-64802-R, BFU2014-61670-EXP and BFU2014-52864-R), the Catalan Government (grant no. 2014-SGR), Instituto de Salud Carlos III: Instituto Nacional de Bioinformática (PT 13/000/0030), the Biomolecular and Bioinformatics Resources Platform and the EU Horizon 2020 research and innovation program (grant nos. Elixir-Excelerate 676559 and BioExcel2:823830), and the MINECO Severo Ochoa Award of Excellence to the IRB Barcelona, and by the Ragnar Söderberg Foundation, the Swedish Foundation for Strategic Research (BD15-0095) and the Strategic Research Programme in Cancer (StratCan) at Karolinska Institutet to N.C. L.A.A. was supported by grants from the Academy of Finland (Finnish Center of Excellence no. 312041, Academy Professor grant nos. 319083 and 320149, iCAN Flagship 320185), Jane and Aatos Erkko Foundation, Sigrid Juselius Foundation, Helsinki Institute of Life Science and the Cancer Society of Finland.

Author information




F.B., B.A.M.B., A.I. and R.K. are co-second authors. P.J.D.-K., H.B. and T.F.M. conceived the project and designed experiments. N.C. and B.A.M.B. designed the sBLISS experiments. P.J.D.-K. and B.A.M.B. performed the sBLISS experiments. A.I. performed the E. coli infection and immunostaining experiments. P.J.D.-K. and A.I. prepared the samples for sBLISS. P.J.D.-K. and H.B. performed the bioinformatics analysis. F.B. and M.O. built the theoretical model of colibactin. R.K., T.C. and L.A.A. provided and analyzed the WGS CRC data. P.J.D.-K., H.B. and T.F.M. wrote the manuscript.

Corresponding author

Correspondence to Thomas F. Meyer.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Alison Farrell was the primary editor on this article, and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Overview of the analyses provided in this study.

a, The overall results generated in this work derive from four independent investigations of the colibactin-induced double-strand breaks (DSBs), using sBLISS technology. The resulting data sets were first used to identify the colibactin damage motif (CDM), which turned out to be best described by the consensus hexameric sequence AAWWTT. b, Topological analyses and the assignment of shape characteristics led to a virtual model of colibactin–DNA interaction, which is compatible with a colibactin-induced inter-strand cross-link formation between two adenines at positions 2(+) and 5(-) of the minor groove. c, The detection of the CDM served to identify associated single nucleotide substitutions in large human cancer genome data sets - both exome (TCGA) and whole genome (Finish cohort). The comparison of different cancer entities indicated notable enrichment of CDM-associated mutations in some colorectal cancer. The comparison of colon cancer types indicated an enrichment at distal colon locations. The classification of CDM-associated mutations revealed a strong correlation with known triplet signatures, above all with the signature SBSA, which was previously identified in normal colon epithelia. d, An independent analysis of the exact DSB endpoints led to the identification of putative processing sites following colibactin-induced DNA damage. Accordingly, the processing sites were located downstream (in a 5’>3’ direction) of the putatively-alkylated adenines on opposite strands, generating 2 nt long 5’ overhangs at the broken DNA. Alignment of these break positions with the location of colibactin-specific cancer mutations revealed a striking local correspondence, indicating the occurrence of mutations at distinct sites on the CDM. e, Overall, the sBLISS-derived analyses provide extensive insight into the mutational function of colibactin and reveal strong links to human cancer.

Extended Data Fig. 2 Validation of sBLISS for the detection of colibactin-associated DSB break points.

a, In contrast to infection with pks- E. coli, Caco-2 cells infected with colibactin-producing pks+ E. coli displayed γH2AX expression, megalocytosis and multinucleation. Representative image of two independent experiments. b, sBLISS signal of etoposide-induced DSBs showed increased counts at TSS as compared to untreated control conditions. c, This heatmap indicates the log2 odds ratio of break enrichment in genomic region collections (FDR < 5%) as compared to the rest of the genome: untreated, etoposide-treated, pks+ E. coli-infected and pks- E. coli-infected Caco-2 cells. Rows represent known genomic regions from different genome annotation collections. Etoposide treatment increased DSB enrichment in TSS-associated features, while DSBs in pks+ E. coli-infected cells show a rather weaker enrichment compared to untreated and pks- E. coli-infected cells, indicating a more uniform distribution across the genome.

Extended Data Fig. 3 Enrichment of motifs in the context of DSBs upon infection with pks+ E. coli or etoposide treatment.

a, Using the DREME analysis, we determined sequence motifs in the context of the DSBs in pks+ E. coli- vs. pks- E. coli-infected cells, n=4. The top three motifs identified for each biological replicate are presented. b, Using the DREME analysis, we determined sequence motifs in the context of the DSBs in etoposide treated vs. untreated control cells, n=4. The top ten motifs identified for each biological replicate are presented.

Extended Data Fig. 4 Enrichment of CDMs compared to other AT-rich sequences and in nucleosome-free regions.

a, The enrichment of hexanucleotide sequences in close proximity to DSB positions (±7 nt) upon different treatments. Bars represent the enrichment of DSB-associated hexanucleotide motifs (mean log2 ratios of DSB at each motif comparing two conditions) in pks+ E. coli-infected cells vs. non-infected (NT) cells, n=4 independent experiments. Error bars denote the 95% confidence interval (CI) around the mean log2 ratios. b, The enrichment of pentanucleotide sequences in close proximity to DSB positions (±7 nt) upon different treatments. Bars represent the enrichment of DSB-associated pentanucleotide motifs (mean log2 ratios of DSB at each motif comparing two conditions) in pks+ E. coli-infected vs. pks- E. coli-infected cells (left panel) or in pks- E. coli-infected vs. non-infected (NT) cells (right panel), n=4 independent experiments. Error bars denote the 95% CI around the mean log2 ratios. c, Scaled mean Log2 enrichment values (scaled to 1 for maximum enrichment per replicate) of all nonamers in the context of DSBs detected by sBLISS, n=4 independent experiments. The x-axis represents the number of A/T in a nonamer. Red box plots represent AAWWTT motifs in the nonamer sequence, while turquoise box plots represent all other motifs. Each data point corresponds to one nonamer. d, The distributions of hexanucleotide patterns containing only A/T around centers of nucleosome dyads in the human genome as determined by 80. The values are smoothed proportions of hexanucleotides within ±100 bp next to the nucleosome dyad centers. Dashed vertical lines represent ± 73 bp that indicate the size of the nucleosome-covered DNA. Grey curves represent mean ± 2*SD of all A/T hexanucleotides, except for the AAWWTT motif, AAAAAA, ATATAT, TATATA and their reverse complements, n=33 hexanucleotide motifs. Full curves represent the AAWWTT motifs. Dashed curved represent the outlier motifs AAAAAA, ATATAT, TATATA, showing clearly distinct profiles from all other hexanucleotide profiles.

Extended Data Fig. 5 Topological features of hexameric sequences and the CDMs.

a, The relationship between different hexanucleotide sequences (log2 ratio standardized to 1; mean of 4 independent experiments) and the values of predicted DNA shape parameters. CDMs are noted, demonstrating notable enrichment. For propeller twist (ProT), the mean of the values are computed for the 3rd and 4th nucleotide of the motif; for roll and helical twists (HelT), the mean of the values are computed for base-pair steps 2 to 4 in each hexamer. Error bars denote the 95% CI around mean scaled log2 ratios for each hexanucleotide sequence, n=4. b, The distance between the cyclopropanes (Å) of colibactin by running a molecular dynamics simulation of free colibactin in water, n=1. The horizontal red line identifies the average values of this distance (12.8 Å). c, The correlations of hexanucleotide motif enrichment (mean log2 ratio) with DNA shape parameters (left - 3 columns) and central dinucleotides (rightmost column), depending on the presence and distance of adenines on opposite strands, n=4 independent experiments. The presence and distance of adenines are classified in the following groups (from top to bottom, all sequences in 5’-3’ direction): A-T at 3 nt distance, A-T at 4 nt distance, A-T at 3 and 4 nt distance, T-A at 3 nt distance, T-A at 4 nt distance, T-A at 3 and 4 nt distance, no A-T/T-A at 3 or 4 nt distance. Lines in the left three columns are mean log2 ratios smoothed by local regression, method LOESS (locally estimated scatterplot smoothing).

Extended Data Fig. 6 Colibactin-related mutational characteristics in human cancers (I).

a, Enrichment of SBS mutations at AAATTT/AAAATT CDMs in exome sequences obtained from the TCGA project. Top row: cancer entities that showed enrichment across all subcohorts. Second to fourth row: cancer entities that showed enrichment only for POLE-mutated cases or no enrichment at all. COAD - colon adenocarcinoma, READ - rectal adenocarcinoma, STAD - stomach adenocarcinoma, UCEC - uterine corpus endometroid cancer, BLCA - bladder cancer, CESC - cervix squamous cell carcinoma, HNSC - head and neck squamous cell cancer, LUAD - lung adenocarcinoma, LUSC - lung squamous cell carcinoma, DLBC – diffuse large B-cell lymphoma, LAML – acute myeloid leukemia, SARC - sarcoma. Asterisks - significant difference (two-sided Mann-Whitney-U test, p < 0.05 and FDR < 20%) between CDMs and all WWWWW motifs. Numbers under asterisks – number of mutations overlapping AAWWTT / all mutations, [number of samples per cohort]. Error bars = 0 ± 2MAD intervals of mutation rates (mutations/base-pair) of WWWWWW sequences, excluding CDMs, after subtracting their mean. Dots - mutation rates for AAATTT/AAAATT after mean subtraction. Crosses - means of the CDMs. Q1-Q4 cohorts defined by the quantiles of total SNV numbers. Outliers - samples with SNV numbers significantly larger than the 95% CI in each cancer entity. POLE - polymerase epsilon mutated cases.

Extended Data Fig. 7 Cross-contamination of cancer mutation enrichments at AAWWTT sequences.

a, A heat map of SBS pentanucleotide change patterns in colorectal cancer (CRC) genomes based on the PCAWG project, n=60. The results indicate that the established hypermutator-associated SBS signature SBS28 strongly affects one AAWWTT-associated pentanucleotide sequence (red arrow). SBS28 possibly biases AAWWTT mutation estimates (leftmost annotation column for AAWWTT-derived pentanucleotide signature) in POLE/POLD1-mutated CRCs. b, A heat map of SBS pentanucleotide change patterns in stomach adenocarcinoma (STAD) genomes based on the PCAWG project, n=75. The results indicate that the established SBS signature SBS17b strongly affects two AAWWTT-associated pentanucleotide sequences (red arrows). SBS17b possibly biases AAWWTT mutation estimates (leftmost annotation column for AAWWTT-derived pentanucleotide signature) in STADs. The heatmaps show relative proportions of pentanucleotide changes (n=133, rows) associated with AAWWTT-related sequences or SBS28/SBS17b trinucleotide changes. The rows are ordered by the strength of expected contribution from SBS28 (CRC, A) or SBS17b (STAD, B) signatures, followed by AAWWTT contributions. Expected contributions from each signature to pentanucleotide changes are shown as color code on the top of the heatmap for each row. The left part of the heatmap represents pentanucleotide changes with contributions greater than zero in SBS28 or SBS17b (A or B, respectively), whereas the right part represents pentanucleotide changes with expected contributions of zero. Estimated contributions of SBS28, SBS17b and SBSA per tumor sample that are greater than zero are shown as a color code on the left of the heatmap. Black marks on the left mark cases with protein-coding gene-changing mutations in POLE or POLD1. The weighted sum of AAWWTT-related mutation proportions are shown as a color code in the leftmost annotation column. Red arrows below the heatmap denote AWWTT-associated pentanucleotide changes with high contributions from SBS28 or SBS17b.

Extended Data Fig. 8 Colibactin-related mutational characteristics in human cancers (II).

a, The analysis of SBS mutations at colibactin-specific hexanucleotide motifs, AAAATT and AAATTT, in whole genome somatic mutation data obtained from (n=200 CRC) 70. Top - differences in log2(mutations/bp covered by motif) between colibactin-specific and all other WWWWW motifs. Middle - total mutation count at CDMs. Bottom – proportions of total mutations that overlap with CDMs in mutated cases of MSS, MSI, and POLE. Asterisks denote significant difference (two-sided Mann-Whitney-U test, p < 0.05 and FDR < 20%) between CDMs and all other motifs with the same A:T content and length (AAAATTT/AAATTT vs. WWWWWW motifs). Error bars describe the ± 2MAD intervals for mutation rate (mutations/bp covered by motif) of WWWWWW motifs, excluding CDMs after subtracting their mean. Dots represent the mutation rates for the two colibactin-specific hexanucleotides after subtracting the mean of the WWWWWW motifs. Crosses represent the mean of the CDMs. POLE -polymerase epsilon mutated cases. b, Left panel: the proportions of cancer samples with protein-changing mutations among entity-specific or pan-cancer driver genes across several cancer entities in the TCGA data set. Right panel: the number of unique mutations (that is, unique site and nucleotide change) per cancer site divided by the total sample number of mutations in the entity. For both panels, error bars denote 95% CI around this proportion per cancer entity or cancer site. Numbers on the top of the bars indicate total samples per entity or cancer site. Blue bars – the proportion of mutations at positions 2 or 5 of AAWWTT, green bars – the proportion of mutations at other positions in AAWWTT, red bars – the proportion of all other protein-coding gene-changing mutations overlapping the less stringent pentanucleotide motifs, AAATT/AATTT or AAAAT/ATTTT.

Extended Data Fig. 9 Schematic representation of sBLISS-derived double-strand end configurations.

a, Blunting of original break ends by T4 polymerase preserves the original 5’ ends, which can then be determined by sequencing using the sBLISS protocol. Mapping of sBLISS reads to the genome allows us to derive strand-specific break configurations across all genomic sites, scanning the same motif. Frequencies of break ends depend on the break end configuration and differ for blunt ends or breaks with 3’ and 5’ overhangs. b, Validation of the break end analysis in a cell line with an inducible AsiSI restriction enzyme 48. The identified break positions correspond to the known break points of this restriction enzyme 82.

Supplementary information

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Dziubańska-Kusibab, P.J., Berger, H., Battistini, F. et al. Colibactin DNA-damage signature indicates mutational impact in colorectal cancer. Nat Med 26, 1063–1069 (2020).

Download citation

Further reading


Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing