A pan-cancer landscape of somatic mutations in non-unique regions of the human genome

Abstract

A substantial fraction of the human genome displays high sequence similarity with at least one other genomic sequence, posing a challenge for the identification of somatic mutations from short-read sequencing data. Here we annotate genomic variants in 2,658 cancers from the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort with links to similar sites across the human genome. We train a machine learning model to use signals distributed over multiple genomic sites to call somatic events in non-unique regions and validate the data against linked-read sequencing in an independent dataset. Using this approach, we uncover previously hidden mutations in ~1,700 coding sequences and in thousands of regulatory elements, including in known cancer genes, immunoglobulins and highly mutated gene families. Mutations in non-unique regions are consistent with mutations in unique regions in terms of mutation burden and substitution profiles. The analysis provides a systematic summary of the mutation events in non-unique regions at a genome-wide scale across multiple human cancers.
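The idea of using signals distributed over multiple genomic sites can be illustrated with a small sketch. This is not the authors' pipeline or the GeneticThesaurus software's API. It is a toy example that assumes read counts per site are already available and that a lookup of similar sites ("links") has been built. All names here (`Site`, `Counts`, `alt_fraction`, `pooled_features`) are hypothetical. The resulting features are the kind of input a classifier might consume; the model itself is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical representations, chosen for illustration only.
Site = Tuple[str, int]    # (chromosome, position)
Counts = Tuple[int, int]  # (reference-supporting reads, alternate-supporting reads)


def alt_fraction(counts: List[Counts]) -> float:
    """Fraction of reads supporting the alternate allele, pooled over the given sites."""
    ref = sum(c[0] for c in counts)
    alt = sum(c[1] for c in counts)
    total = ref + alt
    return alt / total if total else 0.0


def pooled_features(
    site: Site,
    links: Dict[Site, List[Site]],
    tumor: Dict[Site, Counts],
    normal: Dict[Site, Counts],
) -> Dict[str, float]:
    """Summarize evidence at a site together with its linked (similar) sites."""
    group = [site] + links.get(site, [])
    return {
        # Evidence at the candidate site alone.
        "tumor_local": alt_fraction([tumor.get(site, (0, 0))]),
        # Evidence pooled over the site and every similar site linked to it.
        "tumor_pooled": alt_fraction([tumor.get(s, (0, 0)) for s in group]),
        "normal_pooled": alt_fraction([normal.get(s, (0, 0)) for s in group]),
        # Number of similar sites linked to the candidate.
        "n_linked": float(len(group) - 1),
    }


if __name__ == "__main__":
    site = ("chr1", 100)
    links = {site: [("chr5", 200)]}
    tumor = {("chr1", 100): (8, 2), ("chr5", 200): (6, 4)}
    normal = {("chr1", 100): (10, 0), ("chr5", 200): (10, 0)}
    print(pooled_features(site, links, tumor, normal))
```

In this toy input, alternate-allele reads appear at both linked sites in the tumor and at neither in the normal, so the pooled tumor fraction is higher than the local one while the pooled normal fraction stays at zero. Pooling of this kind reflects the fact that short reads from a non-unique region may align to any of the similar sites.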

Fig. 1: Calling mutations in non-unique regions of the genome.
Fig. 2: Concordance of simple and thesaurus mutational profiles.
Fig. 3: Mutation rates in functional regions.
Fig. 4: Thesaurus mutations in gene families.

Data availability

The PCAWG dataset is available through the ICGC data portal, https://dcc.icgc.org/pcawg. Somatic mutations called in this study are available at https://www.synapse.org/#!Synapse:syn22297877.

Code availability

The thesaurus annotation software is available at sourceforge.net/projects/geneticthesaurus/ and github.com/tkonopka/GeneticThesaurus.

Acknowledgements

This work is supported by The Francis Crick Institute, which receives its core funding from Cancer Research UK (FC001202), the UK Medical Research Council (FC001202) and the Wellcome Trust (FC001202). M.T. was supported as a postdoctoral fellow by the European Union’s Horizon 2020 research and innovation program (Marie Skłodowska-Curie grant agreement 747852-SIOMICS) and is a postdoctoral researcher of the F.R.S.-FNRS. J.D. is a postdoctoral fellow of the Research Foundation, Flanders (FWO). A.M.F. is an NIHR senior investigator and is supported by the National Institute for Health Research, UCLH Biomedical Research Centre and the CRUK Experimental Cancer Centre. P.V.L. is a Winton Group Leader in recognition of the Winton Charitable Foundation’s support toward the establishment of The Francis Crick Institute. T.K. would like to thank D. Smedley. This project was enabled through the Crick Scientific Computing STP and through access to the MRC eMedLab Medical Bioinformatics infrastructure, supported by the UK Medical Research Council (grant number MR/L016311/1). The Bone Cancer Research Trust funded sample biobanking.

Author information

Contributions

All authors edited and approved the final manuscript. M.T. wrote the first draft of the paper, designed experiments, performed statistical analyses, performed bioinformatics analyses and performed data visualization. J.D. performed bioinformatics analyses of linked-read data. A.V. generated short-read and linked-read data. A.M.F. provided tumor samples and performed pathology assessments. P.V.L. wrote the first draft of the paper, designed experiments and supervised the study jointly with T.K. T.K. wrote the first draft of the paper, designed experiments, performed statistical analyses, performed bioinformatics analyses, performed data visualization and supervised the study jointly with P.V.L.

Corresponding authors

Correspondence to Maxime Tarabichi or Peter Van Loo or Tomasz Konopka.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–28.

Reporting Summary

Supplementary Table 1

Summary of z-scores and histology specificity for all genomic regions. The table contains summary statistics for all genes in the annotation set. Counts, z-scores and entropy scores are provided based on non-hypermutated samples.

About this article

Cite this article

Tarabichi, M., Demeulemeester, J., Verfaillie, A. et al. A pan-cancer landscape of somatic mutations in non-unique regions of the human genome. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-00971-y
