Abstract
A substantial fraction of the human genome displays high sequence similarity with at least one other genomic sequence, posing a challenge for the identification of somatic mutations from short-read sequencing data. Here we annotate genomic variants in 2,658 cancers from the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort with links to similar sites across the human genome. We train a machine learning model to use signals distributed over multiple genomic sites to call somatic events in non-unique regions and validate the data against linked-read sequencing in an independent dataset. Using this approach, we uncover previously hidden mutations in ~1,700 coding sequences and in thousands of regulatory elements, including in known cancer genes, immunoglobulins and highly mutated gene families. Mutations in non-unique regions are consistent with mutations in unique regions in terms of mutation burden and substitution profiles. The analysis provides a systematic summary of the mutation events in non-unique regions at a genome-wide scale across multiple human cancers.
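The idea of using signals distributed over multiple genomic sites can be illustrated with a minimal sketch. It is not the authors' model or the GeneticThesaurus API. The function name `pooled_features`, the `links` mapping (a stand-in for thesaurus annotations), and the `(alt_reads, total_reads)` count format are all assumptions made for this illustration. The sketch pools read evidence over a candidate site and its linked similar sites, giving features that a classifier could in principle take as input.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]    # (chromosome, 1-based position); hypothetical format
Counts = Tuple[int, int]  # (alt_reads, total_reads); hypothetical format


def pooled_features(
    site: Site,
    links: Dict[Site, List[Site]],
    tumor: Dict[Site, Counts],
    normal: Dict[Site, Counts],
) -> Dict[str, float]:
    """Pool alt-allele evidence over a site and its linked similar sites.

    `links` stands in for thesaurus-style annotations; it is an assumed
    structure for this sketch, not the format used by the published software.
    """
    group = [site] + links.get(site, [])

    def pool(counts: Dict[Site, Counts]) -> Counts:
        # Sites without coverage contribute zero reads.
        alt = sum(counts.get(s, (0, 0))[0] for s in group)
        depth = sum(counts.get(s, (0, 0))[1] for s in group)
        return alt, depth

    t_alt, t_depth = pool(tumor)
    n_alt, n_depth = pool(normal)
    return {
        "n_sites": len(group),
        "tumor_alt": t_alt,
        "tumor_depth": t_depth,
        "tumor_vaf": t_alt / t_depth if t_depth else 0.0,
        "normal_vaf": n_alt / n_depth if n_depth else 0.0,
    }


# Toy example: one candidate site with a single linked similar site.
links = {("chr1", 100): [("chr5", 200)]}
tumor = {("chr1", 100): (3, 20), ("chr5", 200): (5, 20)}
normal = {("chr1", 100): (0, 25), ("chr5", 200): (0, 15)}
features = pooled_features(("chr1", 100), links, tumor, normal)
print(features)
```

In this toy input the alternate allele is split between two similar sites in the tumor and absent from the matched normal. Pooling recovers the combined support that would be diluted if each site were inspected on its own.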
Data availability
The PCAWG dataset is available through the ICGC data portal, https://dcc.icgc.org/pcawg. Somatic mutations called in this study are available at https://www.synapse.org/#!Synapse:syn22297877.
Code availability
The thesaurus annotation software is available at sourceforge.net/projects/geneticthesaurus/ and github.com/tkonopka/GeneticThesaurus.
Acknowledgements
This work is supported by The Francis Crick Institute, which receives its core funding from Cancer Research UK (FC001202), the UK Medical Research Council (FC001202) and the Wellcome Trust (FC001202). M.T. was supported as a postdoctoral fellow by the European Union’s Horizon 2020 research and innovation program (Marie Skłodowska-Curie grant agreement 747852-SIOMICS) and is a postdoctoral researcher of the F.R.S.-FNRS. J.D. is a postdoctoral fellow of the Research Foundation, Flanders (FWO). A.M.F. is an NIHR senior investigator and is supported by the National Institute for Health Research, UCLH Biomedical Research Centre and the CRUK Experimental Cancer Centre. P.V.L. is a Winton Group Leader in recognition of the Winton Charitable Foundation’s support toward the establishment of The Francis Crick Institute. T.K. would like to thank D. Smedley. This project was enabled through the Crick Scientific Computing STP and through access to the MRC eMedLab Medical Bioinformatics infrastructure, supported by the UK Medical Research Council (grant number MR/L016311/1). The Bone Cancer Research Trust funded sample biobanking.
Author information
Contributions
All authors edited and approved the final manuscript. M.T. wrote the first draft of the paper, designed experiments, performed statistical analyses, performed bioinformatics analyses and performed data visualization. J.D. performed bioinformatics analyses of linked-read data. A.V. generated short-read and linked-read data. A.M.F. provided tumor samples and performed pathology assessments. P.V.L. wrote the first draft of the paper, designed experiments and supervised the study jointly with T.K. T.K. wrote the first draft of the paper, designed experiments, performed statistical analyses, performed bioinformatics analyses, performed data visualization and supervised the study jointly with P.V.L.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–28.
Supplementary Table 1
Summary of z-scores and histology specificity for all genomic regions. The table contains summary statistics for all genes in the annotation set. Counts, z-scores and entropy scores are provided based on non-hypermutated samples.
About this article
Cite this article
Tarabichi, M., Demeulemeester, J., Verfaillie, A. et al. A pan-cancer landscape of somatic mutations in non-unique regions of the human genome. Nat Biotechnol 39, 1589–1596 (2021). https://doi.org/10.1038/s41587-021-00971-y
DOI: https://doi.org/10.1038/s41587-021-00971-y
This article is cited by
- PanCancer analysis of somatic mutations in repetitive regions reveals recurrent mutations in snRNA U2. npj Genomic Medicine (2022)