Focus on TCGA Pan-Cancer Analysis

Thread 1: Mutational drivers

Journal name:
Nature Genetics

A tumor gains its selective advantage from ‘driver’ mutations in genes involved in key pathways regulating cell identity, survival and genome stability. However, tumorigenic processes are mutagenic, and there may be many more mutated ‘passenger’ genes that confer no further advantage to the cell and expose the tumor to immune surveillance by the body. Some genes are recurrently mutated, whereas others are rarely mutated. The Pan-Cancer analysis group used comparisons across tumor types to refine its ability to discriminate driver mutations, enabling the improved delineation of the pathways through which driver mutations exert their effects on tumor initiation, proliferation and spread.

Nature Genetics Emerging landscape of oncogenic signatures across human cancers Giovanni Ciriello et al. 10.1038/ng.2762

The complex landscapes of somatic modifications observed in tumors are typically the result of a relatively small number of functional oncogenic alterations (sometimes called driver events), which are outnumbered by non-functional alterations (passenger events) that do not substantially contribute to oncogenesis and progression (ref. 8). The low signal-to-noise ratio (the ratio of the number of functional to non-functional events) presents a major challenge for data mining or data analysis.

Here we distilled thousands of genetic and epigenetic features altered in cancers to ~500 selected functional events (SFEs). Using this simplified description, we derived a hierarchical classification of 3,299 TCGA tumors from 12 cancer types. The top classes are dominated by either mutations (M class) or copy number changes (C class). This distinction is clearest at the extremes of genomic instability, indicating the presence of different oncogenic processes.

At the top of this hierarchical classification, we identified two main tumor classes of similar size, each characterized by distinct sets of SFEs (Fig. 2a). Unexpectedly, although the distinction between copy number alterations and mutations was not used as a feature in our classification, these characteristic events were predominantly somatic mutations in one class and copy number alterations in the other (Fig. 2b). Closer inspection of the distribution of selected functional events showed a striking inverse relationship between copy number alterations and somatic mutations at the extremes of genomic instability, particularly in highly altered tumors (Fig. 2c).

The first partition of the pan-cancer data set identifies two main classes primarily characterized by either recurrent mutations (M class) or recurrent copy number alterations (C class).

(a) Each class is composed of multiple tumor types in different proportions. (b) SFEs were tested for significant enrichment (more frequent than expected in a random distribution) in each class (events along the x axis, log-scaled q values on the y axis). Highly enriched events are primarily mutations in the M class and copy number alterations in the C class. Mut, mutation; meth, methylation change; amp, amplification; del, deletion. (c) The distribution of SFEs in tumors indicates that the number of copy number alterations in a sample (x axis) is approximately anticorrelated with the number of somatic mutations in a sample (y axis). The number of samples for a given (x,y) position range from 0 (white) to 243 (dark blue). CNAs, copy number alterations.
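The enrichment test described in panel (b) can be illustrated with a one-sided hypergeometric test followed by Benjamini-Hochberg correction. This is only a minimal sketch: the text above does not say which test statistic the authors used, so the choice of test, the function names and the toy counts are all assumptions made for illustration.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n tumors are drawn from N, of which K carry the event."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonic q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

# Toy data (hypothetical): 100 tumors, 50 of them in the class of interest.
# event -> (tumors with the event overall, tumors with the event in the class)
events = {"mut_A": (20, 18), "amp_B": (20, 10), "del_C": (30, 16)}
N, n = 100, 50
pvals = [hypergeom_sf(k, N, K, n) for (K, k) in events.values()]
qvals = benjamini_hochberg(pvals)
for name, p, q in zip(events, pvals, qvals):
    print(f"{name}: p={p:.3g} q={q:.3g}")
```

In this toy example only the hypothetical event `mut_A`, which falls almost entirely inside the class, receives a small q value.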

Starting from this first major subdivision, we applied the network modularity algorithm recursively to the C class and M class tumors and to their subclasses. The result was hierarchical division into several levels of subclasses characterized by distinct patterns of functional alteration at each level of granularity (Fig. 3, Supplementary Fig. 5 and Supplementary Table 3).

Characteristic patterns of functional alterations and distinct oncogenic processes as determinants of oncogenic signature classes (OSCs).

(a) The first partition of the tree-like stratification (starting with 'all tumors' on the left) identifies two main classes: the M class (green) and the C class (red). We identify 17 oncogenic signature subclasses for the M class (M1–M17) and 14 oncogenic signature subclasses for the C class (C1–C14) (one row per subclass). (b) Each subclass includes subsets of tumors from several cancer types (grayscale heatmap; gray intensity represents the fraction of samples in a particular tumor type (column) and a particular subclass (row)). (c) Tree classification is determined at each level by sets of characteristic functional events (color intensity represents the fraction of samples in a subclass (row) affected by a particular functional event (column)). For functional copy number alterations, we indicate, if present, known oncogenes and tumor suppressors in parentheses, for example, 8q24 (MYC). (d,e) Subclass characteristic events reflect particular cellular processes (color intensity represents the fraction of samples in a subclass (row) affected by alterations to a particular process (column)) (d) and altered pathways involved in each of the processes (e). RTK, receptor tyrosine kinase; DSB, double-strand break.

Notably, TP53 mutations were an exception to this trend, as they were strongly enriched in the C class (q = 3 × 10^−176), consistent with early mutations in TP53 causing copy number genomic instability (Supplementary Fig. 1). This division into two main tumor classes indicates that recurrent copy number alterations and mutations are predominant in different subsets of tumors.

TP53-mutated tumors.

(a) TP53-mutated tumors within the M class have a higher number of recurrent copy number alterations (for all SFEs) than TP53 wild-type samples. (b) Although both missense (residue-changing) and truncating TP53 mutations are more frequent in the C class than in the M class, this enrichment is significantly stronger for truncating mutations.

How somatic copy number alterations (SCNAs) affect cancer genes

Nature Genetics Pan-cancer patterns of somatic copy number alteration Travis Zack, Steven Schumacher et al. 10.1038/ng.2760

Determining how somatic copy number alterations (SCNAs) promote cancer is an important goal. We characterized SCNA patterns in 4,934 cancers from The Cancer Genome Atlas Pan-Cancer data set. Whole-genome doubling, observed in 37% of cancers, was associated with higher rates of every other type of SCNA, TP53 mutations, CCNE1 amplifications and alterations of the PPP2R complex. SCNAs that were internal to chromosomes tended to be shorter than telomere-bounded SCNAs, suggesting different mechanisms underlying their generation. Significantly recurrent focal SCNAs were observed in 140 regions, including 102 without known oncogene or tumor suppressor gene targets and 50 with significantly mutated genes.

Tissue types from similar lineages tended to have similar rates of amplification and deletion in peak SCNA regions (Fig. 3a). We observed clusters of squamous cell carcinomas (head and neck squamous cell carcinoma, lung squamous cell carcinoma and bladder cancer) and reproductive cancers (ovarian and endometrial cancer) with breast cancer.

Significantly recurrent focal SCNAs.

(a) Frequencies of amplification minus frequencies of deletion (red and blue indicate greater frequencies of amplifications and deletions, respectively) across lineages (x axis; see Supplementary Table 1 for a list of lineage abbreviations) for all 140 significant peak regions of SCNA, arranged in order of significance (y axis). The ordering of lineages reflects the results of unsupervised hierarchical clustering of these data. Magnified views of the values for the ten most significant amplification and deletion peaks are shown to the right, alongside candidate targets for these regions. Criteria for selecting the indicated candidates are described in the Online Methods. (b) Associated terms in the literature in peak regions containing fewer than 25 genes, according to a GRAIL analysis of all peak regions (top) and peak regions without known cancer genes or large genes (bottom).

The features most associated with genes in the amplification and deletion peak regions are known to be associated with cancer (Fig. 3b). We applied GRAIL (ref. 37), which uses literature citations, to find common features of genes in selected regions of the genome.

Robust methodologies for driver detection

Nature Mutational heterogeneity in cancer and the search for new cancer-associated genes Michael Lawrence, Petar Stojanov, Paz Polak et al. 10.1038/nature12213

Here we describe a fundamental problem with cancer genome studies: as the sample size increases, the list of putatively significant genes produced by current analytical methods burgeons into the hundreds. [...] By incorporating mutational heterogeneity into the analyses, MutSigCV is able to eliminate most of the apparent artefactual findings and enable the identification of genes truly associated with cancer.

We analysed heterogeneity across patients with a given cancer type. Analysis of the 27 cancer types revealed that the median frequency of non-synonymous mutations varied by more than 1,000-fold across cancer types (Fig. 1).
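As a rough illustration of how a fold-range of median mutation frequencies is obtained, the sketch below computes per-type medians and compares the lowest with the highest. The cancer-type labels and frequency values are hypothetical placeholders, not data from the study.

```python
from statistics import median

# Hypothetical per-sample mutation frequencies (mutations per Mb) by cancer type.
freqs = {
    "pediatric_type": [0.05, 0.1, 0.2],
    "mid_type": [1.0, 2.5, 4.0],
    "carcinogen_type": [50.0, 120.0, 300.0],
}

medians = {t: median(v) for t, v in freqs.items()}
# Order types from the lowest to the highest median frequency.
ordered = sorted(medians, key=medians.get)
fold = medians[ordered[-1]] / medians[ordered[0]]
print(ordered, f"fold range of medians: {fold:.0f}x")
```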

Somatic mutation frequencies observed in exomes from 3,083 tumour-normal pairs.

Each dot corresponds to a tumour-normal pair, with vertical position indicating the total frequency of somatic mutations in the exome. Tumour types are ordered by their median somatic mutation frequency, with the lowest frequencies (left) found in haematological and paediatric tumours, and the highest (right) in tumours induced by carcinogens such as tobacco smoke and ultraviolet light. Mutation frequencies vary more than 1,000-fold between lowest and highest across different cancers and also within several tumour types. The bottom panel shows the relative proportions of the six different possible base-pair substitutions, as indicated in the legend on the left.

Significantly mutated genes (SMGs)

Nature Mutational landscape and significance across 12 major cancer types Cyriac Kandoth, Michael McLellan et al. 10.1038/nature12634

To identify genes that show positive selection in individual tumor types and across the 12 tumor types, the MuSiC-SMG test was performed to find those genes displaying significantly higher mutation frequencies than background. Our systematic analysis guided by gene expression data and manual curation (see Methods) discovered 127 significantly mutated genes (SMGs, Supplementary Table 4). Notably, 3,053 out of 3,281 total samples (93%) across the 12 Pan-Cancer types had at least one non-synonymous mutation in at least one of these 127 SMGs. These SMGs are involved in a wide range of cellular processes and can be broadly classified into 20 categories (Fig. 2). The top categories include transcription factors/regulators (21 genes), histone modifiers (13 genes), genome integrity (13 genes), RTK signaling (9 genes), cell cycle (7 genes), MAPK signaling (7 genes), PI3K signaling (6 genes), Wnt/β-catenin signaling (5 genes), histone (3 genes), ubiquitin-mediated proteolysis (3 genes), and splicing (3 genes) (Fig. 2).
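The idea of testing whether a gene is mutated more often than background can be sketched with a one-sided binomial test. This is a simplified illustration only and is not the MuSiC-SMG test itself, which is more elaborate; the background rate, gene length, sample count and gene names below are hypothetical.

```python
from math import exp, lgamma, log

def binom_logpmf(i, n, p):
    """Log of the Binomial(n, p) probability mass at i (requires 0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))

def binom_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return min(1.0, sum(exp(binom_logpmf(i, n, p)) for i in range(k, n + 1)))

mu = 2e-6        # hypothetical background rate per base per sample
length = 1500    # hypothetical coding length in bases
n_samples = 300  # hypothetical cohort size
# Chance that one sample carries at least one background mutation in the gene.
p_gene = 1 - (1 - mu) ** length

# Hypothetical numbers of mutated samples per gene.
for gene, k in {"GENE_A": 12, "GENE_B": 1}.items():
    print(gene, f"p={binom_sf(k, n_samples, p_gene):.3g}")
```

With an expected count of roughly one mutated sample, twelve mutated samples give a very small P value whereas a single mutated sample does not.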

127 significantly mutated genes from over 20 cellular processes in cancer identified in 12 tumor types.

Percentages of samples mutated in individual tumor types and Pan-Cancer are shown, with highest percentage in each gene among 12 cancer types in bold.

Combining multiple signals of positive selection to identify cancer drivers

Scientific Reports Comprehensive identification of mutational cancer driver genes across 12 tumor types David Tamborero, Abel Gonzalez-Perez et al. 10.1038/srep02650

Driver genes can be identified by detecting signals of positive selection in their mutational pattern across tumors. High frequency of mutations is the most intuitive of these signals (detected by MutSigCV and MuSiC). Other complementary signals include functional impact bias (OncodriveFM), clustering of mutations (OncodriveCLUST) and overrepresentation of mutations in phosphorylation sites (ActiveDriver) (Fig. 1a). Here we show that combining complementary methods allows the identification of a comprehensive and reliable list of cancer driver genes. We describe the analysis of somatic mutations obtained via exome sequencing of 3,205 tumors from 12 tumor types by the Cancer Genome Atlas (TCGA) research network using these five complementary approaches. We combined the lists of driver candidates identified by these five methods both across the whole Pan-Cancer dataset and in each individual tumor type using a rule-based approach. This analysis results in the detection of 291 high-confidence mutational cancer driver genes (HCD) acting in these tumors (Fig. 1b). Among those genes, some have not been previously identified as cancer drivers and 16 have a clear preference to sustain mutations in one specific tumor type.
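One way to picture a rule-based combination of method outputs is to count, for each gene, how many distinct signals of positive selection support it. The sketch below assumes a simple rule (at least two distinct signals); the excerpt does not spell out the authors' actual rules, so this threshold, the function names and the placeholder gene names are all hypothetical.

```python
# Map each method to the signal of positive selection it detects (from the text).
SIGNAL_OF = {
    "MutSigCV": "recurrence",
    "MuSiC": "recurrence",
    "OncodriveFM": "functional_impact_bias",
    "OncodriveCLUST": "mutation_clustering",
    "ActiveDriver": "phosphosite_enrichment",
}

def signals_per_gene(calls):
    """calls: method -> set of genes it flags. Returns gene -> set of signals."""
    out = {}
    for method, genes in calls.items():
        for gene in genes:
            out.setdefault(gene, set()).add(SIGNAL_OF[method])
    return out

def high_confidence(calls, min_signals=2):
    """Hypothetical rule: keep genes backed by >= min_signals distinct signals."""
    per_gene = signals_per_gene(calls)
    return sorted(g for g, s in per_gene.items() if len(s) >= min_signals)

calls = {  # placeholder gene names, not results from the paper
    "MutSigCV": {"GENE1", "GENE2"},
    "MuSiC": {"GENE1", "GENE2", "GENE3"},
    "OncodriveFM": {"GENE1", "GENE4"},
    "OncodriveCLUST": {"GENE1", "GENE3"},
    "ActiveDriver": set(),
}
print(high_confidence(calls))
```

Because MutSigCV and MuSiC detect the same signal, a gene flagged by both (the placeholder `GENE2`) still counts as having a single signal under this rule.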

Figure 1

A) Illustration of the four signals of positive selection used to identify driver genes and the methods that implement them. B) Venn diagram showing the contribution of each method in number of genes that it detects to the list of HCDs. The names of the genes detected by 3 or more methods are shown.

One hundred and sixty-five of these candidates are novel findings not included in the Cancer Gene Census (CGC).

Thirteen selected non-CGC, or novel, cancer genes are depicted in Figure 4 within their functional interaction context. These novel driver candidates appear alongside other well-established cancer genes.

Figure 4

A) Diagram showing 13 selected candidate cancer genes within their functional interaction context. B) Heat-map depicting the frequency and number of samples with protein-affecting mutations (PAMs) of the 13 selected ‘novel’ cancer genes in each tumor type and in the complete pan-cancer dataset. Colored circles indicate methods identifying each gene either in the per-project analyses or in the pan-cancer analysis. Note that six of the genes in the Figure show only one signal of positive selection and are therefore included within the HCDs due to their connections with other drivers.

A resource to explore cancer drivers across tumor types

Nature Methods IntOGen-mutations identifies cancer drivers across tumor types Abel Gonzalez-Perez et al. 10.1038/nmeth.2642

In addition to the data generated by the TCGA Research Network there are other initiatives focused on tumor genome resequencing, including projects within the International Cancer Genome Consortium and other independent projects.

The IntOGen-mutations platform summarizes somatic mutations, genes and pathways involved in tumorigenesis.

The IntOGen-mutations pipeline integrates the results of tumor genomes analyzed with different mutation-calling workflows and is scalable to hundreds of thousands of tumor genomes. It currently includes OncodriveFM (ref. 7), a tool that detects genes that are significantly biased toward the accumulation of mutations with high functional impact (FM bias) without the need to estimate background mutation rate (ref. 8), and OncodriveCLUST (ref. 9), which picks up genes whose mutations tend to cluster in particular regions of the protein sequence with respect to synonymous mutations (CLUST bias) (Online Methods). Both tools detect signals of positive selection, which appear in genes whose mutations are selected during tumor development and are therefore likely drivers.

These scores are subsequently transformed (with transFIC; ref. 14) to compensate for the differences in baseline tolerance among genes, and each mutation is classified into one of four broad groups of impact, ranging from “None” to “High,” according to its consequence type and its transFIC MutationAssessor score (Fig. 1a).

We have analyzed somatic mutations in 4,623 samples from 31 different projects covering 13 anatomical sites (mainly from the International Cancer Genome Consortium (ICGC; ref. 1) and the TCGA (ref. 2)) (Supplementary Tables 1–3).

A systematic analysis of sequenced tumor genomes permits a broad view of the impact of genes in tumorigenesis across cancer types (Supplementary Fig. 2). For example, TP53, ARID1A, KRAS or PIK3CA are frequently mutated and identified as cancer drivers in most cancer sites. Other genes, such as VHL in kidney, MAPK3 and GATA3 in breast and STK11 in lung, seem to be primarily tumor-specific drivers.

Results in IntOGen for a list of selected genes.

The upper panel indicates if the genes are detected as drivers by OncodriveFM (blue squares) or OncodriveCLUST (red circles) in each project. In the lower panel the projects are aggregated by cancer site. Numbers indicate the total of samples with mutations in each gene and cancer site; the frequency is shown in a color scale from white to purple.

The results of the pipeline are automatically loaded into a Web browser managed by the Onexus framework (Supplementary Fig. 1).

The results can be browsed through the Web (Supplementary Note 2) and with Gitools interactive heat maps (ref. 15).

The pipeline may be downloaded and can also be run online on our servers. It can be used to identify drivers from newly sequenced cohorts of tumor samples (Supplementary Note 3) and to interpret the mutations observed in a tumor sample (Supplementary Note 4).

Drivers significantly mutated by common mutator processes

Nature Genetics Evidence for APOBEC3B mutagenesis in multiple human cancers Michael Burns, Nuri Temiz & Reuben Harris 10.1038/ng.2701

Thousands of somatic mutations accrue in most human cancers, and their causes are largely unknown. We recently showed that the DNA cytidine deaminase APOBEC3B accounts for up to half of the mutational load in breast carcinomas expressing this enzyme. Here we address whether APOBEC3B is broadly responsible for mutagenesis in multiple tumor types. We analyzed gene expression data and mutation patterns, distributions and loads for 19 different cancer types, with over 4,800 exomes and 1,000,000 somatic mutations.

Taken together with the comprehensive analyses presented here of expression data (Fig. 1), CG base-pair mutation frequencies (Fig. 2), local cytosine mutation signatures (Fig. 3), overall mutation loads (Fig. 4) and kataegis (Fig. 4c and Table 1), all available data converge on the conclusion that APOBEC3B is a major source of mutation in multiple human cancers.

APOBEC3B is upregulated in numerous cancer types.

Each data point represents one tumor or normal sample, and the y axis is log transformed for better data visualization. Red, blue and yellow horizontal bars indicate median APOBEC3B levels relative to TBP levels for each cancer type (Table 1), the median values for each set of RNA-seq data from normal tissues (Supplementary Table 1) and individual qRT-PCR data points, respectively. Asterisks indicate significant upregulation of APOBEC3B in the indicated tumor type relative to the corresponding normal tissues (P < 0.0001 by Mann-Whitney U test). P values for negative or insignificant associations are not shown.

Within breast cancer, the HER2-enriched subtype was clearly enriched for tumors with the APOBEC mutation pattern, suggesting that this type of mutagenesis is functionally linked with cancer development. The APOBEC mutation pattern also extended to cancer-associated genes, implying that ubiquitous APOBEC-mediated mutagenesis is carcinogenic.

APOBEC signature mutations occurred at a higher frequency among carcinogenic mutations in the group of samples with high APOBEC presence compared to samples in which the APOBEC mutation pattern was not detected (Fig. 5).

APOBEC signature mutations in potential cancer drivers.

The fraction of potential cancer-driving mutations that have an APOBEC signature was determined for samples with high (q value for the enrichment of the APOBEC mutation pattern ≤ 0.05) and low (q value > 0.05) presence of an exome-wide APOBEC mutation pattern. Mutations were designated as potential cancer drivers by one of three criteria: (i) Benjamini-Hochberg–corrected q value < 0.05 after CRAVAT analysis, (ii) listing within the COSMIC database and (iii) location in a subset of genes in the Cancer Gene Census whose alteration by missense or nonsense mutations can contribute to cancer. ***P < 0.0001 in a two-sided χ² test comparing the number of APOBEC and non-APOBEC signature mutations in potential cancer drivers in samples with high and low presence of the APOBEC mutation pattern for a given criterion defining a driver. Corresponding analysis for non-driver mutations is provided for comparison.
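The χ² comparison described in this caption operates on a 2 × 2 table: samples with high versus low APOBEC presence, and APOBEC versus non-APOBEC signature mutations among potential drivers. A minimal sketch of such a test follows. It uses the Pearson statistic without continuity correction, which is an assumption since the caption does not say which variant was used, and the counts are hypothetical.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, P value for 1 degree of freedom).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))

# Hypothetical counts of potential driver mutations:
# rows = samples with high / low APOBEC pattern presence,
# columns = APOBEC-signature / non-APOBEC-signature mutations.
stat, p = chi2_2x2(120, 380, 40, 460)
print(f"chi2={stat:.2f}, P={p:.3g}")
```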

Author information


  1. University Pompeu Fabra
    • Abel Gonzalez-Perez,
    • David Tamborero &
    • Nuria Lopez-Bigas
  2. Sage Bionetworks
    • Adam Margolin
  3. Washington University School of Medicine
    • Li Ding
  4. National Institute of Environmental Health Sciences
    • Dmitry Gordenin
  5. Nature Genetics
    • Myles Axton
