Systematic analysis of alterations in the ubiquitin proteolysis system reveals its contribution to driver mutations in cancer

Abstract

E3 ligases and degrons, the sequences they recognize in target proteins, are key parts of the ubiquitin-mediated proteolysis system. There are several examples of alterations of these two components of the system that have a role in cancer. Here we uncover the landscape of the contribution of such alterations to tumorigenesis across cancer types. We first systematically identified new instances of degrons across the human proteome by using a random forest classifier and validated the functionality of a dozen of them, exploiting somatic mutations across >7,000 tumors. We detected signals of positive selection across known and new degron instances. Our results reveal that several oncogenes are frequently targeted by mutations that affect the sequence of their degrons or their cognate E3 ubiquitin ligases, causing an abnormal increase in their protein abundance. Overall, an important number of driver mutations across primary tumors affect either degrons or E3-ubiquitin ligases.

Main

Targeted degradation of proteins via the ubiquitin-mediated proteolysis system (UPS) constitutes a key step in the maintenance of protein abundance within homeostatic levels and in the spatiotemporal control of the level of proteins that regulate crucial cellular processes1,2,3,4,5,6,7,8,9,10. Recognition of specific proteins for degradation is achieved through the interaction of E3 ubiquitin ligases (E3s), of which around 600 are encoded in the human genome, with short sequences within the target called degrons. While N-terminal and C-terminal degrons are often only one or two amino acids long (which are exposed by the action of peptidases), internal degrons are sequences spanning three to ten amino acids, frequently flanked by phosphorylation sites. The focus of this study is internal degrons (degrons, for simplicity), which, like transcription factor binding sites, are degenerate. We thus use the term motif to denote the representation (as a regular expression) of the collection of sequences that represent a degron. We refer to each of these sequences as a degron instance.

Upon binding of the cognate E3 to a degron instance, a second enzyme (E2, ubiquitin conjugating) transfers a ubiquitin, previously loaded onto it by an E1 activating enzyme, to a lysine of the target protein11. Typically, polyubiquitin chains linked through residue K48 or K11 of the ubiquitin polypeptide work as the signal for proteasomal degradation, although this can also be triggered following monoubiquitination12. The system also comprises deubiquitinases that counteract the activity of the E2–E3 complexes13.

Deregulation of the UPS resulting in changes in the stability of certain proteins has long been linked to human diseases14, and some mutations affecting the UPS are known to have a role in tumorigenesis9,15. Here we explore the pervasiveness across tumorigenesis of alterations of UPS mechanisms that lead to abnormal stabilization of oncoproteins. Tackling this question has not been possible so far due to the paucity in the identification of E3–target links, currently comprising close to 200 annotated instances of 28 degron motifs (for less than 5% of E3s) across 107 proteins9,16. To address this problem, we first identified likely new degron instances by employing a random forest classifier trained on their biochemical features and exploited the somatic mutations in tumors as a ‘natural experiment’ to functionally validate these new degron instances and detect other de novo degron instances. We then identified known and new degron instances under positive selection and calculated their total contribution to tumorigenesis. Overall, we estimated that alterations of UPS elements (degron instances and E3s) contribute at least 10% of driver mutations across primary tumors.

Results

New instances of known degrons identified with a machine learning approach

Experimentally identified degron instances have several biochemical features in common6. We used these properties to train a machine learning classifier to identify new instances across the human proteome. A set of 11 biochemical features, previously studied on smaller degron sets6,9, was first shown to differ between a group of 180 known degron instances (Supplementary Table 1) and sequences of the same length randomly drawn from the human proteome (Fig. 1a,b and Extended Data Fig. 1a,b). We trained a random forest classifier on these 11 biochemical features by using the set of known degron instances as a positive set and the same number of random sequences of comparable length drawn from all annotated protein isoforms of the human proteome17 as a negative set. Ten rounds of stratified fivefold cross-validation yielded an average area under the receiver operating characteristic (ROC) curve of 0.92 (Fig. 1c and Extended Data Fig. 1c–g). The classifier exhibited similar performance when tested on two independent datasets of experimentally identified degron instances of FBXW11 (ref. 18) and FBXW7 (ref. 19) (Methods, Supplementary Note and Extended Data Fig. 1h–j). For brevity, throughout the manuscript, we use gene symbols to refer to their protein products.
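The evaluation protocol described above (repeated stratified k-fold cross-validation scored by ROC AUC) can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the random forest is replaced by a hypothetical one-feature nearest-centroid scorer (`fit_centroids`, `score_centroids`), and all function names are assumptions introduced here.

```python
import random
from statistics import mean


def roc_auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds, keeping the class ratio in each fold."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def repeated_cv_auc(X, y, fit, predict, k=5, rounds=10, seed=0):
    """Average held-out AUC over `rounds` repetitions of stratified k-fold CV."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(rounds):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            scores = [predict(model, X[i]) for i in test_idx]
            aucs.append(roc_auc([y[i] for i in test_idx], scores))
    return mean(aucs)


# Hypothetical stand-in for the random forest (assumption, for illustration only).
def fit_centroids(X, y):
    pos = mean(x for x, label in zip(X, y) if label == 1)
    neg = mean(x for x, label in zip(X, y) if label == 0)
    return pos, neg


def score_centroids(model, x):
    pos, neg = model
    return abs(x - neg) - abs(x - pos)  # higher means closer to the positive class
```

With a real classifier, `fit` and `predict` would wrap its training and probability-scoring calls; the splitting and scoring logic stays the same.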

Scanning the human proteome with the 28 known degron motifs (Fig. 1d) produced 84,835 matches (motif match in Fig. 1d,e and Supplementary Data), many of which are likely false positives (as shown by the correlation between the number of motif matches per protein and protein length; Extended Data Fig. 1k). Then, all motif matches were evaluated by the classifier and ranked according to the score provided by it (degron probability). In all, 20,929 matches in different protein isoforms possessed degron probability greater than 0.5 (new degron instances in Fig. 1d,e and henceforth). The number of new degron instances showed only weak correlation with the length of the proteins containing them (Extended Data Fig. 1l,m).
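As a minimal sketch of the two steps described above, the snippet below scans a sequence with a motif written as a regular expression and then keeps matches whose classifier score exceeds 0.5. The pattern `DSG..S` and the toy sequence are illustrative assumptions, not motifs taken from this study, and `score` stands for any function returning a degron probability.

```python
import re


def scan_motif(sequence, pattern):
    """Return (start, end, matched_text) for every motif match, including overlapping ones."""
    # A lookahead wrapper lets the regex engine report overlapping matches.
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.end(1), m.group(1)) for m in regex.finditer(sequence)]


def keep_likely_instances(matches, score, threshold=0.5):
    """Keep matches whose degron probability (from `score`) is above the threshold."""
    return [m for m in matches if score(m[2]) > threshold]


# Toy usage with a hypothetical motif and a constant stand-in scorer.
toy_hits = scan_motif("SYLDSGIHSGATT", "DSG..S")
likely = keep_likely_instances(toy_hits, score=lambda window: 0.9)
```

Coordinates are 0-based and half-open, so a hit `(3, 9, ...)` covers residues 4 to 9 in the usual 1-based protein numbering.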

All motif matches and new degron instances annotated with further information are presented in Supplementary Table 2. For example, we highlight the 565 new instances with degron probability above that of the lowest scoring known instance of the corresponding motif (Extended Data Fig. 2) and which occurred in proteins known to interact with the cognate E3s20.

Validating the functionality of computationally identified degrons

To functionally validate new degron instances identified in the previous section, one could compare the stability of a protein carrying an alteration in the degron to that of its wild-type form in a cell system. Following this idea, we employed human tumors as natural mutagenic experiments. We gathered measurements of the abundance of 236 proteins (through reverse-phase protein assay (RPPA)), the RNA level of their transcripts and somatic coding mutations across 6,909 primary tumors obtained by The Cancer Genome Atlas (TCGA)21,22. We collected the same data for 212 proteins across 861 cancer cell lines probed by the Cancer Cell Line Encyclopedia (CCLE) and the MD Anderson Cancer Center22,23 (Supplementary Table 3).

We computed a robust regression between the levels of each protein and the corresponding mRNA across non-mutated samples (Fig. 2a–c), obtaining the expected protein level in tumors from the observed mRNA level. The distance between the observed protein level and the expected value (residual) was re-scaled to account for the standard deviation of the protein and mRNA levels observed for all wild-type instances of the protein (referred to as the stability change of the protein hereafter; Supplementary Note and Fig. 2d–f). While, on average, missense mutations had the effect of reducing the stability of proteins, a substantial fraction of them were stabilizing (Fig. 2g,h).

For example, 38 of 40 nonsynonymous mutations and in-frame indels mapping to NFE2L2 across primary tumors affected a KEAP1 degron or its flanking residues (Fig. 3a), probably interfering with its recognition24. The stability of NFE2L2 was significantly higher in these 38 NFE2L2-mutant tumors than in the 473 tumors carrying the wild-type protein (Fig. 3b). Analogously, 59 primary tumors and 16 cancer cell lines harbored 22 unique nonsynonymous mutations and 11 in-frame indels affecting the CTNNB1 BTRC degron25 (Extended Data Fig. 3a,b). The stability of CTNNB1 in primary tumors or cancer cell lines with degron alterations was significantly higher than in those carrying the wild-type protein or other missense mutations or in-frame indels affecting CTNNB1 (Fig. 3c). Nonsynonymous mutations and in-frame indels mapping to a new FBXO31 degron instance in CCND1 (Supplementary Table 2), for which experimental evidence exists26,27, showed the same stabilizing effect (Fig. 3d).

Across primary tumors, proteins carrying 146 nonsynonymous substitutions or in-frame indels affecting known or new degron instances exhibited overall significantly higher stability than those affected by other mutations or their wild-type forms (Fig. 3e). Proteins with mutations affecting phosphorylation residues located close to known or new degron instances were also significantly more stable than the same protein bearing other mutations and their wild-type forms. This significant increase in stability was driven by mutations in both known degrons and new degrons, and it was more apparent for degrons with higher probability (Fig. 3f and Extended Data Fig. 3c–h). Furthermore, the stability of proteins bearing degron-affecting mutations in the top quartile of variant allele frequency (VAF) was significantly higher than that of proteins with degron-affecting mutations in the bottom quartile of VAF (Fig. 3g). On the other hand, the VAF of nonsynonymous mutations that did not overlap known or new degron instances had no discernible effect on change in protein stability (Extended Data Fig. 3j). Finally, most of the global increase in protein stability resulting from degron-affecting mutations in tumors could be explained by known oncogenes (Fig. 3h). Analysis of protein abundance across 105 breast and 175 ovarian primary tumors measured through mass spectrometry (MS) revealed the same pattern of stabilization of proteins with mutant degron instances (Extended Data Fig. 3i). (In all comparisons, TP53 mutations, known to trigger stabilization of its protein product28,29, were filtered out.) Taken together, these results demonstrate that the machine learning-based approach described above produces sequence matches enriched for true instances of annotated degrons.

Highly stabilizing mutations identify potential de novo degrons

We reasoned that highly stabilizing mutations (that is, those in the top quartile of the distribution of stability change values; Fig. 4a) that did not overlap known or new degron instances (red dots in the figure) could affect degrons with a still unknown motif. To discover some of these de novo degrons, we searched for proteins with regions of high degron probability bearing recurrent highly stabilizing mutations. Briefly, moving a rolling seven-amino-acid-wide window along the sequence of proteins affected by two or more highly stabilizing mutations, we computed the degron probability (by using the classifier) of each window, hence generating a degron probability profile of the entire protein sequence (Fig. 4b). Peaks in this profile overlapping amino acids affected by two or more highly stabilizing mutations may correspond to regions comprising de novo degrons. We required that all mutations mapping to the putative region cause, on average, a greater change in stability than other nonsynonymous mutations affecting the protein (that is, a z score higher than 0.5; Fig. 4c). De novo degrons may thus be represented by two values (the mean protein stability change triggered by mutations and the degron probability) in a two-dimensional graph (Fig. 4d). The four known degrons in our dataset bearing two or more highly stabilizing mutations (the BTRC degron of CTNNB1, the two KEAP1 degrons of NFE2L2 and the CBL degron of MET) correspond to stretches of protein sequence with peaks of degron probability (Extended Data Fig. 4a–c).
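The rolling-window profile can be sketched as follows. This is an illustration under stated assumptions: `score` is any function mapping a seven-residue window to a degron probability (the trained classifier in the study, a toy function here), positions are 0-based, and the `min_prob` cutoff used to call a peak is a hypothetical choice rather than a value given in the text.

```python
def degron_probability_profile(sequence, score, width=7):
    """Score every window of `width` residues; entry i describes sequence[i:i+width]."""
    return [score(sequence[i:i + width]) for i in range(len(sequence) - width + 1)]


def candidate_windows(profile, mutated_positions, width=7, min_prob=0.5, min_mutations=2):
    """Windows with high degron probability that contain enough stabilizing mutations."""
    hits = []
    for start, prob in enumerate(profile):
        n_mut = sum(start <= p < start + width for p in mutated_positions)
        if prob > min_prob and n_mut >= min_mutations:
            hits.append((start, prob, n_mut))
    return hits


# Toy scorer (assumption): fraction of S/T/D/E residues in the window.
def toy_score(window):
    return sum(residue in "STDE" for residue in window) / len(window)


toy_sequence = "A" * 7 + "SDSTEDS" + "A" * 7
toy_profile = degron_probability_profile(toy_sequence, toy_score)
toy_hits = candidate_windows(toy_profile, mutated_positions=[8, 10])
```

Adjacent hits would normally be merged into one region before applying the z score filter on stability change described in the text.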

We found 19 regions of 17 proteins containing de novo degrons (Fig. 4e), for 5 of which there was evidence of direct interaction with at least one E3 (circles with border). We identified, for example, a de novo degron with two mutations mapping between amino acid residues 1298 and 1306 of ERBB3 (Fig. 4f), an oncoprotein known to be under the regulation of the RNF41 and NEDD4 E3s30,31. A region (308–322) of MAPK1 mutated in 11 primary tumors was also found to be likely to contain a de novo degron (Fig. 4g). Aspartate residues at positions 316 and 319 of MAPK1 are known to be important for phosphorylation and ubiquitination by MAP3K1 (ref. 32), but the degron has not been identified so far. Another de novo degron appeared between residues 628 and 633 of PRKCA, a known target of at least two E3s: a complex formed by RBCK1 and RNF31 (ref. 33) and TRIM41 (ref. 34) (Extended Data Fig. 4d). Another de novo degron in BRAF flanked an annotated FBXW7 degron, which recent studies have suggested was not the sole degron responsible for the regulation of BRAF abundance35,36. ARAF, one of the three Ras-binding proteins in the Ras–MAPK pathway known to interact with the E3 HERC2 (ref. 37), was found to carry another de novo degron. One of the mutations mapping to this degron has been observed to produce elevated MAPK1 and MAPK3 activity38. The 19 regions containing potential de novo degrons annotated with the information described above are presented in Supplementary Table 2.

Degron-affecting mutations are positively selected in tumorigenesis

We reasoned that the mutational pattern of degrons involved in tumorigenesis would exhibit evidence of positive selection across tumor samples, a principle that has been widely employed to detect cancer driver genes39,40,41,42,43. We thus developed a method, SMDeg, that probes the over-representation of missense mutations in instances of known or de novo degrons with respect to the number inferred from the distribution of all missense mutations observed in the protein. SMDeg first counts the number of observed missense mutations mapping to a degron and its 11 upstream and downstream flanking amino acids and the number of mutations mapping outside this window: six and two mutations, respectively, in the example in Fig. 5a. The same number of nonsynonymous mutations is then randomly placed 1,000 times along the sequence of the gene according to the mutational probability of each base, derived from the base’s pentanucleotide context42. Finally, the observed and average simulated number of mutations mapping within and outside the degron are compared by using a G test of goodness of fit, which produces the SMDeg P value. New, known and de novo degron instances with a false discovery rate (FDR) below 1% were deemed to be positively selected in tumorigenesis.
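The logic of this test can be illustrated with a simplified sketch. It is not the SMDeg implementation: mutation probabilities are given per amino acid position (`position_weights`, an assumption) instead of being derived from pentanucleotide contexts, and the function names are hypothetical. The G statistic for two categories has one degree of freedom, so its P value follows from the complementary error function.

```python
import math
import random


def g_test_two_bins(observed, expected):
    """G test of goodness of fit for two categories (1 degree of freedom)."""
    # Expected counts must be positive wherever the observed count is positive.
    g = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    g = max(g, 0.0)
    p_value = math.erfc(math.sqrt(g / 2.0))  # chi-square survival function, 1 d.f.
    return g, p_value


def smdeg_like_test(n_inside, n_outside, position_weights, window, n_sim=1000, seed=0):
    """Compare observed in-window mutation counts with the simulated average."""
    rng = random.Random(seed)
    total = n_inside + n_outside
    positions = range(len(position_weights))
    start, end = window  # half-open interval: degron plus flanking residues
    simulated_inside = 0
    for _ in range(n_sim):
        draws = rng.choices(positions, weights=position_weights, k=total)
        simulated_inside += sum(start <= p < end for p in draws)
    expected_inside = simulated_inside / n_sim
    expected = (expected_inside, total - expected_inside)
    return g_test_two_bins((n_inside, n_outside), expected)


# Toy usage: six mutations inside and two outside a 28-residue window of a
# hypothetical 100-residue protein with uniform mutation probabilities.
g_stat, p_val = smdeg_like_test(6, 2, [1.0] * 100, window=(40, 68))
```

Because the G test is two-sided, a check that the observed in-window count exceeds the expected one is still needed before calling a degron over-mutated.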

We also developed FMDeg, a method that computes the deviation of the average functional impact of missense mutations in new degron instances from the expected impact (Fig. 5b, left). To assess the potential impact of individual missense mutations on the capacity of a degron instance to bind its cognate E3, we designed an ad hoc score: degFI. degFI is based on some of the degron biochemical features included in the classifier described above (Fig. 5b, right). Mutations are scored depending on how many of the features included in degFI they fulfill (Methods and Supplementary Note). Proteins with mutant degrons with high degFI scores showed a significantly greater stability change than mutant proteins with low degFI scores (Fig. 5b). This demonstrates the usefulness of degFI in assessing the functional impact of mutations mapping to new degron instances. FMDeg computes the average degFI of the mutations observed in each degron match and the distribution of the average degFI from 10,000 samples with the same number of mutations in the matching sequence, drawn according to the mutation probability of each nucleotide in the gene as described above for SMDeg. Finally, it derives an empirical P value from the fraction of random samples with average degFI higher than (or equal to) the average degFI of observed mutations42. New degron instances with FDR < 10% (Benjamini–Hochberg) were deemed to be positively selected in tumorigenesis.
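A minimal sketch of the empirical P value and the Benjamini–Hochberg correction used for this kind of test is given below. It is an illustration only: `site_scores` and `site_weights` are hypothetical inputs holding the degFI score and mutation probability of each possible mutation in a degron match, and the degFI scoring itself is not reproduced.

```python
import random


def empirical_p_value(observed_scores, site_scores, site_weights, n_samples=10000, seed=0):
    """Fraction of random samples whose mean score is >= the observed mean score."""
    rng = random.Random(seed)
    k = len(observed_scores)
    observed_mean = sum(observed_scores) / k
    at_least = 0
    for _ in range(n_samples):
        sample = rng.choices(site_scores, weights=site_weights, k=k)
        if sum(sample) / k >= observed_mean:
            at_least += 1
    return at_least / n_samples


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

Degron instances whose q value falls below the chosen FDR threshold (10% in the text) would then be reported as positively selected.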

We used these tests to analyze the 20,929 new (and de novo) degron instances, both in a pan-cancer manner and within each cohort (Fig. 5c,d, Supplementary Table 4 and Supplementary Data). Both methods showed correct calibration (Extended Data Fig. 5a,b). In primary tumors, 15 of 26 genes (4 of 12 in CCLE; Extended Data Fig. 5c,d) bearing new or de novo (Extended Data Fig. 5e) degron instances that were significant or nearly significant according to both tests (Fig. 5e) were known drivers (according to the Cancer Gene Census; Fig. 5f). Oncogenes (7 genes) were significantly over-represented among the 26 genes with driver degrons (Fisher’s odds ratio (OR) = 9.4; P = 4.3 × 10⁻⁵).

Several known and new degron instances, such as those in NFE2L2, CTNNB1, MYCN, CCND1 and EPAS1, were significant according to both methods (Fig. 5c–e,g,h). Interestingly, a new instance of the APC degron in PCF11, a protein involved in the processing and maturation of mRNA, was also among the top-ranking cases (Fig. 5c–e,i). The de novo degrons detected in MAPK1 and PIK3R1 were highly significant in the pan-cancer analysis (Extended Data Fig. 5e). The degrons of NFE2L2, CTNNB1, CCND1 and others were also significantly enriched for in-frame indels (one-tailed Fisher’s test; Supplementary Table 4). Finally, the degrons of 115 proteins may be disrupted by fusion events in at least two primary tumors in the cohort. For example, a COP1 degron close to the N terminus of ETV4 was abrogated via fusions of the gene with several partners in four tumors (Fig. 5i) using two different breakpoints.

Nine degron instances (new and known) were also identified as drivers across mutant cancer cell lines. Interesting candidates including ETV5 (encoded by a paralog of ETV4 with a new instance of an FBXW7 degron that was significant by SMDeg; Extended Data Fig. 5f), CCND3 (encoded by a paralog of CCND1 with an instance of an FBXO31 degron that was significant by both SMDeg and FMDeg; Extended Data Fig. 5g) and USP36 (a deubiquitinase that regulates MYC stability44 with a recurrently mutated instance of an SPOP degron; Extended Data Fig. 5g) were identified as well.

In summary, we uncovered 35 degrons under positive selection across primary tumors and cancer cell lines (Supplementary Table 4 and Supplementary Data), showing that the contribution of degron-affecting mutations to tumorigenesis goes far beyond the few currently known examples6,9.

The downstream effect of mutations of driver E3s

Measurement of protein stability changes across tumors also provides a way to assess the downstream effects of alterations of driver E3s. We first identified E3s with signals of positive selection in tumorigenesis by using dNdScv (ref. 43) and OncodriveFML (ref. 42). Any E3 deemed significant by either method (in a cohort of a particular cancer type or across cancer types) was considered to be a driver E3 (Fig. 6a and Supplementary Table 5). Thirty-seven E3s, including FBXW7, SPOP, APC, KEAP1, MAP3K1, VHL and RNF43, appeared significant in at least one cohort of primary tumors or across cancer cell lines (Extended Data Fig. 6a). Twenty-one had not been detected in a recent analysis carried out across TCGA primary tumors15 and did not appear in the Cancer Gene Census45 (Extended Data Fig. 6b,c). Nevertheless, the overlap between these two lists of driver E3s included the best known cases.

We reasoned that, just like mutations mapping to degrons, mutations affecting E3s (in particular their specific binding of degron sequences) could also impact the stability of target proteins. Thus, we compared the stability of proteins in the samples with mutations of a given driver E3 and those in which the same E3 was not mutated. The stability of CCNE1, MAPK3, NFE2L2 and HIF1A, for example, was significantly higher across colorectal, breast, lung and kidney tumors bearing mutations affecting FBXW7, MAP3K1, KEAP1 and VHL, respectively, than across tumors of the same cohorts with wild-type E3s (Fig. 6b).

Systematic application of this analysis to all proteins with RPPA information across primary tumors (Fig. 6c) and cancer cell lines (Fig. 6d) identified several significantly stabilized targets (E3–target pairs that are known to interact and which therefore constitute cases of likely direct stabilization are outlined with a black circle in the figure). This analysis picked up some well-known E3–target relationships (for example, FBXW7–CCNE1 and KEAP1–NFE2L2), but also highlighted new interesting links. The potential ubiquitination of MAPK1 via MAP3K1 through a de novo degron discussed above was given further credence by the discovery that mutations affecting MAP3K1 in breast tumors significantly increased the stability of MAPK3, encoded by a paralog of MAPK1 with high sequence identity. TSC2 was significantly more stable in breast and uterine carcinomas with mutations affecting UBE3A. PRKCA, a protein in which we identified a de novo degron, was significantly stabilized in the context of mutations mapping to two E3s, RNF31 and SMURF1. Interestingly, the increase in stability was more apparent for a form of the protein phosphorylated at S657, located 14 amino acids from the C-terminal end of the de novo degron (Extended Data Fig. 4d). Information on driver E3s and some of their potential targets is detailed in Supplementary Table 5.

Disruption of the UPS is an important contributor to tumorigenesis

Oncoproteins may exert their roles in tumorigenesis through an increase in their activity or the number of their units in the cell (Fig. 7a). For example, the increase in the number of genomic copies of an oncogene in a tumor (such as of MYC in breast adenocarcinomas) or its fusion to another gene (for example, BRAF in piloastrocytomas) may result in overexpression of its protein product. Alterations that disrupt the targeted degradation of an oncoprotein, for example, affecting its cognate E3 or its degron, produce a similar outcome (Fig. 7b,c). The level of the CCNE1 protein product, for example, was higher in tumors bearing a mutation mapping to its degron, an amplification of the CCNE1 gene and/or a mutation affecting FBXW7 (Fig. 7b) than in tumors with none of these alterations. Overall, almost 10% of all tumors in the TCGA pan-cancer cohort carried at least one alteration that resulted in an increase in the abundance of the CCNE1 protein product.

We aimed to estimate the global contribution of mutations affecting the UPS to tumorigenesis by computing the separate contribution of driver mutations affecting E3s and degron instances. For the former, we focused on the 37 E3s with at least one signal of positive selection across any cohort of primary tumors in TCGA (Extended Data Fig. 6b,c). We computed the excess of nonsynonymous, nonsense or splice-affecting mutations in the set of driver E3s across the pan-cancer cohort by using dNdScv (Fig. 7d), as well as their contribution to the excess mutations in each cohort (Fig. 7e). On average, there were almost two driver mutations in E3s for every ten primary tumors in TCGA. The contribution in some cohorts was greater, with colorectal tumors carrying, on average, more than one driver mutation affecting an E3, mostly APC or RNF43. Moreover, almost 1 in 15 primary tumors carried a nonsynonymous driver mutation affecting one of the 26 driver degron instances identified across primary tumors (Fig. 7f,g). Liver carcinomas stand out, with one of four carrying a driver mutation affecting the BTRC degron of CTNNB1. The combined driver mutations contributed by E3s and degrons represented more than 10% of all driver mutations affecting a list of 412 well-known driver genes45 (Fig. 7h). Because some of the most highly stabilizing mutations (among the 236 proteins with RPPA data) did not map to known or new degron instances, this is most likely an underestimation of the contribution of the UPS to tumorigenesis (Fig. 7i).

Finally, we reasoned that loss-of-function alterations of E3 ligases could in principle be therapeutically targeted via inhibitors of their overabundant downstream targets. We computed the number of patients in TCGA cohorts who could benefit in principle from this ‘indirect’ repurposing of anticancer drugs in clinical or preclinical stages (Fig. 7j). In each TCGA cohort, we counted the number of tumors that had mutations affecting known E3 targets that constitute known biomarkers of anticancer drug response, obtained from the Cancer Genome Interpreter46 (labeled as direct biomarker). For example, 67 (15%) TCGA uterine adenocarcinoma samples carried CCNE1 amplification, which constitutes a preclinical stage biomarker of response to CDK2 inhibitors (Extended Data Fig. 7). We also computed the number of tumors with alterations of the cognate E3 ligases, which result in the increase of the stability of the target protein (labeled as E3 ligase biomarker with evidence). Following on the previous example, we found 29 (6%) TCGA uterine adenocarcinoma samples with mutations of FBXW7 resulting in an increase in CCNE1 stability, which could in principle be targeted by using the same CDK2 inhibitors. Finally, an additional 41 (9%) FBXW7-mutant uterine adenocarcinoma samples with mutations of the E3 ligase without evidence of increased stability of the target (labeled as E3 ligase biomarker without evidence) could potentially be targeted by using these drugs. Repurposing anticancer drugs to target loss-of-function mutations of E3s could potentially benefit large proportions of tumors in certain cohorts, such as colorectal adenocarcinomas, which included 77 (17%) samples with an E3 ligase biomarker with evidence and 150 (33%) samples with an E3 ligase biomarker without evidence (Supplementary Table 6).

Discussion

Identifying the elements that make up the UPS is key to understanding how protein turnover operates in the cell, the regulation of key processes, such as the cell cycle, and the role that their dysregulation has in disease3,9,14. Here we systematically identified new instances of known degrons across human proteins, and we provide evidence of the functionality of some of them. While the new degron instances identified constitute the result of bioinformatics analyses, those supported by several lines of evidence (thoroughly annotated in Supplementary Table 2) represent a bona fide shortlist for experimental validation.

The existing data posed limitations to the approach taken in this study. On the one hand, only a few instances of a handful of degron motifs (and only sequence-based features) were available to train the classifier. On the other hand, the detection of protein abundance is still limited to either a small set of proteins (RPPA) or a small set of samples (MS). These limitations will be overcome by increasing data availability. As more degron instances are identified and their three-dimensional (3D) structures in complex with cognate E3 ligases are determined, structure-based features will be available to train classifiers to recognize true degron instances. Moreover, as more data on protein abundance become available (as expected, for example, from the enhanced Genotype–Tissue Expression (GTEx) consortium)46, application of the approach presented here will result in the discovery of yet-unknown degrons in a truly proteome-wide manner.

Caution must also be exercised in the interpretation of results. First, although the genetic variants analyzed might produce the observed effect through other mechanisms, such as disruption of microRNA binding sites47 or changes in translation efficiency, this problem was minimized by analyzing only coding mutations mapping to potential degrons and comparing with the effect of other mutations of the same proteins. Second, whereas driver mutations mapping to degrons might have a role in tumorigenesis through their effect on specific functions of the protein (rather than in the disruption of targeted degradation), this should occur only rarely, as degrons are depleted for protein functional domains (Fig. 1a,b). Third, in cancer cells, other elements of the UPS may be altered besides the specific degron-altering mutation under analysis, thus complicating the attribution of causality to the effect on protein abundance. This caveat was addressed by removing samples carrying potentially obscuring UPS alterations (of E2 enzymes, E3 ligases, adaptors and deubiquitinases) and high-level amplifications or deletions of the target protein.

The outcomes of this work should be of interest to several research communities. The list of matches of known degron motifs, and their annotations, constitutes a valuable resource to UPS researchers and protein engineers. We introduced a new framework for analysis of the influence of any type of alteration (in cis or trans) on the stability of a given protein, as well as a new score of the impact of mutations in degrons. We developed and made publicly available two new methods to identify degrons (or any other amino acid sequence motif) under positive selection. Finally, we revealed the comprehensive landscape of UPS disruption in tumorigenesis, including assessing the effect of alterations in E3 ligases and their targets. Our results shed light on the cellular downstream effects of a specific subset of driver mutations (affecting the UPS), representing a key line of research to bridge the gap between cancer genomics and cancer personalized medicine48,49,50,51. Overall, we estimate that more than 10% of driver mutations attributable to well-known cancer-related genes affect elements of the UPS. We anticipate that the approaches developed and tested here should pave the way to uncovering a more complete landscape of the involvement of the UPS in tumorigenesis in the near future.

Methods

Data collection and preprocessing

Cell-line-specific RPPA data (CCLE_RPPA_20180123; accessed 10 October 2018), antibody information (CCLE_RPPA_Ab_info_20180123; accessed 10 October 2018), RNA-seq data (CCLE_DepMap_18q3_RNAseq_RPKM_20180718; accessed 10 October 2018), CNA data (CCLE_copynumber_byGene_2013-12-03; accessed 10 October 2018) and somatic mutations (CCLE_DepMap_18q3_maf_20180718; accessed 10 October 2018) were downloaded from the Broad Institute portal (https://portals.broadinstitute.org/ccle/data). The Supplementary Note presents additional filters and preprocessing applied to the CCLE dataset.

MS data for the TCGA high-grade serous ovarian cancer (OV)52 and breast cancer (BRCA)53 cohorts were downloaded from the Clinical Proteomic Tumor Analysis Consortium by using TCGA-assembler 2 (ref. 54). The MS dataset contained 280 tumor samples, including 105 BRCA and 175 OV samples, and a total of 11,064 proteins with their levels measured through MS.

The list of proteins involved in ubiquitination (UBSs) and deubiquitination (DUBs) was manually created by integrating previous knowledge from UniProt (ref. 17) and E3NET (ref. 55). The final list of proteins involved in ubiquitination included 977 identifiers of human proteins.

A curated list of E3 ligases and their degradation substrates was downloaded from http://pnet.kaist.ac.kr/e3net/ (17 October 2018). Non-human interactions were filtered out. The APC–CTNNB1 interaction was manually added to the list. The final list included 833 interactions between E3 ligases and human substrates.

We downloaded human protein–protein interaction data from STRING (ref. 20) (9606.protein.links.detailed.v10.5.txt.gz; 19 February 2018). Pairs with a STRING score below 300 were filtered out. The final list of 2,568,513 protein–protein interactions comprised 18,403 proteins, including 566 proteins involved in ubiquitination.
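The score filter can be sketched as below. The sketch assumes that the STRING links file is space-delimited with a header row containing `protein1`, `protein2` and `combined_score` columns, and that the combined score is the one thresholded; both points are assumptions about the file and the filter, not statements from the text.

```python
import csv


def filter_links(lines, min_score=300, score_column="combined_score"):
    """Keep protein pairs whose score is at least `min_score`."""
    reader = csv.DictReader(lines, delimiter=" ")
    return [
        (row["protein1"], row["protein2"], int(row[score_column]))
        for row in reader
        if int(row[score_column]) >= min_score
    ]


# Toy usage with three made-up rows in the assumed layout.
toy_lines = [
    "protein1 protein2 neighborhood combined_score",
    "9606.A 9606.B 0 299",
    "9606.A 9606.C 0 300",
    "9606.B 9606.C 10 950",
]
kept = filter_links(toy_lines)
```

For the real file, the same function could be fed the handle returned by `gzip.open(path, "rt")`.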

We merged the information (that is, RPPA or MS log ratio(iTRAQ), RNA, CNA and somatic mutations) available for each gene (represented by a HUGO symbol) in each TCGA primary tumor or CCLE cell line. The matched RPPA TCGA dataset contained 6,909 samples and 236 antibodies that recognized 193 proteins. The matched RPPA CCLE dataset contained 861 samples and 212 antibodies that recognized 171 proteins. The matched MS TCGA dataset contained 201 samples and 10,076 proteins.

Protein–mRNA regression

Given an antibody–tumor type pair in the TCGA dataset, the relationship between the protein level (measured by RPPA) and the mRNA level (measured by the log2-transformed RSEM value) was estimated by robust linear regression, fitted by iteratively reweighted least squares (IRLS), because we wanted a linear model insensitive to outliers and high-leverage points. Samples with mutations, high-level amplifications (that is, CNA = 2) or alterations in annotated upstream E3s were excluded from the regression. For each gene in each sample, the residual, defined as the y-axis distance from the protein RPPA value to the regression line given the mRNA level, was calculated and then normalized (see the Supplementary Note for further information about normalization). MS cohorts followed a similar protocol, with log ratio(iTRAQ) as the measure of protein expression.

All specifics about calculation of raw residuals in the CCLE dataset are provided in the Supplementary Note.

To render the residuals comparable across different proteins and tumor types, it was necessary to correct for the influence of the dispersion of the samples at the protein and mRNA levels on the slope of the robust regression line that was used to compute the residuals. We ruled out distortion of the regression slope by unbalanced dispersions between the two axes by correcting both quantities by their respective standard deviations, thus obtaining a normalized residual, which we refer to as stability change.

$$\mathrm{Stability\ change} = \frac{\mathrm{raw\ residual} \cdot \mathrm{s.d.}(\mathrm{mRNA})}{\mathrm{s.d.}(\mathrm{RPPA})}$$

The Supplementary Note provides further specifics about the normalization procedure, including examples that illustrate its necessity.
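The regression and normalization steps can be illustrated with a minimal, dependency-free sketch. It is not the code used in the study: the Huber weight function, its tuning constant, the MAD-based scale estimate, the use of the population standard deviation and all function names are assumptions made for this example, and the exact procedure is the one given in the Supplementary Note.

```python
from statistics import median, pstdev


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def irls_huber(x, y, k=1.345, n_iter=100, tol=1e-10):
    """Robust line fit by IRLS; Huber weights are an assumption of this sketch."""
    a, b = weighted_line(x, y, [1.0] * len(x))  # start from ordinary least squares
    for _ in range(n_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        scale = median(abs(r) for r in resid) / 0.6745  # MAD-based residual scale
        if scale == 0:
            break
        # Huber weight: 1 inside the threshold, k*scale/|r| outside it
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        a_new, b_new = weighted_line(x, y, w)
        converged = abs(a_new - a) < tol and abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


def stability_change(mrna, protein, fit_idx=None):
    """Raw residual to the robust line, rescaled by s.d.(mRNA) / s.d.(protein).

    fit_idx optionally lists the samples used to fit the line (for example,
    those without mutations or high-level amplifications).
    """
    idx = range(len(mrna)) if fit_idx is None else fit_idx
    a, b = irls_huber([mrna[i] for i in idx], [protein[i] for i in idx])
    ratio = pstdev(mrna) / pstdev(protein)
    return [(p - (a + b * m)) * ratio for m, p in zip(mrna, protein)]
```

In this sketch a sample whose protein level lies far above the robust line receives a large positive stability change, whereas samples close to the line receive values near zero.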

To measure the effect of missense mutations and in-frame indels on the stability of proteins, we compared the average stability change of the mutated samples to the average stability change of wild-type samples. This test was performed for all antibodies with at least five mutations across the TCGA cohorts. Samples with high-level amplification or deletion were not considered. Two-sided Mann–Whitney tests were used to evaluate significance.

Degrons

Amino acid sequences from 32,022 reviewed human protein isoforms were downloaded from UniProt17. Each sequence was associated with a UniProt isoform ID and a HUGO symbol.

Degron motifs (that is, consensus motifs for a particular E3 and adaptor) and degron instances in the human proteome were downloaded from ELM16 (http://elm.eu.org/downloads.html; 15 May 2019) and previous studies9. Further details about gathering and filtering of degron motifs are explained in the Supplementary Note.

All specifics about calculation of the 11 biochemical features are presented in the Supplementary Note.

All details about generation and selection of random protein sequences and calculation of z scores are presented in the Supplementary Note.

Degron random forest classifier

All specifics about the creation and validation of the training sets and test sets, model fitting, feature contribution and application of the classifier are presented in the Supplementary Note.

We first defined a clean set of missense somatic mutations from both the TCGA and CCLE datasets. The clean set included missense mutations in genes that did not harbor any other single-nucleotide variant (SNV) in the same sample. By applying this filtering, we retrieved 1,079,471 missense mutations from TCGA and 433,336 missense mutations from CCLE. Next, each missense mutation was uniquely classified, according to its localization, into one of the following classes: mutation_altering_motif, mutation_flanking_PTM, mutation_flanking_Ub_lysine, mutation_flanking_lysines, mutation_flanking_degron and other_missense.

We likewise defined a clean set of in-frame indels from both the TCGA and CCLE datasets. The clean set included in-frame insertions or deletions in genes that did not harbor any other SNV in the same sample. By applying this filtering, we retrieved 8,355 in-frame indels from TCGA and 4,580 in-frame indels from CCLE. Next, each in-frame indel was classified, according to its localization, into one of the following classes: in_frame_altering_motif, in-frame_altering_flanking_PTM, in-frame_altering_flanking_Ub_lysine, in-frame_altering_flanking_lysines, in-frame_altering_flanking_degron and other_in_frame.

3D visualization of degron–E3 interactions

Chimera57 software was used to visualize interaction of the N-terminal degron of NFE2L2 with KEAP1 (PDB 3WN7) and the BTRC degron in CTNNB1 (PDB 1P22). The Chimera mutate_residue tool was used to generate 3D models of the NFE2L2 D29H and CTNNB1 S37C mutants.

Statistical tests of degron alterations

We performed several comparisons of the effect of degron-affecting mutations against the effect of other missense mutations and the wild-type form of the proteins. In all comparisons, samples with SNVs in the E3 ligase, high-level amplifications or deletions of the substrate, or alterations that disrupted the epitope of the antibody (and decreased the measured stability) were not considered. Because of the extreme effect of TP53 alterations, mutations affecting TP53 were filtered out of the analysis. Two-sided Mann–Whitney tests were used to evaluate statistical significance. Specifics about each individual comparison can be found in the Supplementary Note.

Identification of de novo degrons

We first selected missense mutations and in-frame indels from TCGA that had a stability change value and did not map to any degron match or its flanking positions. Mutations in samples with altered upstream E3 ligases, with high-level amplification or deletion, or with alterations that disrupted the epitope of the antibody were filtered out. A total of 9,322 missense mutations and 259 in-frame indels were selected for further analysis. We next ranked these mutations according to their stability change values and selected mutations in the top 25%. These mutations were considered highly stabilizing. A total of 2,310 missense mutations and 79 in-frame indels made up the set of highly stabilizing alterations.

We then created a protein-wide profile of degron likeness and selected the regions to which at least two highly stabilizing mutations mapped (a step-by-step explanation of the methodology is available in the Supplementary Note).

Positive selection analysis of degrons

We confined the search for positive selection in degrons to new and known degron instances (degron probability > 0.5). For each of these instances, the region to be analyzed contained all residues in the degron motif, extended by 11 flanking amino acids on each side. All regions were subsequently mapped to GRCh37 genomic coordinates. To perform the mapping, we first retrieved, when available, the Consensus CDS (CCDS) identifier for each UniProt isoform ID. Motifs in isoforms lacking a CCDS were excluded. Next, for all amino acids in a degron region, we used TransVar58 to convert the protein position to genomic coordinates. Motif instances mapping to genomic coordinates on sex chromosomes were discarded. Genomic coordinates for 15,759 motif instances were successfully retrieved. A similar workflow was followed to retrieve the genomic coordinates of the 19 de novo degrons.
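As a rough illustration of the windowing step described above, the sketch below extends a motif by 11 residues on each side. The function name, the 1-based inclusive coordinates and the clipping at the protein ends are assumptions made for this example; the subsequent mapping to GRCh37 coordinates with TransVar is not shown.

```python
def degron_region(motif_start, motif_end, protein_length, flank=11):
    """Return the (start, end) protein interval of a degron region.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    The motif is extended by `flank` residues on each side and clipped
    to the boundaries of the protein.
    """
    if not 1 <= motif_start <= motif_end <= protein_length:
        raise ValueError("motif coordinates must lie within the protein")
    start = max(1, motif_start - flank)
    end = min(protein_length, motif_end + flank)
    return start, end


# Illustrative coordinates only: a six-residue motif in a 500-residue protein.
print(degron_region(100, 105, 500))
```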

SNVs that were missense mutations were collected from the TCGA and CCLE datasets (see above for more information). For each TCGA cancer type, we removed hypermutated samples. Hypermutated samples were defined as samples for which the number of missense mutations was more than 3.5 times the interquartile range plus the 75th percentile of the distribution of missense mutations in all samples in that cohort (or in the pan-cancer cohort for the pan-cancer analysis). Moreover, a minimum of 1,500 missense mutations was needed for the sample to be considered hypermutated. One hundred two TCGA samples were considered to be hypermutated. A total of 1,010,732 missense mutations made up the dataset of mutations. Similarly, the CCLE dataset had 62 hypermutated cell lines as determined with the same rationale. A total of 378,200 missense mutations made up the CCLE dataset of mutations.
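The hypermutation rule stated above (more than the 75th percentile plus 3.5 times the interquartile range, and at least 1,500 missense mutations) can be sketched as follows. The function name, the input layout and the quantile interpolation method are assumptions of this example, as the text does not specify how percentiles were computed.

```python
from statistics import quantiles


def hypermutated_samples(missense_counts, iqr_factor=3.5, min_mutations=1500):
    """Return the IDs of samples flagged as hypermutated within one cohort.

    missense_counts maps a sample ID to its number of missense mutations.
    """
    # Quartiles of the cohort distribution; the "inclusive" interpolation
    # method is an arbitrary choice made for this sketch.
    q1, _, q3 = quantiles(missense_counts.values(), n=4, method="inclusive")
    cutoff = q3 + iqr_factor * (q3 - q1)
    return {
        sample
        for sample, n in missense_counts.items()
        if n > cutoff and n >= min_mutations
    }
```

Both conditions must hold, so a sample that is an outlier in a cohort with a low mutation burden is retained if it carries fewer than 1,500 missense mutations.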

We used in-frame indels from CCLE and TCGA (see above for more information about the processing of in-frame indels) to perform the positive selection test.

Pan-cancer gene fusions in the TCGA cohort were collected from the Tumor Fusion Gene Data Portal (http://www.tumorfusions.org/; 10 July 2018). We focused on fusions that conserved the reading frame (that is, in-frame fusions) and with a reliable level of evidence (that is, tier 1, tier 2 or tier 3).

We developed SMDeg, a method to identify motif instances with a significantly higher number of missense mutations than expected by chance. For each query region within a protein in a cohort of samples, SMDeg determines whether the number of mutations is significantly higher than the expected number given the background mutation rate of the analyzed sample. Step-by-step instructions for the workflow alongside implementation details are presented in the Supplementary Note.

We developed a score, degFI, that quantifies the impact of missense mutations on the functionality of a degron. All specifics about the calculation and justification of the degFI score, together with examples, are presented in the Supplementary Note.

OncodriveFML42 is a tool that detects genes under positive selection by analyzing functional impact bias in somatic mutations. Herein we defined functional impact as the effect on protein degradation, as measured by degFI. OncodriveFML requires as input the genomic regions to be analyzed. In our particular case, we used the genomic regions overlapping degrons defined above. For each genomic region of interest with at least two missense mutations in the cohort of interest, OncodriveFML with degFI score (hereafter named FMDeg) simulated an equal number of missense mutations within the genomic region of interest. We used 10⁸ simulations to compute the empirical P value. Because degFI is defined to measure the impact of single-amino-acid changes in protein degradation, other types of SNVs such as synonymous mutations, splice-affecting variants and nonsense mutations were not considered.
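The logic of an empirical P value obtained by simulating an equal number of mutations within a region can be sketched as below. This is a simplified illustration rather than the OncodriveFML or FMDeg implementation: drawing mutations uniformly from the possible changes, using the mean score as the test statistic, adding a pseudocount and all names are assumptions of this example.

```python
import random


def empirical_p_value(observed_scores, region_scores, n_sim=10_000, seed=0):
    """Randomization P value for the mean impact score of observed mutations.

    observed_scores: impact scores of the mutations observed in the region.
    region_scores:   impact scores of all possible missense changes in the
                     region, sampled uniformly here (a simplification).
    """
    rng = random.Random(seed)
    k = len(observed_scores)
    observed_mean = sum(observed_scores) / k
    hits = 0
    for _ in range(n_sim):
        simulated = rng.choices(region_scores, k=k)  # same number of mutations
        if sum(simulated) / k >= observed_mean:
            hits += 1
    # The pseudocount keeps the estimate above zero for a finite number of draws.
    return (hits + 1) / (n_sim + 1)
```

The smallest P value that can be reported is bounded by the number of simulations, which is one reason to use a large number of them.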

We defined as driver degrons those new or known degron instances that were significant (q ≤ 0.1) or nearly significant (q ≤ 0.25) in FMDeg and significant in SMDeg (q ≤ 0.1). For de novo degrons, we used only the SMDeg test and considered as drivers those with q ≤ 0.1.

We estimated the number of driver degrons in genes that were already annotated in the Cancer Gene Census59 (downloaded 5 June 2019).

All details about the positive selection test for in-frame indels are presented in the Supplementary Note.

We detected gene fusions from TCGA (see above) that led to total loss of a predicted instance of a degron. Further details about the procedure to map fusions to degrons are presented in the Supplementary Note.

Analysis of E3 ligase alterations and their role in tumorigenesis

To detect E3 ligases with signals of positive selection in primary tumors and cancer cell lines, we ran OncodriveFML42 (default coding parameters, including indels, sampling of 10⁸ iterations and CADDv1.4 (ref. 60) as functional impact score) and dNdScv43 (default parameters). To run both methods, we removed hypermutated samples from TCGA and CCLE, as described above for SMDeg and FMDeg tests. We filtered the output of both methods to narrow down the search to E3 ligases by using the curated list of proteins involved in ubiquitination. We used an FDR of 10% to define as significant the signal of an E3 ligase in a particular cohort. The union of the two outputs composed the set of significant E3 ligases in a dataset.

All specifics about the comparison of driver E3s with previous studies are reported in the Supplementary Note.

We analyzed the downstream effect of nonsynonymous variants in E3 ligases on their associated degradation targets. We discarded samples bearing genomic alterations (that is, CNAs and nonsynonymous mutations) in the substrates or other E3 ligases. A minimum of ten mutated samples were required to perform the comparison. Two-sided Mann–Whitney tests were used to evaluate significance. P values were adjusted with a multiple-testing correction by using the Benjamini–Hochberg procedure (alpha = 0.05). Step-by-step specification of the methodology is described in the Supplementary Note.
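For reference, the Benjamini–Hochberg adjustment mentioned above can be written in a few lines. This is a generic textbook sketch with a hypothetical function name, not the code used in the study; comparisons would then be called significant when their adjusted value does not exceed the chosen alpha.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value so that the adjusted
    # values are monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in q_values]
```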

Driver mutation analysis

We estimated the number of mutations in excess (that is, the difference between the number of mutations observed and the number of mutations expected according to a neutral selection model) in driver E3s. To compute such estimates, we followed the methodology laid out by dNdScv43. Briefly, dNdScv provides a gene-specific estimation of the ratio of nonsynonymous to synonymous substitutions (dN/dS) that is corrected by (i) chromatin features explaining regional variability in the neutral mutation rate, (ii) the consequence type of substitutions and (iii) the mutational processes operative in the tumor. For our analysis, we grouped the nonsynonymous substitutions into three main consequence type groups: missense mutations, nonsense mutations and mutations affecting splicing.

Upon estimation of $$\omega_c$$ for a given consequence type c, we can estimate the number of mutations in excess for each gene–cohort and consequence type c as $$e_c = \left( {\omega _c - 1} \right) \cdot m_c/\omega _c$$, where $$m_c$$ is the number of mutations observed with consequence type c. By adding the number of mutations in excess across a pool of genes (bearing signals of positive selection), we can provide an estimate of the number of driver mutations per sample. We used this rationale to calculate the number of driver mutations mapping to the 37 E3 ubiquitin ligases for each sample across the 32 TCGA cohorts (31 tumor types plus the pan-cancer cohort).
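The expression for the excess can be read as follows, under the usual interpretation of dN/dS as the ratio of observed mutations to those expected under neutrality. The symbol $$n_c$$ for the neutral expectation is introduced here only for illustration and does not appear in the original text.

$$\omega_c = \frac{m_c}{n_c} \;\Rightarrow\; n_c = \frac{m_c}{\omega_c}, \qquad e_c = m_c - n_c = m_c\left(1 - \frac{1}{\omega_c}\right) = \left(\omega_c - 1\right) \cdot m_c/\omega_c$$

For instance, with $$\omega_c = 4$$ and $$m_c = 20$$, five mutations would be expected under neutrality and the remaining 15 would be counted as mutations in excess.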

We performed an estimation of the number of missense mutations in excess (that is, driver mutations) in driver degrons across TCGA cohorts. We first defined a list of driver degrons as those degrons showing significant signals of positive selection according to both FMDeg and SMDeg tests across the TCGA cohorts (see above). A total of 26 genes bore degrons with significant signals of positive selection.

To estimate the number of missense mutations in excess in degrons, we reproduced the analytical steps of dNdScv to compute the gene-specific $$\omega_{\mathrm{mis}}$$ estimate, albeit with our own subgenic elements of choice. Thus, for any gene, we were able to compute $$\omega_{\mathrm{mis}}$$ for both the entire CDS and smaller coding subsequences. As degrons encompass only small stretches spanning ≤5% of the total CDS length, the method lacks the sensitivity to tackle the problem by direct analysis of the degron sequence. However, we can estimate $$\omega_{\mathrm{mis}}$$ for both the entire gene and the complement of the degron sequence, from which we can infer the number of missense mutations in excess for each. Finally, the number of missense mutations in excess in the degron can be given as

$$e_{\textrm{deg}} = e_{\textrm{gene}} - e_{\textrm{comp}}$$

where $$e_{\textrm{gene}}$$ and $$e_{\textrm{comp}}$$ are the missense mutations in excess of the gene and the degron complement, respectively, computed from the respective values of $$\omega_{\mathrm{mis}}$$ as described in the previous section.
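The two formulas combine into a short calculation, sketched below with hypothetical function names and made-up input values. Estimating the ω values themselves requires the dNdScv model and is not shown.

```python
def excess_mutations(omega, n_observed):
    """Mutations in excess of the neutral expectation: (omega - 1) * m / omega."""
    if omega <= 0:
        raise ValueError("omega must be positive")
    return (omega - 1) * n_observed / omega


def excess_in_degron(omega_gene, m_gene, omega_comp, m_comp):
    """Excess missense mutations in a degron: e_gene - e_comp.

    omega_gene, m_gene: estimate and observed missense count for the entire CDS.
    omega_comp, m_comp: the same quantities for the CDS without the degron.
    """
    return excess_mutations(omega_gene, m_gene) - excess_mutations(omega_comp, m_comp)


# Made-up values for illustration: 20 missense mutations in the gene with
# omega = 4, of which 12 fall outside the degron, where omega = 2.
print(excess_in_degron(4.0, 20, 2.0, 12))
```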

We estimated the relative contribution of driver mutations involved in the UPS as compared to the total driver mutations across human primary tumors from TCGA. To do so, we first performed a sample-specific estimation of the total number of driver mutations in driver genes across the TCGA cohorts. We defined a dataset of 412 driver genes as the union of the 369 genes defined in the dNdScv manuscript, the 37 driver E3 ligases and the 26 genes bearing driver degrons. We followed the same procedure described in previous sections to estimate the total number of driver mutations per cohort. We then computed, for each sample, the ratio between driver mutations associated with the UPS (that is, either driver mutations in E3 ligases or driver mutations in degrons) and all driver mutations in that sample. This was performed across all TCGA samples analyzed.

Methodological details of this section alongside a step-by-step explanation are presented in the Supplementary Note.

Statistics and reproducibility

Our paper consists of statistical analyses applied to somatic mutations, protein expression, mRNA expression and CNAs in two different cancer sequencing projects (TCGA and CCLE). All statistical tests developed in the study are described at length in the corresponding subsections of the Methods or of the Supplementary Note. Other statistical tests employed are detailed in figure legends. All tumor samples and cancer cell lines with the types of data needed for each analysis were included, except hypermutated tumors, which were excluded from certain analyses (see above). Many of these analyses, described above, were designed and implemented by us in Python by using interfaces such as IPython61 and readily available libraries such as pandas62 and numpy63. Details and results of these analyses (the number of samples, number of mutations, effect sizes or P values) are indicated for a few examples in pertinent Results sections and within the main figures. The details and results of all analyses carried out (both present and absent from the main figures) are presented in the Supplementary Information. Figures were constructed using Matplotlib64, seaborn65 and Bokeh66. All analyses may be readily reproduced by using the code developed in the study.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability

All software and data produced as part of the study (including scripts needed to reproduce all results described in the paper) are available at https://bitbucket.org/account/user/bbglab/projects/PD.

References

1. Al-Hakim, A. et al. The ubiquitous role of ubiquitin in the DNA damage response. DNA Repair 9, 1229–1240 (2010).

2. Arlow, T., Scott, K., Wagenseller, A. & Gammie, A. Proteasome inhibition rescues clinically significant unstable variants of the mismatch repair protein Msh2. Proc. Natl Acad. Sci. USA 110, 246–251 (2013).

3. Bassermann, F., Eichner, R. & Pagano, M. The ubiquitin proteasome system—implications for cell cycle control and the targeted treatment of cancer. Biochim. Biophys. Acta Mol. Cell Res. 1843, 150–162 (2014).

4. Ciechanover, A., Heller, H., Elias, S., Haas, A. L. & Hershko, A. ATP-dependent conjugation of reticulocyte proteins with the polypeptide required for protein degradation. Proc. Natl Acad. Sci. USA 77, 1365–1368 (1980).

5. Gillette, T. G. et al. Distinct functions of the ubiquitin–proteasome pathway influence nucleotide excision repair. EMBO J. 25, 2529–2538 (2006).

6. Guharoy, M., Bhowmick, P., Sallam, M. & Tompa, P. Tripartite degrons confer diversity and specificity on regulated protein degradation in the ubiquitin–proteasome system. Nat. Commun. 7, 10239 (2016).

7. Hershko, A., Ciechanover, A., Heller, H., Haas, A. L. & Rose, I. A. Proposed role of ATP in protein breakdown: conjugation of protein with multiple chains of the polypeptide of ATP-dependent proteolysis. Proc. Natl Acad. Sci. USA 77, 1783–1786 (1980).

8. Liu, Y., Beyer, A. & Aebersold, R. On the dependency of cellular protein levels on mRNA abundance. Cell 165, 535–550 (2016).

9. Mészáros, B., Kumar, M., Gibson, T. J., Uyar, B. & Dosztányi, Z. Degrons in cancer. Sci. Signal. 10, eaak9982 (2017).

10. Yoo, S.-H. et al. Competing E3 ubiquitin ligases govern circadian periodicity by degradation of CRY in nucleus and cytoplasm. Cell 152, 1091–1105 (2013).

11. Stewart, M. D., Ritterhoff, T., Klevit, R. E. & Brzovic, P. S. E2 enzymes: more than just middle men. Cell Res. 26, 423–440 (2016).

12. Braten, O. et al. Numerous proteins with unique characteristics are degraded by the 26S proteasome following monoubiquitination. Proc. Natl Acad. Sci. USA 113, E4639–E4647 (2016).

13. Komander, D., Clague, M. J. & Urbé, S. Breaking the chains: structure and function of the deubiquitinases. Nat. Rev. Mol. Cell Biol. 10, 550–563 (2009).

14. Vu, P. K. & Sakamoto, K. M. Ubiquitin-mediated proteolysis and human disease. Mol. Genet. Metab. 71, 261–266 (2000).

15. Ge, Z. et al. Integrated genomic analysis of the ubiquitin pathway across cancer types. Cell Rep. 23, 213–226 (2018).

16. Dinkel, H. et al. ELM 2016—data update and new functionality of the Eukaryotic Linear Motif resource. Nucleic Acids Res. 44, D294–D300 (2016).

17. Bateman, A. et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017).

18. Kim, T. Y. et al. Substrate trapping proteomics reveals targets of the βTrCP2/FBXW11 ubiquitin ligase. Mol. Cell. Biol. 35, 167–181 (2015).

19. Arabi, A. et al. Proteomic screen reveals Fbw7 as a modulator of the NF-κB pathway. Nat. Commun. 3, 976 (2012).

20. Franceschini, A. et al. STRINGv9.1: protein–protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2013).

21. Ellrott, K. et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 6, 271–281 (2018).

22. Li, J. et al. TCPA: a resource for cancer functional proteomics data. Nat. Methods 10, 1046–1047 (2013).

23. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).

24. Shibata, T. et al. Cancer related mutations in NRF2 impair its recognition by Keap1–Cul3 E3 ligase and promote malignancy. Proc. Natl Acad. Sci. USA 105, 13568–13573 (2008).

25. Liu, C. et al. β-Trcp couples β-catenin phosphorylation–degradation and regulates Xenopus axis formation. Proc. Natl Acad. Sci. USA 96, 6273–6278 (1999).

26. Santra, M. K., Wajapeyee, N. & Green, M. R. F-box protein FBXO31 mediates cyclin D1 degradation to induce G1 arrest after DNA damage. Nature 459, 722–725 (2009).

27. Li, Y. et al. Structural basis of the phosphorylation-independent recognition of cyclin D1 by the SCF FBXO31 ubiquitin ligase. Proc. Natl Acad. Sci. USA 115, 319–324 (2018).

28. Lukashchuk, N. & Vousden, K. H. Ubiquitination and degradation of mutant p53. Mol. Cell. Biol. 27, 8284–8295 (2007).

29. Wawrzynow, B., Zylicz, A. & Zylicz, M. Chaperoning the guardian of the genome. The two-faced role of molecular chaperones in p53 tumor suppressor action. Biochim. Biophys. Acta Rev. Cancer 1869, 161–174 (2018).

30. Qiu, X.-B. & Goldberg, A. L. Nrdp1/FLRF is a ubiquitin ligase promoting ubiquitination and degradation of the epidermal growth factor receptor family member, ErbB3. Proc. Natl Acad. Sci. USA 99, 14843–14848 (2002).

31. Huang, Z. et al. The E3 ubiquitin ligase NEDD4 negatively regulates HER3/ErbB3 level and signaling. Oncogene 34, 1105–1115 (2015).

32. Lu, Z., Xu, S., Joazeiro, C., Cobb, M. H. & Hunter, T. The PHD domain of MEKK1 acts as an E3 ubiquitin ligase and mediates ubiquitination and degradation of ERK1/2. Mol. Cell 9, 945–956 (2002).

33. Nakamura, M., Tokunaga, F., Sakata, S. & Iwai, K. Mutual regulation of conventional protein kinase C and a ubiquitin ligase complex. Biochem. Biophys. Res. Commun. 351, 340–347 (2006).

34. Chen, D. et al. Amplitude control of protein kinase C by RINCK, a novel E3 ubiquitin ligase. J. Biol. Chem. 282, 33776–33787 (2007).

35. Saei, A. et al. Loss of USP28-mediated BRAF degradation drives resistance to RAF cancer therapies. J. Exp. Med. 215, 1913–1928 (2018).

36. Hernandez, M. A. et al. Regulation of BRAF protein stability by a negative feedback loop involving the MEK–ERK pathway but not the FBXW7 tumour suppressor. Cell. Signal. 28, 561–571 (2016).

37. Galligan, J. T. et al. Proteomic analysis and identification of cellular interactors of the giant ubiquitin ligase HERC2. J. Proteome Res. 14, 953–966 (2015).

38. Li, D. et al. ARAF recurrent mutation causes central conducting lymphatic anomaly treatable with a MEK inhibitor. Nat. Med. 25, 1116–1122 (2019).

39. Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385 (2018).

40. Gonzalez-Perez, A. et al. IntOGen-mutations identifies cancer drivers across tumor types. Nat. Methods 10, 1081–1082 (2013).

41. Tamborero, D. et al. Comprehensive identification of mutational cancer driver genes across 12 tumor types. Sci. Rep. 3, 2650 (2013).

42. Mularoni, L. et al. OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations. Genome Biol. 17, 128 (2016).

43. Martincorena, I. et al. Universal patterns of selection in cancer and somatic tissues. Cell 171, 1029–1041 (2017).

44. Sun, X.-X. et al. The nucleolar ubiquitin-specific protease USP36 deubiquitinates and stabilizes c-Myc. Proc. Natl Acad. Sci. USA 112, 3734–3739 (2015).

45. Futreal, A. et al. A census of human cancer genes. Nat. Rev. Cancer 4, 177–183 (2004).

46. Tamborero, D. et al. Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med. 10, 25 (2018).

47. Hausser, J., Syed, A. P., Bilen, B. & Zavolan, M. Analysis of CDS-located miRNA target sites suggests that they can effectively inhibit translation. Genome Res. 23, 604–615 (2013).

48. Gonzalez-Perez, A. Circuits of cancer drivers revealed by convergent misregulation of transcription factor targets across tumor types. Genome Med. 8, 6 (2016).

49. Gonzalez-Perez, A., Jene-Sanz, A. & Lopez-Bigas, N. The mutational landscape of chromatin regulatory factors across 4,623 tumor samples. Genome Biol. 14, R106 (2013).

50. Frigola, J., Iturbide, A., Lopez-Bigas, N., Peiro, S. & Gonzalez-Perez, A. Altered oncomodules underlie chromatin regulatory factors driver mutations. Oncotarget 7, 30748–30759 (2016).

51. Sabarinathan, R. et al. The whole-genome panorama of cancer drivers. Preprint at bioRxiv https://doi.org/10.1101/190330 (2017).

52. Zhang, H. et al. Integrated proteogenomic characterization of human high-grade serous ovarian cancer. Cell 166, 755–765 (2016).

53. Mertins, P. et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534, 55–62 (2016).

54. Wei, L. et al. TCGA-assembler 2: software pipeline for retrieval and processing of TCGA/CPTAC data. Bioinformatics 34, 1615–1617 (2018).

55. Han, Y., Lee, H., Park, J. C. & Yi, G.-S. E3Net: a system for exploring E3-mediated regulatory networks of cellular functions. Mol. Cell. Proteomics 11, O111.014076 (2012).

56. Hornbeck, P. V. et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 43, D512–D520 (2015).

57. Pettersen, E. F. et al. UCSF Chimera—a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605–1612 (2004).

58. Zhou, W. et al. TransVar: a multilevel variant annotator for precision genomics. Nat. Methods 12, 1002–1003 (2015).

59. Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat. Rev. Cancer 18, 696 (2018).

60. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

61. Perez, F. & Granger, B. E. IPython: a system for interactive scientific computing. Comput. Sci. Eng. 9, 21–29 (2007).

62. McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (O’Reilly Media, Inc., 2017).

63. Oliphant, T. E. Guide to NumPy (CreateSpace Independent Publishing Platform, 2015).

64. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

65. Waskom, M. et al. seaborn v0.5.0 Zenodo https://doi.org/10.5281/zenodo.12710 (2014).

66. Jolly, K. Hands-On Data Visualization with Bokeh: Interactive Web Plotting for Python Using Bokeh (Packt Publishing, 2018).

Acknowledgements

N.L.-B. acknowledges funding from the European Research Council (consolidator grant 682398) and the ERDF/Spanish Ministry of Science, Innovation and Universities–Spanish State Research Agency/DamReMap Project (RTI2018-094095-B-I00). A.G.-P. is supported by a Ramón y Cajal contract (RYC-2013-14554). IRB Barcelona is a recipient of a Severo Ochoa Centre of Excellence Award from the Spanish Ministry of Economy and Competitiveness (MINECO; Government of Spain) and is supported by CERCA (Generalitat de Catalunya). The results shown here are in whole or part based upon data generated by the TCGA Research Network. Data used in this publication were generated by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC).

Author information

F.M.-J. prepared and carried out most analyses, including development of their statistical framework. F.M. carried out the estimation of excess mutations and contributed to the development of the statistical framework to compute protein expression residuals. E.L.-A. contributed to the annotation of degrons, curation of antibodies for RPPA data and preparation of Excel files. N.L.-B. and A.G.-P. conceived and oversaw the study. F.M.-J., N.L.-B. and A.G.-P. drafted the manuscript. All authors participated in interpretation and discussion of the results and in the final version of the manuscript.

Correspondence to Nuria Lopez-Bigas or Abel Gonzalez-Perez.

Ethics declarations

Competing interests

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Identification of novel degron instances.

(a) Distribution of the values of biochemical properties of annotated degron instances and equally long randomly chosen sequences from the human proteome. The p-values were derived from two-tailed Mann-Whitney tests. Left N: number of validated degron instances; right N: number of random protein sequences sampled from the proteome. (b) Over- or under-representation of each amino acid (Fisher’s exact test odds ratio) across the sequence of annotated degron instances. Significant cases (p-value < 0.05) are circled in black. Relevant numbers are defined in (a). (c) Stratified 5-fold cross-validation ROC curve (as Fig. 1c) of a random forest classifier trained on annotated degrons and random sequences from the same set of proteins. Relevant numbers are defined in (a). (d) Precision-recall curve of the random forest classifier described in the main paper (5-fold cross-validation). Relevant numbers are defined in (a). (e) Stratified 5-fold cross-validation ROC curve of a random forest classifier trained as described in the main paper, but adding random features highly correlated to the 11 used in the main paper. Relevant numbers are defined in (a). (f, g) Biochemical features at the top of the list of importance according to the classifiers trained in the main paper and for panel c above. Bars represent the mean importance of each feature across the dataset, with the whiskers representing one standard deviation. (h) Stratified 5-fold cross-validation ROC curve resulting from the classification (with the random forest classifier described in the main paper) of experimentally identified FBXW11 degron instances and amino acid sequences of the same length randomly sampled from human proteins. Number of positive and negative instances defined at the top of the panel. (i) Stratified 5-fold cross-validation ROC curve resulting from the classification (with the random forest classifier described in the main paper) of experimentally identified FBXW11 degrons and random amino acid sequences from proteins deemed non-FBXW11 targets. Number of positive and negative instances defined at the top of the panel. (j) Stratified 5-fold cross-validation ROC curve resulting from the classification (with the random forest classifier described in the main paper) of experimentally identified FBXW7 degrons and amino acid sequences of the same length randomly sampled from human proteins. Number of positive and negative instances defined at the top of the panel. (k-m) Correlation between the length of proteins and the number of matches (k), novel degron instances (l), and novel degron instances with annotations (m) in their sequence. The numbers shown (R-value) correspond to Pearson’s correlation coefficient. The trendline and its confidence intervals are shown as a line and a shaded area, respectively. N: number of proteins (k), novel degron instances (l), or novel degron instances with further supporting information (m).

Extended Data Fig. 2 Distribution of degron probability of the matches of each motif.

Each plot corresponds to the matches of one degron motif identified across the proteome. Degron probabilities are represented as a frequency histogram (solid light purple bars for motif matches and solid dark purple bars for annotated degron instances) and as the corresponding kernel-smoothed distribution (purple lines). Dashed vertical lines mark the position in the distribution that corresponds to the annotated degron with the lowest probability, which is used as the threshold to select high-confidence novel degron instances. For degron motifs with no annotated degron instance (that is, without a solid dark purple histogram), the threshold is set at the lowest degron probability of any annotated degron (that is, 0.65). Values for all individual degrons are presented in Supplementary Table 2 and Supplementary Data.
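The thresholding rule described in this legend can be illustrated with a minimal sketch. It is not the authors' code: the function names, the example identifiers, and the choice of an inclusive comparison (>=) are assumptions made for illustration; only the rule itself (lowest annotated probability per motif, with 0.65 as the fallback) comes from the legend.

```python
from typing import Dict, List

# Fallback stated in the legend: the lowest degron probability of any annotated degron.
GLOBAL_MIN_ANNOTATED_PROBABILITY = 0.65


def motif_threshold(annotated_probs: List[float],
                    fallback: float = GLOBAL_MIN_ANNOTATED_PROBABILITY) -> float:
    """Return the cutoff for one motif.

    Uses the lowest probability among the motif's annotated degron instances,
    or the global fallback when the motif has no annotated instance.
    """
    return min(annotated_probs) if annotated_probs else fallback


def select_high_confidence(match_probs: Dict[str, float],
                           annotated_probs: List[float]) -> List[str]:
    """Return the identifiers of matches that reach the motif's threshold.

    The inclusive comparison (>=) is an assumption; the legend does not say
    whether the boundary value itself is kept.
    """
    threshold = motif_threshold(annotated_probs)
    return sorted(m for m, p in match_probs.items() if p >= threshold)


# Hypothetical match identifiers and probabilities, for illustration only.
matches = {"match_1": 0.70, "match_2": 0.80, "match_3": 0.60}
with_annotations = select_high_confidence(matches, [0.90, 0.75])
without_annotations = select_high_confidence(matches, [])
```

With annotated instances at 0.90 and 0.75, only matches scoring at least 0.75 are retained; with no annotated instance, the 0.65 fallback applies.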

Extended Data Fig. 3 Mutations affecting degrons increase the stability of proteins.

(a) Needle-plot representing the distribution of primary tumor mutations along the sequence of CTNNB1 (analogous to that of NFE2L2 in main Fig. 3a). (b) One recurrent mutation (S37C) projected onto the 3D structure of the CTNNB1-BTRC complex. (c, d) Comparisons of protein stability change upon mutations analogous to those represented in main Fig. 3e,f, restricted to tumors in which the gene harboring the degron under analysis is diploid. As in Fig. 3f, all p-values shown in this figure are derived from a one-tailed Mann-Whitney test. When two rows of p-values appear, the top value corresponds to the comparison between the distribution of stability change values of mutations in different groups and that of wild-type forms of the proteins, and the bottom value to the comparison with all missense mutations in the dataset. (e) Distribution of protein stability change caused by mutations in novel degron instances in different quartiles of degron probability. (f, g) Comparisons of protein stability upon mutations analogous to those represented in main Fig. 3e,f, but carried out using cancer cell line mutations. (h) Same as panel (e) for cancer cell line mutations. (i) Thirteen proteins carrying mutations in novel degron instances exhibit a clear trend towards stability increase (determined using mass spectrometry rather than RPPA as in previous examples), although the trend is not significant owing to lack of statistical power. (j) Distribution of stability change of proteins with non-synonymous mutations in different quartiles of VAF (that is, present in different fractions of tumor cells) which do not overlap with known or novel degron instances. The p-values correspond to the comparison (one-tailed Mann-Whitney test) between the distribution of stability change values of mutations in each quartile and that of wild-type forms of the proteins. N: number of mutations in groups (in all panels). Boxplots in all panels are defined as in Fig. 2.

Extended Data Fig. 4 Identification of de novo degrons.

Identification of annotated degrons in CTNNB1 (a), NFE2L2 (b), MET (c), PRKCA (d), BRAF (e), and ARAF (f) using the approach devised to identify de novo degrons. The panels follow the same composition and color codes as those in Fig. 4f,g. In parentheses, the names of the corresponding antibodies. N: number of tumor samples in each group.

Extended Data Fig. 5 Positive selection in degrons.

(a, b) QQ-plots relating the observed and expected distributions of p-values produced by the SMDeg (a) and FMDeg (b) tests on the TCGA pan-cancer cohort. N: number of tumor samples. (c, d) Novel degron instances that appear significant (FDR < 1%) in the SMDeg test (c), or significant (FDR < 10%) or nearly significant (FDR < 25%) in the FMDeg test (d) across cancer cell lines. N: number of cancer cell lines. (e) De novo degron instances that appear significant (FDR < 1%) in the SMDeg test across TCGA primary tumors. N: number of tumor samples. (f-h) Needle-plots representing the distribution of mutations in cancer cell lines along the sequences of ETV5 (f; significant in SMDeg), CCND3 (g; significant in SMDeg and FMDeg), and USP36 (h; significant in SMDeg).

Extended Data Fig. 6 Driver E3s.

(a) Driver E3s across cancer cell lines are identified through signals of positive selection detected by OncodriveFML and dNdScv. Analogous to main Fig. 6a. N: number of cancer cell lines. (b) The combination of the two methods of positive selection employed yields 37 driver E3s across primary tumors. The size of the driver E3s correlates with their mutation frequency across TCGA samples. (c) Overlap between the lists of driver E3s identified in the study (red), annotated in the Cancer Gene Census (green) or identified in a recent analysis15 of TCGA datasets (blue).

Extended Data Fig. 7 TCGA tumors with actionable UPS alterations related to CCNE1.

(a) The bars represent the proportion of tumors in each cohort with CCNE1 alterations that could be targeted directly via CDK inhibitors (dark blue), or with alterations of FBXW7, with (medium blue) or without (light blue) increased stability of CCNE1, which could in principle be targeted indirectly. In parentheses, number of tumor samples in each cohort. (b) Mean percentage (with standard deviations as whiskers) of driver mutations in either driver E3s or driver degrons that do not occur in known cancer genes. In parentheses, number of tumor samples in each cohort.

Supplementary information

Supplementary Information

Supplementary Note

Supplementary Table

Supplementary Tables 1–6

Supplementary Data

Raw files containing proteome-wide annotated matches of degron motifs, degrons and E3s under positive selection. A README file contains a detailed description of the files enclosed within the zip file.


Martínez-Jiménez, F., Muiños, F., López-Arribillaga, E. et al. Systematic analysis of alterations in the ubiquitin proteolysis system reveals its contribution to driver mutations in cancer. Nat Cancer 1, 122–135 (2020). https://doi.org/10.1038/s43018-019-0001-2
