Systematic analysis of somatic mutations impacting gene expression in 12 tumour types

We present a novel hierarchical Bayes statistical model, xseq, to systematically quantify the impact of somatic mutations on expression profiles. We establish the theoretical framework and robust inference characteristics of the method using computational benchmarking. We then use xseq to analyse thousands of tumour data sets available through The Cancer Genome Atlas, to systematically quantify somatic mutations impacting expression profiles. We identify 30 novel cis-effect tumour suppressor gene candidates, enriched in loss-of-function mutations and biallelic inactivation. Analysis of trans-effects of mutations and copy number alterations with xseq identifies mutations in 150 genes impacting expression networks, with 89 novel predictions. We reveal two important novel characteristics of mutation impact on expression: (1) patients harbouring known driver mutations exhibit different downstream gene expression consequences; (2) expression patterns for some mutations are stable across tumour types. These results have critical implications for identification and interpretation of mutations with consequent impact on transcription in cancer.

We labeled the P(TSG) of the 30 negative control genes (blue colour), as well as 21/23 predicted cis-effect tumour suppressor genes with high P(TSG) (orange colour). (b) The proportion of loss-of-function mutations distribution. The two vertical lines represent the estimated success rates of the two binomial distributions (blue: not tumour suppressor genes, orange: tumour suppressor genes). (c) The P(OCG) (oncogene) posterior distributions. We also labeled the 30 negative control genes (blue colour), as well as 20 selected oncogenes for illustration purpose. (d) The proportion of recurrent mutation distribution. The two vertical lines represent the estimated success rates of the two binomial distributions (blue: not oncogene, orange: oncogene).
Supplementary Fig. 15: Mixture-of-Binomial predictions concordance with the collected bona fide driver gene. (a,d) Concordance with bona fide driver genes as a function of classifiers' probabilities. (b,e) The number of genes above a given classifier probability threshold. (c,f) The number of predicted bona fide genes above a given classifier probability threshold. The first row (a-c) is for P(TSG) and the second row (e-f) is for P(OCG). Here we used a probability threshold of 0.2 to call gene harbouring enriched loss-of-function mutations or hot-spot mutations. If we set the threshold below 0.2, the number of predictions increase abruptly. If we used a more conservative threshold, e.g., 0.5, the results did not change much: 51/65 cis genes had P(TSG) >= 0.2, and 47/65 genes had P(TSG) >= 0.5.  AMOTL1  FAM120A  MED23  TOX4  ATF6  ESPL1  PLEKHM3  RIC8A  SEC63  ZC3H18  DDX27  RHOT1  AMOT  CELSR2  YLPM1  CEP290  GIGYF2  WDFY3  UBQLN1  MYOF  EIF2C1  MAZ  MAGEA4  PARD3  RB1CC1  CTNND1  MARK3  ITCH  TPP2       Supplementary Fig. 29: The trans-effects of bimodal distribution mutations. The heatmaps of the genes connected to the significantly mutated genes whose mutations showed bimodal distributions in xseq probabilities P(F).

POLQ_UCEC
Supplementary In summering the results, we only keep the gene which have at least 5 mutations in both the discovery and validation datasets in repeated cross-validation analysis. We also compare the cross-validation predictions with the predicted genes from the original analysis. Here only copy number alteration were reported. The copy number calls for METABRIC data were from a hidden Markov model based approach HMM-dosage 7, 8 , instead of GISTIC 9 used for the TCGA data. HMM-dosage is more conservative to call copy number homozygous deletions and amplifications based on the mutation frequency. The sensitivity issue of HMM-dosage partially explains why some genes were not predicted to have high probability in the METABRIC data simply because of lacking copy number calls.

Supplementary
1 Supplementary Discussions

Additional 17 genes with cis-effect loss-of-function mutations
Three novel predicted genes (AMOT, AMOTL1, and ITCH) encode proteins of the Hippo signalling pathway, involved in restraining cell division and promoting apoptosis. AMOT and AMOTL1 are the angiomotin family proteins. The AMOT family proteins physically interact with and inhibit YAP/TAZ, a transcription co-activator which plays a crucial role in organ size control by promoting cell proliferation and inhibiting apoptosis 10 . In addition, the AMOT family proteins can activate LATS2, a protein kinase which belongs to the LATS tumor suppressor family 11 . Note that AMOT encodes two proteins isoforms, and AMOT-p80 positively regulates MARK signalling pathway 12 .
ITCH is an anti-proliferation protein 13 . ITCH ubiquitinates c-FLIP, an inhibitor of tumour suppressor gene caspase-8, and induces its proteasomal degradation 13 .
Three genes (CTNND1, CELSR2, and PARD3) encode adhesion proteins, which play roles in metastasis. CTNND1 (cadherin-associated Src substrate) deletions in mice lead to tumor development 14 . It has been shown to be down-regulated in several cancer types 15 . Both CTNND1 and CELSR2 (cadherin, EGF LAG seven-pass G-type receptor 2) are part of the Wnt signaling pathway. It has been shown that CELSR2 was down-regulated in breast cancer cell lines and tumors. It is postulated that CELSR2 is a receptor and involved in cell adhesion and receptor-ligand interactions 16 . PARD3 (Par-3 Family Cell Polarity Regulator) is in the Signalling by TGF-beta Receptor Complex(R) pathway. This gene is deleted in both cell lines and primary tumors from squamous carcinomas and glioblastomas 17 , and reconstituting PARD3 expression restores tight conjunction and retards contact-dependent proliferation 17 .
Five genes (MED23, RB1CC1, YLPM1, EIF2C1, and MAZ) encode proteins that regulate transcription. MED23 encodes a protein that is a subunit of the Mediator complex, a co-activator involved in the regulated transcription of nearly all RNA polymerase II-dependent genes. MED23 is down-regulated in esophageal squamous cell carcinoma 18 , and its expression is negatively correlated with tumour size and clinical stages. Moreover, MED23 up-regulation significantly inhibits cell growth, and down-regulation promotes tumorigenecity in vitro and in vivo 18 . RB1CC1 is recurrently mutated in endometrial cancer and high-grade serous ovarian cancer. RB1CC1 encodes a protein that regulates RB1 expression 19 , and positively regulates transforming growth gactor-β signaling 20 . YLPM1 plays roles in telomerase activity during differentiation of embryonic stem cells. Genomic copy number deletions of YLPML are associated with poor prognosis in colorectal cancer 21 . EIF2C1 (AGO1, Argonaute RISC Catalytic Component) plays a role in RNA interference, and when over-expressed, slows down cell cycle, decreases cellular motility, and promotes apoptosis in neuroblastoma cell lines 22 . However, the functions of this gene may be cell type dependent because knockdown of EIF2C1 by siRNA also inhibites cell cycle progression in prostate cell lines 23 . MYC-accociated zinc finger protein (MAZ) is a transcription factor, which suppresses the expression of c-myc oncogene in COS cells and actives the expression of tissue specific genes 24 . This protein is a growth suppressor and suppresses the G1 phase of the cell cycle 25 . Down-regulation of MAZ results in reduced expression and promoter methylation of the tumour suppressor gene MLH1 26 . In contrast to many tumour suppressor genes, this gene is amplified and up-regulated in many cancer types.
SEC63, ATF6, and UBQLN1 are involved in protein processing in the endoplasmic reticulum. SEC63 encodes a protein that is associated with ribosome-free SEC61 complex. Down-regulation of SEC63 correlates significantly with decreased apoptosis and increased proliferation following exposure to diethylnitrosamine 27 . ATF6 (Activating transcription factor 6) regulates apoptosis in its active form 28 . As circumventing apoptosis is a hallmark of cancer, the gene is likely to be a novel tumour suppressor gene. The roles of UBQLN1 in cancer have been identified very recently 29 . Loss of UBQLN1 causes increased cell migration and invasion, actin cytoskeleton reorganization, and induction of epithelial-tomesenchymal transition 29 . Loss of UBQLN1 results in a significant decrease of E-cadherin gene expression. It is worth noticing that UBQLN1 and RIC8A (below) are in the same protein complex.

Additional 26 novel genes with trans-effect mutations
Six genes play roles in cell proliferation regulation -VEGFA, VEGFC, IRS1, INSR, E2F3, and BCL2L1 (Supplementary Data 5). VEGFA (vascular endothelial growth factor A) and VEGFC (vascular endothelial growth factor C) has many roles in cancer, in angiogenesis, vascular permeability, and in immune cells presented in the tumour environment. IRS1 (nsulin Receptor Substrate 1) deficiency promotes apoptosis and inhibits the growth of breast cancer cells. However, it has also been demonstrated that suppression of IRS1 promotes mammary tumour metastasis. Insulin receptor (INSR) play a key role in glucose homeostasis regulation, whose under degenerate conditions may cases diabetes and cancer. The E2F transcription factors regulate various biological processes, e.g., cell cycle regulation, DNA repair, apoptosis, centrosome duplication and differentiation 38 . Specifically, E2F3 down-regulation resultes in cell death and delays/blocks in cytokinesis 38 , consistent with E2F3 amplification in various tumour types. BCL2L1 (BCL2-Like 1) is in chromosome 20 and amplified in various tumours. It has been shown that of the 3 genes in a amplicon over expressed in human embryonic stem cells, only over expression of BCL2L1 provides control cells with growth characteristics similar to BCL2L1 copy number amplified cells 39 . In addition, inhibition of BCL2L1 protein expression results in strong reduction the growth rate of human embryonic stem cells 39 .
Twelve genes encode proteins that play roles in apoptotic process (PSMD1, PSMD13, ERN1, MCL1, PRKCQ, TJP1, MELK, MNDA, NET1, PHB, SULF1, and BCL2L1). PSMD1 and PSMD13 are members of the 26S proteasome non-ATPase regulatory subunit complex, and inhibition of the 26S proteasome induces apoptosis, and limits growth of human pancreatic cancer 40 . Suppression of ERN1 leads to reduced tumour growth through inhibition of angiogenesis and proliferation 41 . Myeloid cell leukemia 1 (MCL1) is amplified and over-expressed in many tumour types. Moreover, over expression of MCL1 is the resistance factor for several BCL2 family inhibitors 42 . It has been shown that MCL1 inhibitor has sufficient potency to kill a variety of cancer cells 42 . PRKCQ (Protein Kinase C, Theta) is required for the growth of triple negative breast cancer cells 43 . Even over-expression of PRKCQ is sufficient to drive oncogenic, growth factor-independent growth, survival and migration 43 . Down-regulation of TJP1 (tight junction protein 1) by its direct regulator miR-105 destroys vascular endothelial barriers to promote metastasis in cancer cells 44 . MELK is an oncogenic kinase essential for mitotic progression in basal-like breast cancer cells 45 . NET1 is a RhoA guanine exchange factor. NET1 up-regulation in gastric cancer drives the invasive phenotype 46 . PHB acts as a tumour suppressor and AR co repressor in prostate cancer 47 . SULF1 has been shown to promote tumorigenicity in ovarian cancer 48 .
Fifteen genes play roles in transcription regulation (ZNF217, SMARCA2, CBX8, MAF1, BZW1, ZBTB7B, EEF1A1, MSL3, PHB, MNDA, E2F3, PCGF3, ERN1, PRKCQ, and INSR). ZNF217 (Zinc Finger Protein 217) is a candidate breast cancer driver gene, and ZNF217-transduced cultures produce immortalized cells 49 . SMARCA2 encodes the core catalytic unit of the SWI/SNF ATP-dependent chromatin remodelling complex. CBX8 (chromodomain-containing protein 8) is an essential component of the polycomb repressive complexes 1, which directly regulates numerous target gene expression involved in cell-fate decisions 50 . MAF1 homolog (S. cerevisiae) is a down stream effector of the PTEN/PI3K signalling pathway and it plays a key role in this pathway to regulate lipid metabolism and oncogenesis 51 . BZW1 has been shown to promote growth of salivary muocepodermoid carcinoma 52 . ZBTB7B up-regulation contributes to breast cancer development 53 . It may be a cell-cycle modulator by regulating CDK2 and E2F4 54 . EEF1A1 is a substrate of the Type I TGF-β receptor 55 , and mutations of Ser300 affect TGF-β dependency in inhibition of protein synthesis and cell proliferation 55 .

Details of the message passing steps in xseq
To simplify our description of the inference algorithm, we use a simple example with just one mutated gene, and H is given. In addition, this gene is mutated in M patients and this gene is connected to N genes. Formally, at the first phase, message collection phase, the bottom-up messages (with superscript (−)) consist of (in sequential order): Once the root D receives all the incoming messages, it can update its belief by normalizing the product of these messages:

Details of parameter learning
Assume that we have the mutation data for T genes, and gene g is mutated in M g patients, and gene g connects to N g genes. We first introduce some variables to make the presentation easier. We define θ D i = θ D=i , similarly, θ F j |D i = θ F= j|D=i , and θ G k |F j ,H l = θ G=k|F= j,H=l . Notice that we assume these parameters are the same for different genes. This assumption is necessary otherwise we do not have enough data to robustly estimate model parameters. Then we define hidden variable Z g D i = 1 if and only if D g = i. Similarly, indicator variable Z g,m F j |D i = 1 if and only if F g,m = j and D g = i. Indicator variable Z g,m,n G k |F j ,H l = 1 if and only if G g,m,n = k, F g,m = j and H g,n = l. Finally, we use D g = 1 to denote that the mutations in g impact expression, and D g = 0 means not; F g,m = 1 means that the mutation in gene g of patient m impacting expression, and F g,m means not; H g,n = L means that the nth gene connected to gene g is down-regulated across patients, and H g,n = G means that the nth gene connected to gene g is up-regulated across patients. Similarly, G g,m,n ∈ L, N , G , where L, N , G represents down-regulation, neutral and up-regulation of expression, respectively.
We then take the expectation of the complete log-likelihood function to get log θ Z g,m,n G k |F j ,H l G k |F j ,H l + log(p(Y g,m,n | G g,m,n )) whereN Z D i = T g=1 P(D g = i|D) (13) N Z G k |F j ,H l = T g=1 M g m=1 N m,g n=1 P(G g,m,n = k, F g,m = j, H g,n = l|D) These terms can be computed by using the belief propagation algorithm introduced in the previous section. The complete data log-likelihood expectation function in Equation 12 can be maximized by solving a constraint optimization problem to get the maximization steps: Seudocounts can be added to the above maximization steps to do maximum a posteriori estimation of parameters.

Adding constraints in learning xseq parameters
Parameter learning is a challenge problem in using Bayesian networks to solve real world problems. When the training data are incomplete or the Bayesian networks have multiple hidden nodes, parameter learning is extremely difficult, and learning algorithms can easily get trapped into local maxima. One solution to mitigate these local maximum problems is to add prior distributions for parameters and use maximum a posterior estimation of parameters. However, we still need to set the hyper-parameters of the prior distributions. These parameters may be data dependent and difficult to set. Here in addition to selecting good hyper-parameters for the prior distributions, we also constrain the parameter space in learning xseq conditional distributions. More specifically, in We next used the xseq-simple model to analyze the simulation datasets ( Supplementary  Fig. 4). Generally speaking, the xseq-simple performed quite well. Compared to xseq predictions given the true values of H variables, we could see the drop in performance, especially for the most challenging case (Supplementary Fig. 3 and Supplementary Fig. 4, bottom-right ROC curves). Because xseq-simple does not model the H variables, it substituted the two conditional distributions θ G=L|F=1,H=L and θ G=L|F=1,H=G with a single conditional distribution θ G=L|F=1 . We could see this phenomenon from the parameter trace plots during EM algorithm iterations (Supplementary Fig. 5).
For completeness, we also did a simulation analysis of the cis-effects of somatic mutations ( Supplementary Fig. 6). It is rare for a gene to be up-regulated across patients in the presence of loss-of-function mutations (i.e., H = G). Since these events are used to estimate the parameters θ G|F=1,H=G , we may not have enough events to robustly estimate these parameters. By contrast, the xseq-simple model (which does not have the H variable) is sufficient for modelling the cis-effects of mutations. Therefore, we sampled data from the xseq-simple model for cis-analysis (using only the most challenging hyper-parameter regime). Compared to trans-analysis, it took longer for xseq to converge (more than 100 iterations), likely due to only one connected gene available for inference. Overall, results were similar to those obtained for trans simulations.

Downloading the Pan-Cancer datasets
We obtained the somatic mutation calls from the Synapse platform (syn1729383 58 ). Affymetrix SNP6.0 copy number log2 values, GISTIC copy number calls, and RNASeq gene expression data were downloaded from syn300013 59 .

Mutation probabilities in genes show bimodal distributions
Here we analyzed the mutations in the 127 significantly mutated genes predicted by MuSiC 60 to detect the genes whose xseq mutation probabilities P(F) show bimodal distributions in at least one tumour type. For the mutations of a significantly mutated gene in a tumour type, the dip-test 61 was used to test for unimodality of the xseq mutation probability distribution (P(F)). The genes with FDR ≤ 0.2 (Benjamini-Hochberg procedure 62 ) were shown in Supplementary Table 8. Finally, for the genes given in Supplementary Table 8, we drew heatmaps as given in Supplementary Fig. 29 to show the heterogeneity in gene expression profiles.

Methods comparison 2.5.1 Comparison with CONEXIC
We ran CONEXIC algorithm on the TCGA breast cancer copy number and RNASeq expression data. We first used the default parameter setting and initialized the clustering using K-means with K = 50 clusters. The candidate driver CNAs were the focal copy number deletions and amplifications from the Pan-Cancer analysis 63 . Because of the large number of candidate driver genes (focal copy number alterations in 2891/3044 genes which have Entrez gene symbols), CONEXIC generated 1385 modules, with median module size of 6 and modulator number of 3. These modules selected uniquely 313 modulators, and 21/313 genes were putative cancer driver genes (Supplementary Table 6). Another 46/292 genes were directly connected partners of the bona fide cancer driver genes.
We next used the 131 'target' genes of 140 focal copy number alterations predicted by GISTIC 63 as the candidate driver genes, and first initialized CONEXIC by single module analysis. The single module analysis generated 62 modules. We then ran CONEXIC ModuleNetwork with this initial clustering. This analysis generated 1136 modules, with median module size of 7 and modulator number of 2. These modules selected uniquely 72 modulators, and 27/72 genes were bona fide cancer driver genes (Supplementary Table 6). Another 12/45 genes were directly connected partners of the bona fide cancer driver genes. Sixteen of the 72 genes overlapped with xseq predictions: CBX8, CCND1, CCNE1, CDKN2A, CYC1, E2F3, EGFR, ERBB2, KDM5A, MCL1, MRPS28, NEDD9, PPP2R2A, PTEN, SMARCA2, and STK11.
Compared to CONEXIC, xseq is more specific while less sensitive. As can be seen from Supplementary Table 6, 26/40 genes predicted by xseq were either bona fide driver genes or directly connected parters of bona fide driver genes, while 67/313 CONEXIC predicted gene were either bona fide driver genes or directly connected partners of bona fide driver genes (p-value ≤ 0.01). As an unsupervised method, CONEXIC can be powerful in prediction cancer driver modulators and modules when initialized properly, e.g., 39/72 genes were predicted as putative cancer genes when the 131 GISTIC focal copy number targets were used as inputs. For xseq, it needs 'negative' examples to learn model parameters, so we simply counted the overlap between the 40 predicted gene and the 131 GISTIC targets of focal copy number alterations, and found