Abstract
Functional interpretation of genomic variation is critical to understanding human disease, but it remains difficult to predict the effects of specific mutations on protein interaction networks and the phenotypes they regulate. We describe an analytical framework based on multiscale statistical mechanics that integrates genomic and biophysical data to model the human SH2-phosphoprotein network in normal and cancer cells. We apply our approach to data in The Cancer Genome Atlas (TCGA) and test model predictions experimentally. We find that mutations mapping to phosphoproteins often create new interactions but that mutations altering SH2 domains result almost exclusively in loss of interactions. Some of these mutations eliminate all interactions, but many cause more selective loss, thereby rewiring specific edges in highly connected subnetworks. Moreover, idiosyncratic mutations appear to be as functionally consequential as recurrent mutations. By synthesizing genomic, structural and biochemical data, our framework represents a new approach to the interpretation of genetic variation.
Acknowledgements
This work was supported by US National Institutes of Health grants GM68762, GM107618 and GM072872. We used the resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under contract DE-AC02-05CH11231.
Author information
Contributions
All authors conceived and designed the study. M.A., G.K. and P.K.S. wrote the manuscript. M.A. developed the mathematical model. G.K. performed the experiments. All authors discussed and interpreted the results.
Ethics declarations
Competing interests
G.M. is an employee of and shareholder in Merrimack Pharmaceuticals, Inc. P.K.S. is also a stockholder in Merrimack Pharmaceuticals, Inc., and is chair of its scientific advisory board.
Integrated supplementary information
Supplementary Figure 1 Details of the multiscale statistical mechanical framework.
Ensembles of all levels of the MSM hierarchy are shown (left), along with their underlying mathematical forms (right). At the residue level, the energies of residue-residue interactions (E) are estimated using sparse reconstruction techniques from sequence (S), biochemical (B) and structural (D) data. At the domain level, ensembles comprise bound and unbound SH2-pY complexes (green and blue circles, respectively; edges denote bound domains and sites), and their energies are computed using E. At the protein level, ensembles comprise all permissible bound states and the unbound state. Two proteins are depicted, one with two SH2 domains and one tyrosine phosphosite and another with two tyrosine phosphosites. Interaction probabilities between multi-domain and multi-site proteins are computed using the domain-level Hamiltonian. Finally, at the network mutation level, ensembles comprise states made up of two partitions, one representing the pre-mutation state and the other the post-mutation state. The ensembles of interest comprise causal and qualitative gain-of-function (shown) and loss-of-function states (see also Supplementary Fig. 2). Two proteins are depicted, one with a single SH2 domain and the other with two tyrosine phosphosites, one of which is mutated (pink). The probability that a random disease-related mutation disrupts or forms an interaction in a protein-protein network is computed using protein-level probabilities and empirical disease-specific mutation frequencies (Pgeno). The entire framework is described by a single equation. Full derivations of mathematical expressions appear in the Supplementary Note.
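The domain- and protein-level ensembles described above can be illustrated with a minimal Boltzmann-weighting sketch. This is not the authors' Hamiltonian, which is derived in the Supplementary Note. The sketch assumes energies expressed in units of kT, an unbound reference energy of zero, and (as a simplification) that at most one domain-site pair is bound at a time. The function names and example energies are hypothetical.

```python
import math


def domain_binding_probability(energy_kt):
    """Two-state ensemble (bound vs. unbound) for one SH2 domain and one pY site.

    energy_kt is the assumed binding energy in units of kT; the unbound state
    is the zero-energy reference.
    """
    weight = math.exp(-energy_kt)
    return weight / (1.0 + weight)


def protein_binding_probability(pair_energies_kt):
    """Ensemble of the unbound state plus one bound state per domain-site pair.

    Assumption for illustration only: bound states with several simultaneous
    domain-site contacts are ignored.
    """
    weights = [math.exp(-e) for e in pair_energies_kt]
    partition = 1.0 + sum(weights)  # the 1.0 is the unbound state
    return sum(weights) / partition


# Hypothetical energies for a protein pair with two possible domain-site contacts.
example_energies = [-1.0, 0.5]
p_protein = protein_binding_probability(example_energies)
```

In this toy form, adding a second permissible bound state can only raise the interaction probability, which is the qualitative behavior expected when multi-domain and multi-site proteins are treated as a single ensemble.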
Supplementary Figure 2 States of the network mutation ensemble.
(a) All possible partitioned states of the network mutation ensemble are shown. Each state simultaneously captures the pre- and post-mutation states of a protein-protein interaction. Two proteins are depicted, one with a single SH2 domain and the other with two tyrosine phosphosites, one of which is mutated (pink). Edges denote bound domains and sites. (b) Subsets of the ensemble in a that formally correspond to a causal change (blue shading) and a qualitative change (green shading). See the Supplementary Note for mathematical details.
Supplementary Figure 3 Cross-validation and sensitivity analysis of MSM/D.
(a) Binding data for SH2-peptide interactions were divided into n subsets (outer cross-validation (CV) sets). For each outer CV set, the remaining (n – 1) subsets were combined and then divided into m subsets (inner CV sets). For each inner CV set, the remaining (m – 1) subsets were combined and used to train model parameters. Performance was tested on the held-out inner CV set to identify the best metaparameter values. The held-out inner CV set was then recombined with the m – 1 subsets, and model parameters were retrained using the previously identified metaparameter values. Final performance was tested on the held-out outer CV set, and the values from all runs of this procedure were combined to yield the final performance numbers. The values of m and n varied depending on the test. See the Supplementary Note for more information. (b) The ROC curves for MSM/D on all outer CV sets (n = 8) used in testing overall performance are shown. (c) AUC performance of MSM/D as a function of the λ metaparameter is shown for all outer CV sets (n = 8) used in testing overall performance. (d) The AUCs of MSM/D in predicting held-out SH2 domains are plotted as a function of the domains’ sequence identity to the closest homolog in the training set. A histogram of AUC values is overlaid on the y axis. For outer CV sets, n was set to 8, and, for inner CV sets, m was set to 7. On average, there were 13 held-out SH2 domains in each outer CV set. (e) AUCs of MSM/D, normalized to the maximum AUC achieved when MSM/D is trained on all the data (using nested cross-validation), are plotted as a function of the percentage of the data used in training MSM/D.
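The nested cross-validation procedure in panel a can be sketched as follows. This is a generic illustration rather than the code used for MSM/D: the helper names, the toy shrunken-mean model and the choice to average inner-fold scores when selecting the metaparameter are all assumptions made for the example.

```python
import random


def split_folds(items, k, seed=0):
    """Shuffle items and deal them into k roughly equal folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(data, n_outer, m_inner, lambdas, train, score):
    """Return one held-out score per outer fold.

    train(rows, lam) -> model; score(model, rows) -> float (higher is better).
    """
    outer_scores = []
    outer_folds = split_folds(data, n_outer)
    for i, outer_test in enumerate(outer_folds):
        # Combine the remaining (n - 1) outer subsets, then split into m inner sets.
        outer_train = [r for j, f in enumerate(outer_folds) if j != i for r in f]
        inner_folds = split_folds(outer_train, m_inner, seed=i + 1)

        def inner_score(lam):
            # Average performance over held-out inner sets for one metaparameter value.
            total = 0.0
            for a, inner_test in enumerate(inner_folds):
                inner_train = [r for b, f in enumerate(inner_folds) if b != a for r in f]
                total += score(train(inner_train, lam), inner_test)
            return total / m_inner

        best_lam = max(lambdas, key=inner_score)
        # Retrain on all inner data with the selected value, then test on the outer set.
        model = train(outer_train, best_lam)
        outer_scores.append(score(model, outer_test))
    return outer_scores


# Toy stand-ins (hypothetical): a shrunken mean as the "model", negative MSE as the score.
def train_shrunken_mean(rows, lam):
    return sum(y for _, y in rows) / (len(rows) + lam)


def neg_mse(model, rows):
    return -sum((y - model) ** 2 for _, y in rows) / len(rows)


toy = [(i, 1.0 + 0.1 * (i % 3)) for i in range(56)]
scores = nested_cv(toy, n_outer=8, m_inner=7, lambdas=[0.0, 1.0, 10.0],
                   train=train_shrunken_mean, score=neg_mse)
```

The key property is that the outer held-out set is never seen during metaparameter selection, so the outer scores estimate performance on unseen data.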
Supplementary Figure 4 Human SH2 phosphosignaling network.
MSM/P predictions of all interaction probabilities between SH2 proteins and phosphoproteins with p > 0.85 are shown as a directed network, with edges coming out of SH2 proteins and into phosphoproteins. The network was generated using a spring-electrical embedding in the Mathematica software package.
Supplementary Figure 5 Statistics and outliers of the human SH2 phosphosignaling network.
In-degree (top) and out-degree (middle) distributions of phosphoproteins and SH2 proteins, respectively, as well as the joint in/out-degree distribution (bottom) of proteins containing both phosphosites and SH2 domains, were computed. All distributions were thresholded at three values (p > 0.55, 0.70 and 0.85). Outliers are highlighted by small arrows. Owing to the asymmetric nature of SH2-phosphoprotein interactions (edges come out of SH2 domains and into pY peptide sites), in- and out-degree distributions represent distinct biological properties of the network. At all thresholds tested, the average out-degree of SH2 proteins is ~10 times higher than the average in-degree of phosphoproteins. Thus, each SH2 protein interacts with a large number of phosphoproteins (up to 25 at p > 0.85) but not vice versa. The majority of SH2 proteins that also contain pY sites behaved strictly as SH2 domains. A minor fraction of these proteins (~5% at p > 0.85), however, exhibited high in- and out-degrees across all probability thresholds, including ABL1, SOCS3, PTPN6 and PTPN11, suggesting that these proteins function as sites of cross-talk and signal integration.
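As a small illustration of how thresholded in- and out-degrees of this kind can be tallied, the sketch below counts edges in a directed network whose edges run from SH2 proteins to phosphoproteins. The protein labels and probabilities are invented for the example and are not taken from the supplementary tables.

```python
from collections import Counter


def degree_counts(edge_probs, threshold):
    """Count out-degrees of SH2 proteins and in-degrees of phosphoproteins.

    edge_probs maps (sh2_protein, phosphoprotein) -> predicted interaction
    probability; only edges with probability above the threshold are kept.
    """
    out_deg, in_deg = Counter(), Counter()
    for (sh2, phospho), p in edge_probs.items():
        if p > threshold:
            out_deg[sh2] += 1
            in_deg[phospho] += 1
    return out_deg, in_deg


# Hypothetical toy network.
toy_edges = {
    ("SH2_A", "PP_1"): 0.90,
    ("SH2_A", "PP_2"): 0.72,
    ("SH2_A", "PP_3"): 0.60,
    ("SH2_B", "PP_1"): 0.88,
}

for t in (0.55, 0.70, 0.85):
    out_deg, in_deg = degree_counts(toy_edges, t)
    print(t, dict(out_deg), dict(in_deg))
```

A protein that carries both SH2 domains and pY sites would appear in both counters, which is how a joint in/out-degree distribution could be assembled.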
Supplementary Figure 6 General tumor network.
MSM/N predictions of all perturbed interactions in cancer (green and yellow arrows) with p > 0.52 (without factoring in genomic prior) are overlaid on the wild-type SH2 phosphosignaling network (gray edges, each representing an interaction with p > 0.85 probability, as in Supplementary Fig. 4). Four outliers (GRAP2, BLK, BCAR3 and SH2D1B) were removed to facilitate visualization. The network was generated using a spring-electrical embedding in the Mathematica software package.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–6 and Supplementary Note. (PDF 3258 kb)
Supplementary Table 1
Raw data of domain-level interactions used in training MSM/D. (XLSX 1923 kb)
Supplementary Table 2
Predicted protein-level interaction probabilities for SH2-phosphoprotein network. (XLSX 19318 kb)
Supplementary Table 3
Predicted perturbation probabilities for every COSMIC mutation. (XLSX 30057 kb)
Supplementary Table 4
PEMs for all human SH2 domains. (XLSX 449 kb)
Supplementary Data Set 1
Predicted perturbed protein-level interaction probabilities for general tumor network. (ZIP 5734 kb)
Supplementary Data Set 2
Predicted perturbed protein-level interaction probabilities for large intestine tumor network. (ZIP 4804 kb)
Supplementary Data Set 3
Predicted perturbed protein-level interaction probabilities for lung tumor network. (ZIP 4580 kb)
Supplementary Data Set 4
Predicted perturbed protein-level interaction probabilities for endometrium tumor network. (ZIP 4363 kb)
Supplementary Data Set 5
Predicted perturbed protein-level interaction probabilities for skin tumor network. (ZIP 2246 kb)
Supplementary Data Set 6
Predicted perturbed protein-level interaction probabilities for liver tumor network. (ZIP 1642 kb)
Supplementary Data Set 7
Predicted perturbed protein-level interaction probabilities for breast tumor network. (ZIP 2333 kb)
Supplementary Data Set 8
Predicted perturbed protein-level interaction probabilities for hematopoietic and lymphoid tumor network. (ZIP 891 kb)
Supplementary Data Set 9
Predicted perturbed protein-level interaction probabilities for ovary tumor network. (ZIP 1690 kb)
Supplementary Data Set 10
Predicted perturbed protein-level interaction probabilities for esophagus tumor network. (ZIP 1551 kb)
Supplementary Data Set 11
Predicted perturbed protein-level interaction probabilities for kidney tumor network. (ZIP 1506 kb)
Supplementary Data Set 12
Predicted perturbed protein-level interaction probabilities for urinary tract tumor network. (ZIP 1141 kb)
Supplementary Data Set 13
Predicted perturbed protein-level interaction probabilities for prostate tumor network. (ZIP 955 kb)
Supplementary Data Set 14
Predicted perturbed protein-level interaction probabilities for CNS tumor network. (ZIP 478 kb)
Supplementary Data Set 15
Predicted perturbed protein-level interaction probabilities for autonomic ganglia tumor network. (ZIP 508 kb)
Supplementary Data Set 16
Predicted perturbed protein-level interaction probabilities for upper aerodigestive tract tumor network. (ZIP 505 kb)
Supplementary Data Set 17
Predicted perturbed protein-level interaction probabilities for pancreas tumor network. (ZIP 470 kb)
Supplementary Data Set 18
Predicted perturbed protein-level interaction probabilities for stomach tumor network. (ZIP 523 kb)
Supplementary Data Set 19
Predicted perturbed protein-level interaction probabilities for cervix tumor network. (ZIP 342 kb)
Supplementary Data Set 20
Predicted perturbed protein-level interaction probabilities for salivary gland tumor network. (ZIP 242 kb)
Supplementary Data Set 21
Predicted perturbed protein-level interaction probabilities for bone tumor network. (ZIP 351 kb)
Supplementary Data Set 22
Predicted perturbed protein-level interaction probabilities for thyroid tumor network. (ZIP 288 kb)
Supplementary Data Set 23
Predicted perturbed protein-level interaction probabilities for meninges tumor network. (ZIP 286 kb)
Supplementary Data Set 24
Predicted perturbed protein-level interaction probabilities for adrenal gland tumor network. (ZIP 230 kb)
Supplementary Data Set 25
Predicted perturbed protein-level interaction probabilities for soft-tissue tumor network. (ZIP 287 kb)
About this article
Cite this article
AlQuraishi, M., Koytiger, G., Jenney, A. et al. A multiscale statistical mechanical framework integrates biophysical and genomic data to assemble cancer networks. Nat Genet 46, 1363–1371 (2014). https://doi.org/10.1038/ng.3138
DOI: https://doi.org/10.1038/ng.3138
This article is cited by
-
Biophysical prediction of protein–peptide interactions and signaling networks using machine learning
Nature Methods (2020)
-
Predicting Antigenicity of Influenza A Viruses Using biophysical ideas
Scientific Reports (2019)
-
Typing tumors using pathways selected by somatic evolution
Nature Communications (2018)
-
Network spectra for drug-target identification in complex diseases: new guns against old foes
Applied Network Science (2018)
-
The recurrent architecture of tumour initiation, progression and drug sensitivity
Nature Reviews Cancer (2017)