A multiscale statistical mechanical framework integrates biophysical and genomic data to assemble cancer networks


Functional interpretation of genomic variation is critical to understanding human disease, but it remains difficult to predict the effects of specific mutations on protein interaction networks and the phenotypes they regulate. We describe an analytical framework based on multiscale statistical mechanics that integrates genomic and biophysical data to model the human SH2-phosphoprotein network in normal and cancer cells. We apply our approach to data in The Cancer Genome Atlas (TCGA) and test model predictions experimentally. We find that mutations mapping to phosphoproteins often create new interactions but that mutations altering SH2 domains result almost exclusively in loss of interactions. Some of these mutations eliminate all interactions, but many cause more selective loss, thereby rewiring specific edges in highly connected subnetworks. Moreover, idiosyncratic mutations appear to be as functionally consequential as recurrent mutations. By synthesizing genomic, structural and biochemical data, our framework represents a new approach to the interpretation of genetic variation.

Figure 1: Multiscale statistical mechanical (MSM) framework.
Figure 2: Assessment of domain model (MSM/D) performance.
Figure 3: Experimental validation of protein-level interactions for wild-type and mutant proteins.
Figure 4: Enrichment and analysis of cancer mutations.
Figure 5: Tissue-specific tumor networks.
Figure 6: Kidney tumor network.
Figure 7: The model enriches high-throughput experiments.
Figure 8: PEMs capture the biophysical basis of SH2 domain specificity.

Accession codes


Protein Data Bank



This work was supported by US National Institutes of Health grants GM68762, GM107618 and GM072872. We used the resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under contract DE-AC02-05CH11231.

Author information

All authors conceived and designed the study. M.A., G.K. and P.K.S. wrote the manuscript. M.A. developed the mathematical model. G.K. performed the experiments. All authors discussed and interpreted the results.

Corresponding authors

Correspondence to Mohammed AlQuraishi or Peter K Sorger.

Ethics declarations

Competing interests

G.M. is an employee of and shareholder in Merrimack Pharmaceuticals, Inc. P.K.S. is also a stockholder in Merrimack Pharmaceuticals, Inc., and is chair of its scientific advisory board.

Integrated supplementary information

Supplementary Figure 1 Details of the multiscale statistical mechanical framework.

Ensembles of all levels of the MSM hierarchy are shown (left), along with their underlying mathematical forms (right). At the residue level, the energies of residue-residue interactions (E) are estimated using sparse reconstruction techniques from sequence (S), biochemical (B) and structural (D) data. At the domain level, ensembles comprise bound and unbound SH2-pY complexes (green and blue circles, respectively; edges denote bound domains and sites), and their energies are computed using E. At the protein level, ensembles comprise all permissible bound states and the unbound state. Two proteins are depicted, one with two SH2 domains and one tyrosine phosphosite and another with two tyrosine phosphosites. Interaction probabilities between multi-domain and multi-site proteins are computed using the domain-level Hamiltonian. Finally, at the network mutation level, ensembles comprise states made up of two partitions, one representing the pre-mutation state and the other the post-mutation state. The ensembles of interest comprise causal and qualitative gain-of-function (shown) and loss-of-function states (see also Supplementary Fig. 2). Two proteins are depicted, one with a single SH2 domain and the other with two tyrosine phosphosites, one of which is mutated (pink). The probability that a random disease-related mutation disrupts or forms an interaction in a protein-protein network is computed using protein-level probabilities and empirical disease-specific mutation frequencies (Pgeno). The entire framework is described by a single equation. Full derivations of mathematical expressions appear in the Supplementary Note.
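One way to read the protein-level ensemble in this caption is as a Boltzmann-weighted sum over binding states. The sketch below is an illustration under stated assumptions, not the authors' implementation. It assumes energies in units of kT, an unbound reference state of energy zero, and that a permissible bound state is any non-empty set of domain-site pairs in which each domain and each site is used at most once. The actual Hamiltonian is derived in the Supplementary Note, and the function names here are hypothetical.

```python
import math
from itertools import combinations, permutations


def bound_states(n_domains, n_sites):
    """Yield every non-empty domain-site matching as a tuple of (domain, site) pairs.

    Assumption: each SH2 domain binds at most one site and vice versa.
    """
    for k in range(1, min(n_domains, n_sites) + 1):
        for doms in combinations(range(n_domains), k):
            for sites in permutations(range(n_sites), k):
                yield tuple(zip(doms, sites))


def interaction_probability(energies):
    """Probability that two proteins are bound in at least one configuration.

    energies[d][s] is a hypothetical domain-level binding energy in kT units
    (negative means favourable). The unbound state has energy 0, so its
    Boltzmann weight is 1.
    """
    n_domains = len(energies)
    n_sites = len(energies[0])
    z_bound = 0.0
    for state in bound_states(n_domains, n_sites):
        total_energy = sum(energies[d][s] for d, s in state)
        z_bound += math.exp(-total_energy)
    return z_bound / (1.0 + z_bound)
```

With a single domain, a single site and zero binding energy, the bound and unbound states carry equal weight, so the function returns 0.5. Adding domains or sites increases the number of bound states and therefore the interaction probability at fixed energies.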

Supplementary Figure 2 States of the network mutation ensemble.

(a) All possible partitioned states of the network mutation ensemble are shown. Each state simultaneously captures the pre- and post-mutation states of a protein-protein interaction. Two proteins are depicted, one with a single SH2 domain and the other with two tyrosine phosphosites, one of which is mutated (pink). Edges denote bound domains and sites. (b) Subsets of the ensemble in a that formally correspond to a causal change (blue shading) and a qualitative change (green shading). See the Supplementary Note for mathematical details.
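The partitioned states can be illustrated with a deliberately simplified sketch. It assumes that the pre-mutation and post-mutation binding states are independent and that each is summarized by a single protein-level probability. The formal definitions of causal and qualitative change are given in the Supplementary Note and are not reproduced here. The function name and state labels are hypothetical.

```python
def mutation_ensemble(p_wildtype, p_mutant):
    """Joint probabilities of (pre-mutation, post-mutation) binding states.

    Assumption: the two partitions are treated as independent, which is a
    simplification made only for this illustration.
    """
    return {
        ("bound", "bound"): p_wildtype * p_mutant,
        ("bound", "unbound"): p_wildtype * (1.0 - p_mutant),    # loss of interaction
        ("unbound", "bound"): (1.0 - p_wildtype) * p_mutant,    # gain of interaction
        ("unbound", "unbound"): (1.0 - p_wildtype) * (1.0 - p_mutant),
    }
```

The four entries always sum to one, and the two off-diagonal entries correspond to loss-of-function and gain-of-function states in this simplified reading.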

Supplementary Figure 3 Cross-validation and sensitivity analysis of MSM/D.

(a) Binding data for SH2-peptide interactions were divided into n subsets (outer cross-validation (CV) sets). For each outer CV set, the remaining (n – 1) subsets were combined and then divided into m subsets (inner CV sets). For each inner CV set, the remaining (m – 1) subsets were combined and used to train model parameters. Performance was tested on the held-out inner CV set to identify the best metaparameter values. The held-out inner CV set was then recombined with the m – 1 subsets, and model parameters were retrained using the previously identified metaparameter values. Final performance was tested on the held-out outer CV set, and the values from all runs of this procedure were combined to yield the final performance numbers. The values of m and n varied depending on the test. See the Supplementary Note for more information. (b) The ROC curves for MSM/D on all outer CV sets (n = 8) used in testing overall performance are shown. (c) AUC performance of MSM/D as a function of the λ metaparameter is shown for all outer CV sets (n = 8) used in testing overall performance. (d) The AUCs of MSM/D in predicting held-out SH2 domains are plotted as a function of the domains’ sequence identity to the closest homolog in the training set. A histogram of AUC values is overlaid on the y axis. For outer CV sets, n was set to 8, and, for inner CV sets, m was set to 7. On average, there were 13 held-out SH2 domains in each outer CV set. (e) AUCs of MSM/D, normalized to the maximum AUC achieved when MSM/D is trained on all the data (using nested cross-validation), are plotted as a function of the percentage of the data used in training MSM/D.
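The nested procedure in panel a can be sketched in a few lines. This is a generic outline rather than the authors' code: `fit` and `score` are hypothetical callables supplied by the user, the metaparameter is chosen by averaging the score over the inner held-out sets (one reasonable reading of the caption), and folds are formed by a simple shuffled split.

```python
import random


def split_folds(items, k, seed=0):
    """Shuffle items and deal them into k roughly equal folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]


def nested_cv(data, n_outer, m_inner, lambdas, fit, score):
    """Return one held-out score per outer CV set.

    fit(train, lam) -> model and score(model, test) -> float are
    user-supplied; higher scores are assumed to be better.
    """
    outer_folds = split_folds(data, n_outer)
    outer_scores = []
    for i, held_out in enumerate(outer_folds):
        # Combine the remaining (n - 1) outer subsets.
        rest = [x for j, fold in enumerate(outer_folds) if j != i for x in fold]
        inner_folds = split_folds(rest, m_inner, seed=i + 1)

        def inner_score(lam):
            # Mean score over the m inner held-out sets for this metaparameter.
            total = 0.0
            for a, val in enumerate(inner_folds):
                train = [x for b, fold in enumerate(inner_folds) if b != a for x in fold]
                total += score(fit(train, lam), val)
            return total / m_inner

        best_lam = max(lambdas, key=inner_score)
        # Retrain on all inner data with the selected metaparameter,
        # then evaluate once on the outer held-out set.
        model = fit(rest, best_lam)
        outer_scores.append(score(model, held_out))
    return outer_scores
```

Because the outer held-out set never influences the choice of metaparameter, the outer scores give an estimate of performance that is not biased by tuning.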

Supplementary Figure 4 Human SH2 phosphosignaling network.

MSM/P predictions of all interaction probabilities between SH2 proteins and phosphoproteins with p > 0.85 are shown as a directed network, with edges coming out of SH2 proteins and into phosphoproteins. The network was generated using a spring-electrical embedding in the Mathematica software package.

Supplementary Figure 5 Statistics and outliers of the human SH2 phosphosignaling network.

In-degree (top) and out-degree (middle) distributions of phosphoproteins and SH2 proteins, respectively, as well as the joint in/out-degree distribution (bottom) of proteins containing both phosphosites and SH2 domains, were computed. All distributions were thresholded at three values (p > 0.55, 0.70 and 0.85). Outliers are highlighted by small arrows. Owing to the asymmetric nature of SH2-phosphoprotein interactions (edges come out of SH2 domains and into pY peptide sites), in- and out-degree distributions represent distinct biological properties of the network. At all thresholds tested, the average out degree of SH2 proteins is ~10 times higher than the average in degree of phosphoproteins. Thus, each SH2 protein interacts with a large number of phosphoproteins (up to 25 at p > 0.85) but not vice versa. The majority of SH2 proteins that also contain pY sites behaved strictly as SH2 domains. A minor fraction of these proteins (~5% at p > 0.85), however, exhibited high in and out degrees across all probability thresholds, including ABL1, SOCS3, PTPN6 and PTPN11, suggesting that these proteins function as sites of cross-talk and signal integration.
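The degree statistics described here follow directly from a thresholded edge list. The sketch below assumes the predictions are available as (SH2 protein, phosphoprotein, probability) triples; that input format, the function name and the placeholder protein labels in the usage example are hypothetical and do not come from the supplementary tables.

```python
from collections import Counter


def degree_tables(edges, threshold):
    """Count out-degree of SH2 proteins and in-degree of phosphoproteins.

    edges: iterable of (sh2_protein, phosphoprotein, probability).
    Only edges with probability strictly above the threshold are kept,
    and each edge points from the SH2 protein to the phosphoprotein.
    """
    out_degree = Counter()
    in_degree = Counter()
    for sh2, phospho, p in edges:
        if p > threshold:
            out_degree[sh2] += 1
            in_degree[phospho] += 1
    return out_degree, in_degree


# Toy edge list with placeholder names and made-up probabilities.
toy_edges = [
    ("SH2_A", "PHOS_X", 0.90),
    ("SH2_A", "PHOS_Y", 0.72),
    ("SH2_B", "PHOS_X", 0.60),
    ("SH2_A", "SH2_B", 0.88),  # an SH2 protein that also carries a pY site
]
for cutoff in (0.55, 0.70, 0.85):
    out_deg, in_deg = degree_tables(toy_edges, cutoff)
    print(cutoff, dict(out_deg), dict(in_deg))
```

A protein that appears in both counters at a given threshold, such as the placeholder `SH2_B` at p > 0.55, is the kind of node the caption describes as having both in and out degree.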

Supplementary Figure 6 General tumor network.

MSM/N predictions of all perturbed interactions in cancer (green and yellow arrows) with p > 0.52 (without factoring in genomic prior) are overlaid on the wild-type SH2 phosphosignaling network (gray edges, each representing an interaction with p > 0.85 probability, as in Supplementary Fig. 4). Four outliers (GRAP2, BLK, BCAR3 and SH2D1B) were removed to facilitate visualization. The network was generated using a spring-electrical embedding in the Mathematica software package.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6 and Supplementary Note. (PDF 3258 kb)

Supplementary Table 1

Raw data of domain-level interactions used in training MSM/D. (XLSX 1923 kb)

Supplementary Table 2

Predicted protein-level interaction probabilities for SH2-phosphoprotein network. (XLSX 19318 kb)

Supplementary Table 3

Predicted perturbation probabilities for every COSMIC mutation. (XLSX 30057 kb)

Supplementary Table 4

PEMs for all human SH2 domains. (XLSX 449 kb)

Supplementary Data Set 1

Predicted perturbed protein-level interaction probabilities for general tumor network. (ZIP 5734 kb)

Supplementary Data Set 2

Predicted perturbed protein-level interaction probabilities for large intestine tumor network. (ZIP 4804 kb)

Supplementary Data Set 3

Predicted perturbed protein-level interaction probabilities for lung tumor network. (ZIP 4580 kb)

Supplementary Data Set 4

Predicted perturbed protein-level interaction probabilities for endometrium tumor network. (ZIP 4363 kb)

Supplementary Data Set 5

Predicted perturbed protein-level interaction probabilities for skin tumor network. (ZIP 2246 kb)

Supplementary Data Set 6

Predicted perturbed protein-level interaction probabilities for liver tumor network. (ZIP 1642 kb)

Supplementary Data Set 7

Predicted perturbed protein-level interaction probabilities for breast tumor network. (ZIP 2333 kb)

Supplementary Data Set 8

Predicted perturbed protein-level interaction probabilities for hematopoietic and lymphoid tumor network. (ZIP 891 kb)

Supplementary Data Set 9

Predicted perturbed protein-level interaction probabilities for ovary tumor network. (ZIP 1690 kb)

Supplementary Data Set 10

Predicted perturbed protein-level interaction probabilities for esophagus tumor network. (ZIP 1551 kb)

Supplementary Data Set 11

Predicted perturbed protein-level interaction probabilities for kidney tumor network. (ZIP 1506 kb)

Supplementary Data Set 12

Predicted perturbed protein-level interaction probabilities for urinary tract tumor network. (ZIP 1141 kb)

Supplementary Data Set 13

Predicted perturbed protein-level interaction probabilities for prostate tumor network. (ZIP 955 kb)

Supplementary Data Set 14

Predicted perturbed protein-level interaction probabilities for CNS tumor network. (ZIP 478 kb)

Supplementary Data Set 15

Predicted perturbed protein-level interaction probabilities for autonomic ganglia tumor network. (ZIP 508 kb)

Supplementary Data Set 16

Predicted perturbed protein-level interaction probabilities for upper aerodigestive tract tumor network. (ZIP 505 kb)

Supplementary Data Set 17

Predicted perturbed protein-level interaction probabilities for pancreas tumor network. (ZIP 470 kb)

Supplementary Data Set 18

Predicted perturbed protein-level interaction probabilities for stomach tumor network. (ZIP 523 kb)

Supplementary Data Set 19

Predicted perturbed protein-level interaction probabilities for cervix tumor network. (ZIP 342 kb)

Supplementary Data Set 20

Predicted perturbed protein-level interaction probabilities for salivary gland tumor network. (ZIP 242 kb)

Supplementary Data Set 21

Predicted perturbed protein-level interaction probabilities for bone tumor network. (ZIP 351 kb)

Supplementary Data Set 22

Predicted perturbed protein-level interaction probabilities for thyroid tumor network. (ZIP 288 kb)

Supplementary Data Set 23

Predicted perturbed protein-level interaction probabilities for meninges tumor network. (ZIP 286 kb)

Supplementary Data Set 24

Predicted perturbed protein-level interaction probabilities for adrenal gland tumor network. (ZIP 230 kb)

Supplementary Data Set 25

Predicted perturbed protein-level interaction probabilities for soft-tissue tumor network. (ZIP 287 kb)

Cite this article

AlQuraishi, M., Koytiger, G., Jenney, A. et al. A multiscale statistical mechanical framework integrates biophysical and genomic data to assemble cancer networks. Nat Genet 46, 1363–1371 (2014).
