Data-driven modelling of signal-transduction networks

Key Points

  • New experimental techniques are allowing the generation of complex data sets that characterize signal-transduction networks. It is no longer possible to inspect these data by intuition to extract the maximal amount of information that is embedded within them.

  • 'Data-driven models' are mathematical approaches that provide simplified representations of complex data sets. They are based solely on analysing the data itself, without having to make any assumptions about the underlying mechanisms.

  • This User's guide introduces three data-driven modelling approaches: clustering, principal components analysis (PCA), and partial least squares (PLS). Clustering provides a means for data organization, whereas PCA is a method for data condensation and PLS is a technique for data prediction.

  • Clustering groups observations together that have similar projections in the high-dimensional space defined by the signalling variables. Similarity can be defined by several different distance metrics, such as Euclidean distance (for absolute distances) and Pearson distance (for correlations).

  • PCA and PLS factorize a data set into products of paired vectors (a scores vector and a loadings vector for each principal component); the leading components correspond to the largest eigenvalues of the covariance of the data. PCA calculates scores and loadings vectors to maximize the variance that is captured from the starting data matrix. By contrast, PLS calculates scores and loadings vectors to maximize the covariance between a matrix of independent variables and a matrix of dependent variables.

  • Data-driven models are poised to become standard tools in analysing signalling networks as complex protein data sets become easier to acquire and more difficult to interpret.
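
As a minimal sketch of the contrast drawn in the Key Points between PCA and PLS, the following NumPy example computes the first scores and loadings vectors for each approach. The synthetic data, the variable names and the restriction to a single dependent variable are assumptions made for illustration and are not taken from the article; for PLS, the direction computed here is usually called a weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 observations (rows) x 5 signalling variables (columns),
# plus one dependent response per observation. All values are synthetic.
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 2.0]) + 0.1 * rng.normal(size=20)

# Mean-centre so that variance and covariance are measured about the origin.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# PCA: the first loadings vector is the leading right singular vector of Xc,
# which is also the leading eigenvector of the covariance matrix of X.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_loadings = Vt[0]                     # one weight per signalling variable
pca_scores = Xc @ pca_loadings           # one value per observation
explained = s[0] ** 2 / np.sum(s ** 2)   # fraction of total variance captured

# PLS, first component with a single response: the unit-length direction in X
# whose scores have maximal covariance with y.
pls_weights = Xc.T @ yc
pls_weights /= np.linalg.norm(pls_weights)
pls_scores = Xc @ pls_weights

# Rank-one approximation of the centred data from the first PCA component.
X1 = np.outer(pca_scores, pca_loadings)

print(f"variance captured by first principal component: {explained:.2f}")
```

In this sketch the PCA scores have the larger variance, whereas the PLS scores have the larger covariance with the response, which is the distinction the two methods are designed around.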


New technologies are permitting large-scale quantitative studies of signal-transduction networks. Such data are hard to understand completely by inspection and intuition. 'Data-driven models' help users to analyse large data sets by simplifying the measurements themselves. Data-driven modelling approaches such as clustering, principal components analysis and partial least squares can derive biological insights from large-scale experiments. These models are emerging as standard tools for systems-level research in signalling networks.

Figure 1: Alternative representations of a systems biology data set.
Figure 2: Clustering of row and column vectors by different distance metrics.
Figure 3: Principal components identified by PCA and PLS.



The work cited in this review was supported by grants from the National Institutes of Health to M.B.Y. and an American Cancer Society postdoctoral fellowship to K.A.J.

Author information

Corresponding author

Correspondence to Michael B. Yaffe.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Matrix

A table of numbers. Alternatively, a matrix can be viewed as an arrangement of row or column vectors.


Vector

A mathematical quantity that has both magnitude (or length) and direction. The entries of a vector specify the magnitude of its projection in different directions.

Linear algebra

A branch of mathematics that involves linear manipulations of vectors and matrices.


A mathematical function that can be applied to vectors and matrices.

Row vector

A vector that is composed of one entire row of a matrix with dimensions that are specified by the matrix columns.

Euclidean distance

A mathematical quantity that calculates the measurable geometric distance between two vectors pointing from a common origin.

Column vector

A vector that is composed of one entire column of a matrix with dimensions that are specified by the matrix rows.

Pearson distance

A mathematical quantity that calculates the difference in direction between two vectors pointing from a common origin.
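
The difference between the Euclidean and Pearson distances defined above can be illustrated with a short NumPy sketch. The function names and example vectors are hypothetical, and the Pearson distance is taken here to be one minus the Pearson correlation coefficient, which is a common convention but an assumption as far as this article is concerned. Note that the correlation is computed after subtracting each vector's mean.

```python
import numpy as np

def euclidean_distance(a, b):
    """Geometric distance between the tips of two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def pearson_distance(a, b):
    """One minus the Pearson correlation (assumed convention):
    0 for perfectly correlated vectors, 2 for perfectly anti-correlated."""
    r = np.corrcoef(a, b)[0, 1]
    return float(1.0 - r)

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 10.0 * a          # same pattern as a, much larger magnitude
c = a[::-1].copy()    # same magnitudes as a, opposite pattern

print(euclidean_distance(a, b), pearson_distance(a, b))
print(euclidean_distance(a, c), pearson_distance(a, c))
```

Vectors a and b are far apart by Euclidean distance but essentially identical by Pearson distance, whereas a and c are close by Euclidean distance but maximally distant by Pearson distance.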

k-means clustering

A clustering technique in which observations are grouped into a fixed, pre-specified number of clusters, each of which is represented by its centre (centroid).
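
A minimal sketch of k-means clustering, assuming the standard alternation between assigning observations to the nearest centroid and recomputing each centroid as the mean of its members. The function name, its parameters and the synthetic two-group data are hypothetical and are not taken from the article.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Group the rows of X into k clusters using Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids at k distinct, randomly chosen observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()

    def assign(centres):
        # Distance from every observation to every centroid, then nearest.
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        return d.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Move each centroid to the mean of its members; an empty cluster
        # keeps its previous centroid.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign(centroids), centroids

# Synthetic example: two well-separated groups of ten observations each.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 2)),
    rng.normal(5.0, 0.1, size=(10, 2)),
])
labels, centroids = kmeans(X, k=2)
print(labels)
```

The number of clusters k must be chosen before the analysis, and the result can depend on the initial centroids, so in practice the procedure is often repeated from several starting points.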


Eigenvalue

A mathematical quantity that provides the scaling factor for an eigenvector of a given transformation. For PCA, eigenvalues quantify the contribution of different portions of the data set to the overall measured variation.

Scores vector

The principal component vector that describes how strongly each observation projects along the principal component.

Loadings vector

The principal component vector that describes how strongly each measured signal contributes to the principal component.

Unsupervised analysis

A type of computational learning approach in which the expected output is not specified. Hierarchical clustering and PCA are unsupervised analyses.

About this article

Cite this article

Janes, K., Yaffe, M. Data-driven modelling of signal-transduction networks. Nat Rev Mol Cell Biol 7, 820–828 (2006).
