New experimental techniques are allowing the generation of complex data sets that characterize signal-transduction networks. It is no longer possible to inspect these data by intuition to extract the maximal amount of information that is embedded within them.
'Data-driven models' are mathematical approaches that provide simplified representations of complex data sets. They are based solely on analysing the data itself, without having to make any assumptions about the underlying mechanisms.
This User's guide introduces three data-driven modelling approaches: clustering, principal components analysis (PCA), and partial least squares (PLS). Clustering provides a means for data organization, whereas PCA is a method for data condensation and PLS is a technique for data prediction.
Clustering groups together observations that have similar projections in the high-dimensional space defined by the signalling variables. Similarity can be defined by several different distance metrics, such as Euclidean distance (for absolute distances) and Pearson distance (for correlations).
PCA and PLS factorize a data set into the product of two vectors (a scores vector and a loadings vector) that capture the leading components of the covariance in the data. PCA calculates scores and loadings vectors to maximize the variance that is captured from the starting data matrix. By contrast, PLS calculates scores and loadings vectors to maximize the covariance between a matrix of independent variables and a matrix of dependent variables.
Data-driven models are poised to become standard tools in analysing signalling networks as complex protein data sets become easier to acquire and more difficult to interpret.
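The difference between the two distance metrics named in the clustering key point can be illustrated with a short Python sketch. It assumes that Pearson distance is taken as one minus the Pearson correlation coefficient, which is a common convention but not one stated here, and the two time courses are hypothetical values chosen for illustration only.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two signalling profiles."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def pearson_distance(a, b):
    """One minus the Pearson correlation (assumed convention)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    r = np.corrcoef(a, b)[0, 1]
    return float(1.0 - r)

# Hypothetical time courses with the same shape but different magnitude.
weak = np.array([1.0, 2.0, 4.0, 3.0])
strong = 10.0 * weak

d_euc = euclidean_distance(weak, strong)   # large: magnitudes differ
d_pear = pearson_distance(weak, strong)    # near zero: shapes agree
```

Under these assumptions, the two profiles are far apart by Euclidean distance but essentially identical by Pearson distance, so the choice of metric determines whether they would be clustered together.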
New technologies are permitting large-scale quantitative studies of signal-transduction networks. Such data are hard to understand completely by inspection and intuition. 'Data-driven models' help users to analyse large data sets by simplifying the measurements themselves. Data-driven modelling approaches such as clustering, principal components analysis and partial least squares can derive biological insights from large-scale experiments. These models are emerging as standard tools for systems-level research in signalling networks.
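The scores and loadings factorization described for PCA and PLS can be sketched in Python with NumPy. This is a minimal illustration rather than the authors' implementation: the data are randomly generated placeholders, PCA is computed through a singular value decomposition, and PLS is reduced to its first component for a single dependent variable, for which the weight vector is proportional to the product of the transposed, centred data matrix and the centred response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 observations (rows) x 5 signals (columns),
# plus one response measured for each observation.
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 2.0]) + 0.1 * rng.normal(size=20)

# Mean-centre so that variance and covariance are measured about zero.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# PCA: the SVD gives the loadings (rows of Vt); the scores are the
# projections of each observation onto a loadings vector.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_loadings = Vt[0]             # one weight per signal
pca_scores = Xc @ pca_loadings   # one value per observation

# The outer product of scores and loadings is the best rank-one
# approximation of the centred data matrix.
X_rank1 = np.outer(pca_scores, pca_loadings)
explained = s[0] ** 2 / np.sum(s ** 2)   # fraction of variance captured

# PLS, first component, single response: the unit weight vector along
# Xc^T yc gives scores with maximal covariance with the response.
pls_weights = Xc.T @ yc
pls_weights /= np.linalg.norm(pls_weights)
pls_scores = Xc @ pls_weights
```

In this sketch the PCA direction is chosen without reference to `y`, whereas the PLS direction is chosen using `y`, which is the distinction between data condensation and data prediction.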
The work cited in this review was supported by grants from the National Institutes of Health to M.B.Y. and an American Cancer Society postdoctoral fellowship to K.A.J.
The authors declare no competing financial interests.
- Matrix
A table of numbers. Alternatively, a matrix can be viewed as an arrangement of row or column vectors.
- Vector
A mathematical quantity that has both magnitude (or length) and direction. The entries of a vector specify the magnitude of its projection in different directions.
- Linear algebra
A branch of mathematics that involves linear manipulations of vectors and matrices.
A mathematical function that can be applied to vectors and matrices.
- Row vector
A vector that is composed of one entire row of a matrix with dimensions that are specified by the matrix columns.
- Euclidean distance
A mathematical quantity that calculates the measurable geometric distance between two vectors pointing from a common origin.
- Column vector
A vector that is composed of one entire column of a matrix with dimensions that are specified by the matrix rows.
- Pearson distance
A mathematical quantity that calculates the difference in direction between two vectors pointing from a common origin.
- k-means clustering
A clustering technique in which observations are grouped around a fixed, pre-specified number of cluster centres called centroids.
- Eigenvalue
A mathematical quantity that provides the scaling factor for an eigenvector of a given transformation. For PCA, eigenvalues quantify the contribution of different portions of the data set to the overall measured variation.
- Scores vector
The principal component vector that describes how strongly each observation projects along the principal component.
- Loadings vector
The principal component vector that describes how strongly each measured signal contributes to the principal component.
- Unsupervised analysis
A type of computational learning approach in which the expected output is not specified. Hierarchical clustering and PCA are unsupervised analyses.
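The k-means entry in the glossary above can be made concrete with a minimal Python sketch. It assumes Euclidean distance, random initialization from the observations themselves, and alternating assignment and update steps. The function name `k_means`, its parameters, and the example observations are hypothetical and serve only as illustration.

```python
import numpy as np

def k_means(data, k, n_iter=100, seed=0):
    """Group observations (rows of data) around k centroids."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Start from k distinct observations chosen at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical observations forming two well-separated groups.
obs = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centroids = k_means(obs, k=2)
```

The number of clusters `k` must be supplied in advance, which is the pre-specification that the glossary entry refers to. The cluster labels themselves are arbitrary and depend on the initialization.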
About this article
Cite this article
Janes, K., Yaffe, M. Data-driven modelling of signal-transduction networks. Nat Rev Mol Cell Biol 7, 820–828 (2006). https://doi.org/10.1038/nrm2041