
  • Review Article

Linking data to models: data regression

Key Points

  • Mathematical modelling is an essential tool in systems biology. To ensure the accuracy of mathematical models, model parameters must be estimated from experimental data, a process called regression. In addition, pre-regression and post-regression diagnostics must be employed to evaluate a model's goodness-of-fit and the reliability of the estimated parameter values.

  • Maximum likelihood estimation and least-squares fitting are the most common regression schemes, yielding parameter values and their variance–covariance matrix. They work under the assumption that the estimated parameters have a normal distribution. When this assumption is not valid, Bayesian inference can be used, yielding the full parameter distribution.

  • Prior to regression, the structural identifiability of models must be assessed to determine whether model parameters can be uniquely determined and what data are required to achieve that.

  • Post-regression diagnostics include testing a model's goodness-of-fit, determining which model among competing ones fits the data best, evaluating parameter determinability and evaluating parameter significance.

  • Parameters in probabilistic models must be inferred by either indirect inference or Bayesian methods. In indirect inference, model parameters are estimated by minimizing the differences between intermediate statistics that characterize simulated and experimental data.
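The least-squares scheme summarized in the key points can be sketched in a few lines. The model, data and variable names below are our own illustration, not taken from the review: a straight line is fitted to noisy data and the variance–covariance matrix of the estimated parameters is derived from the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2.0*x + 1.0 plus Gaussian measurement noise
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Design matrix for the linear-in-parameters model y = a*x + b
X = np.column_stack([x, np.ones_like(x)])

# Least-squares estimate of (a, b)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance, using n - p degrees of freedom
n, p = X.shape
s2 = np.sum((y - X @ theta) ** 2) / (n - p)

# Variance-covariance matrix of the estimated parameters
cov = s2 * np.linalg.inv(X.T @ X)

std_errors = np.sqrt(np.diag(cov))  # standard errors of a and b
```

For a linear model with normally distributed noise, this least-squares estimate coincides with the maximum likelihood estimate; for nonlinear models the same quantities are obtained by iterative optimization rather than in closed form.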

Abstract

Mathematical models are an essential tool in systems biology, linking the behaviour of a system to the interactions between its components. Parameters in empirical mathematical models must be determined using experimental data, a process called regression. Because experimental data are noisy and incomplete, diagnostics that test the structural identifiability and validity of models and the significance and determinability of their parameters are needed to ensure that the proposed models are supported by the available data.


Figure 1: Workflow of data-driven mechanistic modelling that employs regression as well as pre-regression and post-regression diagnostics.
Figure 2: Diagrams of post-regression diagnostics.
Figure 3: Modelling of probabilistic processes.



Acknowledgements

This work was supported in part by a National Institutes of Health grant. K.J. is a Paul Sigler/Agouron fellow of the Helen Hay Whitney Foundation.

Author information


Corresponding author

Correspondence to Gaudenz Danuser.

Ethics declarations

Competing interests

The authors declare no competing financial interests.



Glossary

Structural identifiability

A model is structurally identifiable if its parameters can be uniquely estimated by fitting the model to experimental data. Structural identifiability is related to the sensitivity of process output to parameter variations.
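A toy example of the sensitivity connection (our construction, not one from the article): in the hypothetical model y = θ₁θ₂x the two parameters enter only as a product, so their sensitivity columns are collinear and neither parameter can be determined individually from data on y alone.

```python
import numpy as np

# Hypothetical model y = theta1 * theta2 * x: only the product
# theta1*theta2 affects the output.
theta1, theta2 = 2.0, 3.0
x = np.linspace(0.1, 1.0, 20)

# Sensitivity (Jacobian) of the output with respect to each parameter
J = np.column_stack([theta2 * x,    # dy/dtheta1
                     theta1 * x])   # dy/dtheta2

# The two sensitivity columns are proportional, so the Jacobian is
# rank deficient -- a hallmark of structural non-identifiability.
rank = np.linalg.matrix_rank(J)  # 1 rather than 2
```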

Variance

A measure of the dispersion of a variable around its average. Its square root is the standard deviation.

Covariance

A measure of how two variables vary relative to each other.
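Both quantities are one line each in numpy; the variables here are an illustrative simulation of ours, in which b is constructed from a plus independent noise, so cov(a, b) ≈ var(a).

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated variables: b equals a plus independent noise
a = rng.normal(size=10_000)
b = a + rng.normal(scale=0.5, size=10_000)

var_a = np.var(a, ddof=1)            # dispersion of a around its mean
std_a = np.sqrt(var_a)               # its square root: the standard deviation
cov_ab = np.cov(a, b, ddof=1)[0, 1]  # how a and b vary together
```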

Significance

A parameter is statistically significant if, given the uncertainty in its estimate caused by noise in the input data, the probability that its magnitude differs from zero not merely by chance exceeds the confidence level required by the investigator.

Determinability

A measure for the capability to infer the value of a model parameter from the available input data, independent of the values of other parameters.

Regression instability

A measure for the variation of regression results in the presence of data noise. A regression is unstable if the estimates of model parameters significantly differ when one additional data point is added to the set of input data.

Linear independence

A set of parameters is linearly independent if none of its parameters can be written as a linear combination of the other parameters.

Normal distribution

A bell-shaped distribution that is fully characterized by its mean μ and variance σ². It is usually written as N(μ, σ²).

Residual

The difference between an observation and the corresponding model prediction.

Robust

An estimation technique is said to be robust if it is insensitive to deviations in the model and the input data from the ideal assumptions about them that were used in formulating the estimation process.
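A minimal illustration of robustness with made-up numbers: the sample mean is pulled far from the bulk of the data by a single gross error, whereas the median, a robust location estimate, is barely affected.

```python
import numpy as np

# Nine well-behaved measurements plus one gross outlier
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 100.0])

mean_est = np.mean(data)      # dragged far above 10 by the outlier
median_est = np.median(data)  # barely affected: a robust estimate
```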

Outlier

A data point with an error that does not belong to the assumed distribution of measurement errors.

Lorentzian distribution

A distribution that resembles the normal distribution, but with lower probability for values that are close to the mean and higher probability for values that are farther from the mean.

Linear

The models y = α₁x + α₂x² and y = αexp(−x) are linear functions of the parameters α.

Nonlinear

The models y = (α₁x + α₂x)² and y = α₁exp(−α₂x) are nonlinear functions of the parameters α.

Closed-form solution

A solution that can be expressed analytically in terms of a finite number of operations (for example, addition, multiplication, square root, and so on).

Global optimization

The search for the lowest minimum or highest maximum of an objective function that has multiple minima or maxima. Such a function is called non-convex.
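One simple global strategy, sketched here on a toy non-convex objective of our choosing, is multi-start local optimization: run a local search from many starting points and keep the best result. A single local search can get trapped in a shallow minimum.

```python
import numpy as np

# Non-convex objective with two local minima; the global one lies near x = -1.04
def f(x):
    return (x**2 - 1.0) ** 2 + 0.3 * x

def grad(x):
    return 4.0 * x * (x**2 - 1.0) + 0.3

def descend(x0, step=0.01, iters=2000):
    """Plain gradient descent from a single starting point."""
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

# A single local search started near x = +1.5 gets trapped in the
# shallow minimum near x = +0.96 ...
local = descend(1.5)

# ... so restart from many points and keep the best candidate
starts = np.linspace(-2.0, 2.0, 9)
best = min((descend(x0) for x0 in starts), key=f)
```

More sophisticated global methods (simulated annealing, evolutionary algorithms, branch-and-bound) follow the same logic of escaping individual basins of attraction.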

Central limit theorem

The central limit theorem states that a variable that is calculated as the sum of a large number of independent variables will be approximately normally distributed, even if the individual variables themselves are not.
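The theorem is easy to see numerically. In this sketch of ours, sums of 50 uniform variables, once standardized with the known mean n/2 and variance n/12 of such a sum, place roughly 68% of values within one standard deviation, as a normal distribution would.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each trial sums 50 uniform variables, which are individually far from normal
n, trials = 50, 20_000
sums = rng.uniform(size=(trials, n)).sum(axis=1)

# Standardize using the known mean (n/2) and variance (n/12) of the sum
z = (sums - n / 2.0) / np.sqrt(n / 12.0)

# For a standard normal, about 68.3% of values fall within one sigma
frac_within_1sigma = np.mean(np.abs(z) < 1.0)
```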

Nonparametric methods

Statistical methods that do not assume an underlying distribution for the data being analysed.

Number of degrees of freedom

The number of degrees of freedom in a regression is the number of data points that were used in the regression minus the number of estimated parameters.

Null hypothesis

A statement that is tested for possible rejection under the assumption that it is true.

Alternative hypothesis

A statement that is placed in opposition to the null hypothesis.

Test-statistic

The variable calculated from the available data in order to test whether the null hypothesis can be rejected. Its distribution under the null hypothesis is usually known.

Chi-square distribution

A variable that is calculated as the sum of the squares of ν variables that are N(0,1)-distributed has a Chi-square (χ²)-distribution with ν degrees of freedom.
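The definition can be checked by simulation (our sketch): summing ν squared standard normal samples yields a variable whose mean is ν and whose variance is 2ν, the moments of the χ²-distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

nu = 5  # degrees of freedom
samples = rng.normal(size=(100_000, nu))

# Sum of nu squared N(0,1) variables: a chi-square variable with nu d.o.f.
chi2 = (samples**2).sum(axis=1)

# Its theoretical mean is nu and its variance is 2*nu
emp_mean = chi2.mean()
emp_var = chi2.var()
```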

P-value

The probability of obtaining a test-statistic at least as extreme as the one observed, assuming that the null hypothesis is true. A small P-value therefore indicates that the observed data would be improbable under the null hypothesis.
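When the null distribution of a test-statistic is not known analytically, a P-value can be estimated by Monte Carlo simulation. In this hypothetical example of ours, the statistic is the absolute sample mean of 16 observations and the null hypothesis is that the data are N(0,1); the observed value 0.5 is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Null hypothesis: the data are N(0,1). Test-statistic: the absolute sample mean.
observed_stat = 0.5  # hypothetical value from an experiment of size 16

# Simulate the test-statistic's distribution under the null hypothesis
n, trials = 16, 100_000
null_stats = np.abs(rng.normal(size=(trials, n)).mean(axis=1))

# P-value: fraction of simulated statistics at least as extreme as the observed one
p_value = np.mean(null_stats >= observed_stat)
```

Here the answer is also available analytically (the sample mean is N(0, 1/16), so the P-value is P(|Z| ≥ 2) ≈ 0.0455), which makes the simulation easy to validate.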

Significance value

The threshold below which a P-value leads to rejection of the null hypothesis.

F-distribution

A variable that is calculated as the ratio of two χ²-distributed variables, each divided by its number of degrees of freedom (ν₁ and ν₂, respectively), has an F-distribution with ν₁ and ν₂ degrees of freedom.

Trace

Sum of the diagonal elements of a matrix.

Student's t-distribution

A distribution that is similar to N(0,1), except that it has heavier tails. It is a function of the number of degrees of freedom ν, and converges to N(0,1) as ν gets larger.

Probabilistic process

A process in which the current state of a system does not uniquely determine its next state, but defines a set of possible states with their transition probabilities.

Markov chain

A chain of events in which what happens at time point t + 1 only depends on what has happened at time point t, and not on any previous time points.
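A two-state sketch of ours makes the memoryless property concrete: each transition depends only on the current state, and the long-run occupancy of the states approaches the chain's stationary distribution, which for the transition matrix below is (5/6, 1/6).

```python
import numpy as np

rng = np.random.default_rng(5)

# Two-state Markov chain: row i holds the transition probabilities from state i
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

steps = 50_000
u = rng.uniform(size=steps)  # pre-drawn uniforms driving the transitions
state = 0
visits = np.zeros(2)
for t in range(steps):
    visits[state] += 1
    # The next state depends only on the current one (the Markov property)
    state = 0 if u[t] < P[state, 0] else 1

# Empirical occupancy approaches the stationary distribution (5/6, 1/6)
occupancy = visits / steps
```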


About this article

Cite this article

Jaqaman, K., Danuser, G. Linking data to models: data regression. Nat Rev Mol Cell Biol 7, 813–819 (2006). https://doi.org/10.1038/nrm2030
