A basic problem in multivariate systems is finding predictive relations among the measurable variables in the system. The extent to which a target variable can be predicted from measurements of a set of predictor variables reveals the level of interaction among the variables. We want to quantify the degree to which the target variable is statistically determined by the predictor variables, where statistical determination does not imply physical causality. In signal processing systems, stochastic determination provides a window into the overall control structure. In the simple case of two variables possessing a bivariate normal distribution, the best predictive relation is linear and the correlation coefficient provides the desired coefficient of determination. The problem can still be framed in terms of correlation for multivariate linear systems (albeit not in so straightforward a form), but not for nonlinear systems.
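The linear case can be illustrated numerically. The sketch below assumes a particular form for the coefficient of determination, (e0 - e) / e0, where e0 is the squared error of the best constant predictor and e is that of the best linear predictor; this form is an assumption for illustration, as are the function names and the toy data. Under that assumption, the coefficient for a least-squares line equals the squared sample correlation.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def linear_determination(x, y):
    """Assumed form (e0 - e) / e0: e0 is the squared error of the best
    constant predictor (the mean of y), e that of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    e0 = sum((b - my) ** 2 for b in y)
    e = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return (e0 - e) / e0

# Toy data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)
theta = linear_determination(x, y)
print(round(r ** 2, 6), round(theta, 6))  # the two values agree
```

No comparable identity is available once the predictor is nonlinear, which is why the error-based form is the one that generalizes.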

We examine quantification of nonlinear multivariate stochastic determination among gene expression levels measured with cDNA microarrays. The method allows knowledge of other relevant conditions, such as the application of particular stimuli or the presence of inactivating gene mutations, to be incorporated as predictive elements affecting the target expression level. The approach is general and can be applied to any class of nonlinear predictor functions. In this talk, prediction is based on a ternary perceptron whose input for each predictor gene is one of three values: +1 [up-regulated], −1 [down-regulated] or 0 [invariant]. External conditions are quantified as +1 [present] or 0 [not present]. Our reason for choosing a perceptron is twofold: first, it is intuitive; second, for n predictor variables, it has only n+1 parameters to estimate and therefore requires much less data than more general nonlinear predictors.
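A minimal sketch of such a predictor follows. The text fixes only the input coding and the parameter count (n weights plus one bias); the output rule used here, a symmetric threshold at ±0.5 on the weighted sum, is an assumption made for illustration, as are the function names and the example weights.

```python
def ternary_threshold(z, cut=0.5):
    """Map a real value to -1, 0 or +1. The cut-off of 0.5 is an
    assumed choice, not one taken from the text."""
    if z > cut:
        return 1
    if z < -cut:
        return -1
    return 0

def ternary_perceptron(weights, bias, inputs):
    """Predict a ternary target from ternary gene inputs (-1, 0, +1)
    and binary condition inputs (0, 1). Uses len(weights) + 1 parameters."""
    if len(weights) != len(inputs):
        raise ValueError("one weight is required per input")
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return ternary_threshold(z)

# Hypothetical predictor: two genes plus one external condition.
weights = [1.0, -1.0, 1.0]
bias = 0.0
print(ternary_perceptron(weights, bias, [1, 0, 0]))   # gene 1 up, no stimulus
print(ternary_perceptron(weights, bias, [1, 1, 0]))   # both genes up
print(ternary_perceptron(weights, bias, [0, 1, 0]))   # gene 2 up only
```

With three inputs the predictor has four parameters, whereas an unconstrained lookup table over the same inputs would need one entry per input pattern.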

Owing to the large number of genes and a severe limitation on experimental replication when using microarrays, any individual predictor derived from the data lacks statistical significance relative to the population. Since our interest is in measuring degrees of determination, we can derive the best predictors from the sample data and then estimate the determination coefficients for various predictor sets from the same data. These estimates will be biased high, because each predictor is evaluated on the data used to design it, but they can be used as sample coefficients to quantify, in relative terms, the degrees to which various expression-level sets stochastically determine a target expression level. A more accurate, though more computationally costly, way to proceed is to employ a bootstrap method to estimate the coefficients of determination. The bias will on average be low, but we can again obtain relative quantification.
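The two estimation strategies can be contrasted in a short sketch. Several choices here are assumptions made for illustration: the coefficient is taken as (e0 - e) / e0 with misclassification rate as the error, a lookup-table predictor stands in for the perceptron to keep the code short, and the bootstrap variant trains on a resample and tests on the points left out of it. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_table(X, y):
    """Lookup-table predictor: most common target per input pattern,
    falling back to the overall most common target for unseen patterns."""
    default = Counter(y).most_common(1)[0][0]
    votes = defaultdict(Counter)
    for xi, yi in zip(X, y):
        votes[tuple(xi)][yi] += 1
    table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    return lambda xi: table.get(tuple(xi), default)

def error_rate(predict, X, y):
    """Fraction of observations predicted incorrectly."""
    return sum(predict(xi) != yi for xi, yi in zip(X, y)) / len(y)

def determination(train_X, train_y, test_X, test_y):
    """Assumed form (e0 - e) / e0, with e0 the error of the best constant."""
    predict = fit_table(train_X, train_y)
    const = Counter(train_y).most_common(1)[0][0]
    e0 = error_rate(lambda _: const, test_X, test_y)
    e = error_rate(predict, test_X, test_y)
    return (e0 - e) / e0 if e0 > 0 else 0.0

def resubstitution_estimate(X, y):
    """Design and evaluate on the same sample (optimistic)."""
    return determination(X, y, X, y)

def bootstrap_estimate(X, y, rounds=200, seed=0):
    """Average over resamples, testing on the points not drawn."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        drawn = set(idx)
        held_out = [i for i in range(n) if i not in drawn]
        if not held_out:
            continue  # every point was drawn; nothing to test on
        estimates.append(determination(
            [X[i] for i in idx], [y[i] for i in idx],
            [X[i] for i in held_out], [y[i] for i in held_out]))
    if not estimates:
        raise ValueError("no resample left any point held out")
    return sum(estimates) / len(estimates)
```

Calling both functions on the same predictor sets gives two rankings of those sets by estimated determination; the relative ordering, rather than the absolute values, is the quantity of interest.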

The entire procedure is supported by software that has a number of facilities to simplify data analysis, and which provides graphics for visualising experimental data and sequential increases in determination as additional expression levels are used for prediction.