[Figure caption] In or out: success rests on learning algorithms that are stable against slight changes in input conditions1. (Credit: Frederike Helwig/Getty Images)

A hallmark of intelligent learning is that we can apply what we have learned to new situations. In the mathematical theory of learning, this ability is called generalization. On page 419 of this issue1, Poggio et al. formulate an elegant condition for a learning system to generalize well.

As an illustration, consider practising how to hit a tennis ball. We see the trajectory of the incoming ball, and we react with complex motions of our bodies. Sometimes we hit the ball with the racket's sweet spot and send it where we want; sometimes we do less well. In the theory of supervised learning, an input–output pair, exemplified by a trajectory and the corresponding reaction, is called a training sample. A learning algorithm observes many training samples and computes a function that maps inputs to outputs. The learned function generalizes well if it does about as well on new inputs as on the old ones: if so, our performance during tennis practice is a reliable indication of how well we will play during the game.
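In code, this bookkeeping takes only a few lines. The minimal sketch below (an illustration with made-up data and names, not anything from ref. 1) learns a cubic fit from noisy training samples and then compares the average cost on the training inputs with the cost on fresh inputs; generalizing well means the two numbers are comparable.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # The unknown input-output relationship the learner is trying to capture.
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 30)                          # observed inputs
y_train = target(x_train) + 0.1 * rng.normal(size=30)    # noisy observed outputs

# The learning algorithm: fit a cubic polynomial to the training samples.
learned = np.poly1d(np.polyfit(x_train, y_train, 3))

x_new = rng.uniform(0, 1, 1000)                          # inputs never seen during training
cost_train = np.mean((learned(x_train) - y_train) ** 2)
cost_new = np.mean((learned(x_new) - target(x_new)) ** 2)
print(f"cost on training inputs: {cost_train:.3f}")
print(f"cost on new inputs:      {cost_new:.3f}")        # similar values => good generalization
```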

Given an appropriate measure for the ‘cost’ of a poor hit, the algorithm could choose the least expensive function over the set of training samples, an approach to learning called empirical risk minimization. A classical result2 in learning theory shows that the functions learned through empirical risk minimization generalize well only if the ‘hypothesis space’ from which they are chosen is simple enough. That a poor choice of hypotheses can cause trouble is a familiar concept in most scientific disciplines. For instance, a high-degree polynomial fitted to a set of data points can swing wildly between them, and these swings undermine our confidence in the polynomial's predictions about function values between the available data points. For similar reasons, we have come to trust Kepler's simple description of the elliptical motion of heavenly bodies more than the elaborate system of deferents, epicycles and equants of Ptolemy's Almagest, no matter how well the latter fit the observations.
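The swings are easy to demonstrate numerically. The short Python sketch below (a standard textbook construction, not code from refs 1 or 2) fits a degree-14 polynomial to 15 equally spaced samples of Runge's function 1/(1 + 25x^2): the fit is essentially exact at the data points, yet strays far from the underlying function between them.

```python
import numpy as np

def f(x):
    # Runge's function: smooth, but notoriously hard for high-degree equispaced fits.
    return 1.0 / (1.0 + 25.0 * x**2)

x_data = np.linspace(-1, 1, 15)                  # the available data points
coeffs = np.polyfit(x_data, f(x_data), 14)       # degree-14 polynomial through them

grid = np.linspace(-1, 1, 1000)                  # locations between the data points
err_at_data = np.max(np.abs(np.polyval(coeffs, x_data) - f(x_data)))
err_between = np.max(np.abs(np.polyval(coeffs, grid) - f(grid)))
print(f"largest error at the data points:      {err_at_data:.1e}")   # essentially zero
print(f"largest error between the data points: {err_between:.2f}")   # the wild swings
```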

The classical definition of a ‘simple enough’ hypothesis space is brilliant but technically involved. For instance, the set of linear functions defined on the plane has a complexity (or Vapnik–Chervonenkis dimension2) of three because this is the greatest number of points that can be arranged on the plane so that suitable linear functions assume any desired combination of signs (positive or negative) when evaluated at the points. This definition is a mouthful already for this simple case. Although this approach has generated powerful learning algorithms2, the complexity of hypothesis spaces for many realistic scenarios quickly becomes too hard to measure with this yardstick. In addition, not all learning problems can be formulated through empirical risk minimization, so classical results might not apply.
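The claim about linear functions on the plane can be checked directly. The sketch below (our own illustration, not code from ref. 2) verifies the shattering half of the statement, namely that the Vapnik–Chervonenkis dimension is at least three: for three non-collinear points, every one of the 2^3 = 8 combinations of signs is achieved by some linear (affine) function.

```python
import itertools
import numpy as np

# Three non-collinear points in the plane.
points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0]])
A = np.hstack([points, np.ones((3, 1))])   # each row is [x, y, 1]

for signs in itertools.product([-1.0, 1.0], repeat=3):
    # Solve a*x + b*y + c = desired sign at each point; the solution
    # then takes exactly that combination of signs.
    coeffs = np.linalg.solve(A, np.array(signs))
    assert np.array_equal(np.sign(A @ coeffs), np.array(signs))

print("All 8 sign combinations are achievable: the VC dimension is at least 3.")
```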

Poggio et al.1 propose an elegant solution to these difficulties that builds on earlier intuitions3,4,5 and shifts attention away from the hypothesis space. Instead, they require the learning algorithm to be stable if it is to produce functions that generalize well. In a nutshell, an algorithm is stable if the removal of any one training sample from any large set of samples almost always results in a small change in the learned function. Post facto, this makes intuitive sense: if removing one sample has little consequence (stability), then adding a new one should cause little surprise (generalization). For example, we expect that adding or removing an observation in Kepler's catalogue will usually not perturb his laws of planetary motion substantially.
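The flavour of the criterion can be captured in a few lines of code. The rough sketch below (our own construction, loosely inspired by, but not equivalent to, the formal definition in ref. 1) refits a ridge-regression learner with each training sample deleted in turn and records the largest resulting change in the learned function; in a typical run, the more strongly regularized learner moves much less.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 10))                            # few samples, many features
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=12)  # noisy linear outputs

def learn(X, y, reg):
    # Ridge regression: minimize squared error plus reg * ||w||^2.
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)

X_eval = rng.normal(size=(200, 10))                      # inputs on which the functions are compared

for reg in (0.01, 10.0):
    f_full = X_eval @ learn(X, y, reg)                   # function learned from all samples
    change = max(
        np.max(np.abs(X_eval @ learn(np.delete(X, i, axis=0), np.delete(y, i), reg) - f_full))
        for i in range(len(y))                           # remove each sample in turn
    )
    print(f"regularization {reg:5.2f}: largest leave-one-out change = {change:.3f}")
```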

The simplicity and generality of the stability criterion promises practical utility. For example, neuronal synapses in the brain may have to adapt (learn) with little or no memory of past training samples. In these cases, empirical risk minimization does not help, because computing the empirical risk requires access to all past inputs and outputs. In contrast, stability is a natural criterion to use in this context, because it implies predictable behaviour. In addition, stability could conceivably lead to a so-called online algorithm — that is, one that improves its output as new data become available.
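To illustrate the last point, here is a hypothetical sketch of an online learner (ours, not an algorithm proposed in ref. 1): a least-mean-squares update touches each incoming sample once and then forgets it, so nothing like the full empirical risk ever needs to be stored or recomputed.

```python
import numpy as np

def online_lms(stream, dim, step=0.1):
    # One small gradient step per incoming (x, y) pair; no sample is stored.
    w = np.zeros(dim)
    for x, y in stream:
        w += step * (y - w @ x) * x
    return w

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
stream = ((x, true_w @ x + 0.01 * rng.normal())
          for x in rng.normal(size=(500, 2)))       # data arriving one sample at a time
print(online_lms(stream, dim=2))                    # close to [2, -1]
```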

Of course, stability is not the whole story, just as being able to predict our tennis performance does not mean that we will play well. If after practice we play as well as the best game contemplated in our hypothesis space, then our learning algorithm is said to be consistent. Poggio et al.1 show that stability is equivalent to consistency for empirical risk minimization, whereas for other learning approaches stability only ensures good generalization. Even so, stability can become a practically important learning tool, as long as some key challenges are met. Specifically, Poggio et al.1 define stability in asymptotic form, by requiring certain limits to vanish as the size of the training set becomes large. In addition, they require this to be the case for all possible probabilistic distributions of the training samples. True applicability to real situations will depend on how well these results can be rephrased for finite set sizes. In other words, can useful measures of stability and generalization be estimated from finite training samples? And is it feasible to develop statistical confidence tests for them? A new, exciting research direction has been opened.