Introduction

When we learn something new, like a language or musical instrument, we often seek challenges at the edge of our competence—not so hard that we are discouraged, but not so easy that we get bored. This simple intuition, that there is a sweet spot of difficulty, a ‘Goldilocks zone’1, for motivation and learning is at the heart of modern teaching methods2 and is thought to account for differences in infant attention between more and less learnable stimuli1. In the animal learning literature it is the intuition behind shaping3 and fading4, whereby complex tasks are taught by steadily increasing the difficulty of a training task. It is also observable in the nearly universal ‘levels’ feature in video games, in which the player is encouraged, or even forced, to a higher level of difficulty once a performance criterion has been achieved. Similarly in machine learning, steadily increasing the difficulty of training has proven useful for teaching large scale neural networks in a variety of tasks5,6, where it is known as ‘Curriculum Learning’7 and ‘Self-Paced Learning’8.

Despite this long history of empirical results, it remains unclear why a particular difficulty level may be beneficial for learning, or what that optimal level might be. In this paper we address this issue of optimal training difficulty for a broad class of learning algorithms in the context of binary classification tasks, in which ambiguous stimuli must be classified into one of two classes (e.g., cat or dog).

In particular, we focus on the class of stochastic gradient-descent based learning algorithms. In these algorithms, parameters of the model (e.g., the weights in a neural network) are adjusted based on feedback in such a way as to reduce the average error rate over time9. That is, these algorithms descend the gradient of error rate as a function of model parameters. Such gradient-descent learning forms the basis of many algorithms in AI, from single-layer perceptrons to deep neural networks10, and provides a quantitative description of human and animal learning in a variety of situations, from perception11, to motor control12, to reinforcement learning13. For these algorithms, we provide a general result for the optimal difficulty in terms of a target error rate for training. Under the assumption of a Gaussian noise process underlying the errors, this optimal error rate is around 15.87%, a number that varies slightly depending on the noise in the learning process. That is, the optimal accuracy for training is around 85%. We show theoretically that training at this optimal difficulty can lead to exponential improvements in the rate of learning. Finally, we demonstrate the applicability of the Eighty Five Percent Rule to artificial one- and two-layer neural networks9,14, and a model from computational neuroscience that is thought to describe human and animal perceptual learning11.

Results

Optimal training difficulty for binary classification tasks

In a standard binary classification task, an animal or machine ‘agent’ makes binary decisions about simple stimuli. For example, in the classic Random Dot Motion paradigm from Psychology and Neuroscience15,16, stimuli consist of a patch of moving dots—most moving randomly but a small fraction moving coherently either to the left or the right—and participants must decide in which direction the coherent dots are moving. A major factor in determining the difficulty of this perceptual decision is the fraction of coherently moving dots, which can be manipulated by the experimenter to achieve a fixed error rate during training using a procedure known as ‘staircasing’17.

We assume that agents make their decision on the basis of a scalar, subjective decision variable, h, which is computed from a stimulus that can be represented as a vector x (e.g., the direction of motion of all dots)

$$h = \Phi ({\mathbf{x}},{\boldsymbol{\phi }})$$
(1)

where Φ() is a function of the stimulus and (tunable) parameters ϕ. We assume that this transformation of stimulus x into the subjective decision variable h yields a noisy representation of the true decision variable, Δ (e.g., the fraction of dots moving left). That is, we write

$$h = \Delta + n$$
(2)

where the noise, n, arises due to the imperfect representation of the decision variable. We further assume that this noise, n, is random and sampled from a zero-mean Gaussian distribution with standard deviation σ (Fig. 1a).

If the decision boundary is set to 0, such that the model chooses option A when h > 0, option B when h < 0 and randomly when h = 0, then the noise in the representation of the decision variable leads to errors with probability

$${\mathrm{ER}} = {\int_{ - \infty }^0} p (h|\Delta ,\sigma )\mathrm{d}h = F( - \Delta /\sigma ) = F( - \beta \Delta )$$
(3)

where F(x) is the cumulative distribution function of the standardized noise distribution, p(x) = p(x|0, 1), and β = 1/σ quantifies the precision of the representation of Δ and the agent’s skill at the task. As shown in Fig. 1b, this error rate decreases as the decision gets easier (Δ increases) and as the agent becomes more accomplished at the task (β increases).
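As a minimal numerical sketch of Eq. (3), assuming Gaussian noise, the error rate can be evaluated with the standard normal cumulative distribution function. The function name `error_rate` and the example values of Δ and β are illustrative choices, not taken from the paper.

```python
from statistics import NormalDist


def error_rate(delta: float, beta: float) -> float:
    """Eq. (3) for Gaussian noise: ER = F(-beta * delta)."""
    return NormalDist().cdf(-beta * delta)


# Error rate falls as the decision gets easier (larger delta)
# and as the agent becomes more precise (larger beta).
for beta in (0.5, 1.0, 2.0):
    rates = [round(error_rate(delta, beta), 3) for delta in (0.0, 0.5, 1.0, 2.0)]
    print(f"beta={beta}: {rates}")
```

At Δ = 0 the stimulus carries no information, so the error rate is 0.5 regardless of precision.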

The goal of learning is to tune the parameters ϕ such that the subjective decision variable, h, is a better reflection of the true decision variable, Δ. That is, the model should aim to adjust the parameters ϕ so as to decrease the magnitude of the noise σ or, equivalently, increase the precision β. One way to achieve this tuning is to adjust the parameters using gradient descent on the error rate, i.e. changing the parameters over time t according to

$$\frac{{\mathrm{d}{\boldsymbol{\phi }}}}{{\mathrm{d}t}} = - \eta \nabla _{\boldsymbol{\phi }}{\mathrm{ER}}$$
(4)

where η is the learning rate and ∇ϕER is the derivative of the error rate with respect to parameters ϕ. This gradient can be written in terms of the precision, β, as

$$\nabla _{\boldsymbol{\phi }}{\mathrm{ER}} = \frac{{\partial {\mathrm{ER}}}}{{\partial \beta }}\nabla _{\boldsymbol{\phi }}\beta$$
(5)

Note here that only the first term on the right hand side of Eq. (5) depends on the difficulty Δ, while the second describes how the precision changes with ϕ. Note also that Δ itself, as the ‘true’ decision variable, is independent of ϕ. This means that the optimal difficulty for training, that maximizes the change in the parameters, ϕ, at this time point, is the value of the decision variable Δ* that maximizes ∂ER/∂β. Of course, this analysis ignores the effect of changing ϕ on the form of the noise, instead assuming that it only changes the scale factor, β, an assumption that likely holds in the relatively simple cases we consider here, although whether it holds in more complex cases will be an important question for future work.

In terms of the decision variable, the optimal difficulty changes as a function of precision (Fig. 1c), meaning that the difficulty of training must be adjusted online according to the skill of the agent. Using the monotonic relationship between Δ and ER (Fig. 1b) it is possible to express the optimal difficulty in terms of the error rate, ER* (Fig. 1d). Expressed this way, the optimal difficulty is constant as a function of precision, meaning that optimal learning can be achieved by clamping the error rate during training at a fixed value, which, for Gaussian noise, is

$${\mathrm{ER}}^ \ast = \frac{1}{2}\left( {1 - {\mathrm{erf}}\left( {\frac{1}{{\sqrt 2 }}} \right)} \right) \approx 0.1587$$
(6)

That is, the optimal error rate for learning is 15.87%, and the optimal accuracy is around 85%. We call this the Eighty Five Percent Rule for optimal learning.
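The following sketch checks this result numerically under the Gaussian assumption. It grid-searches for the difficulty Δ that maximizes the magnitude of the learning gradient, Δ·p(−βΔ), and converts that difficulty to an error rate. The function names, grid size, and search range are assumptions of this illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gradient_magnitude(delta: float, beta: float) -> float:
    """Magnitude of dER/dbeta for Gaussian noise: delta * p(-beta * delta)."""
    return delta * STD_NORMAL.pdf(-beta * delta)


def optimal_difficulty(beta: float, n_grid: int = 20_000, delta_max: float = 10.0) -> float:
    """Grid search for the delta that maximizes the gradient magnitude."""
    grid = (delta_max * i / n_grid for i in range(1, n_grid + 1))
    return max(grid, key=lambda delta: gradient_magnitude(delta, beta))


def optimal_error_rate() -> float:
    """Closed form of Eq. (6)."""
    return 0.5 * (1.0 - math.erf(1.0 / math.sqrt(2.0)))


# The optimal difficulty moves with beta, but the error rate at that difficulty does not.
for beta in (0.5, 1.0, 2.0, 4.0):
    delta_star = optimal_difficulty(beta)
    er_star = STD_NORMAL.cdf(-beta * delta_star)
    print(f"beta={beta}: delta*={delta_star:.3f}, ER*={er_star:.4f}")
```

The grid search is deliberately naive; a root finder on the derivative would serve equally well.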

Dynamics of learning

While the previous analysis allows us to calculate the error rate that maximizes the rate of learning, it does not tell us how much faster learning occurs at this optimal error rate. In this section we address this question by comparing learning at the optimal error rate with learning at a fixed error rate, ERf (which may be suboptimal), and, alternatively, a fixed difficulty, Δf. If stimuli are presented one at a time (i.e., not batch learning), in both cases, gradient-descent based updating of the parameters, ϕ, (Eq. (4)) implies that the precision β evolves in a similar manner, i.e.,

$$\frac{{\mathrm{d}\beta }}{{\mathrm{d}t}} = - \eta \frac{{\partial {\mathrm{ER}}}}{{\partial \beta }}$$
(7)

For fixed error rate, ERf, as shown in the Methods, integrating Eq. (7) gives

$$\beta (t) = \sqrt {\beta _0^2 + 2\eta K_{\mathrm{f}}(t - t_0)}$$
(8)

where t0 is the initial time point, β0 is the initial value of β and Kf is the following function of the training error rate

$$K_{\mathrm{f}} = - F^{ - 1}({\mathrm{ER}}_{\mathrm{f}})p(F^{ - 1}({\mathrm{ER}}_{\mathrm{f}}))$$
(9)

Thus, for fixed training error rate the precision grows as the square root of time with the exact rate determined by Kf which depends on both the training error rate and the noise distribution.
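A small sketch, assuming Gaussian noise, that evaluates Kf from Eq. (9) and compares the closed-form solution of Eq. (8) with a forward-Euler integration of dβ/dt = ηKf/β (the form derived in the Methods). The function names, step size, and default parameter values are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def k_f(er_f: float) -> float:
    """Eq. (9) for Gaussian noise: K_f = -F^{-1}(ER_f) * p(F^{-1}(ER_f))."""
    z = STD_NORMAL.inv_cdf(er_f)  # negative for ER_f < 0.5
    return -z * STD_NORMAL.pdf(z)


def beta_closed_form(t: float, er_f: float, beta0: float = 1.0,
                     eta: float = 1.0, t0: float = 0.0) -> float:
    """Eq. (8): precision after training at a fixed error rate."""
    return math.sqrt(beta0 ** 2 + 2.0 * eta * k_f(er_f) * (t - t0))


def beta_euler(t_end: float, er_f: float, beta0: float = 1.0,
               eta: float = 1.0, dt: float = 1e-3) -> float:
    """Forward-Euler integration of d(beta)/dt = eta * K_f / beta from t = 0."""
    rate = eta * k_f(er_f)
    beta = beta0
    for _ in range(round(t_end / dt)):
        beta += dt * rate / beta
    return beta


for er_f in (0.05, 0.1587, 0.3):
    print(f"ER_f={er_f}: K_f={k_f(er_f):.4f}, "
          f"closed={beta_closed_form(10.0, er_f):.4f}, euler={beta_euler(10.0, er_f):.4f}")
```

Kf is largest near ERf ≈ 0.1587 and vanishes at ERf = 0.5, where the agent is guessing.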

For fixed decision variable, Δf, integrating Eq. (7) is more difficult and the solution depends more strongly on the distribution of the noise. In the case of Gaussian noise, there is no closed form solution for β. However, as shown in the Methods, an approximate form can be derived at long times where we find that β grows as

$$\beta (t) \propto \sqrt {\log t}$$
(10)

i.e., exponentially slower than Eq. (38).
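The contrast between the two regimes can be illustrated by integrating the precision dynamics numerically. This sketch assumes Gaussian noise and writes the update as dβ/dt = ηΔ·p(−βΔ), with the learning rate η included explicitly. In the clamped case Δ is reset to 1/β on every step so that the error rate stays at its optimum; in the fixed case Δ stays at Δf. The function name, step size, and horizon are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simulate_precision(t_end, delta_f=None, beta0=1.0, eta=1.0, dt=1e-2):
    """Forward-Euler integration of d(beta)/dt = eta * delta * p(-beta * delta).

    delta_f=None clamps the error rate at the Gaussian optimum (beta * delta = 1);
    otherwise the difficulty is held fixed at delta_f.
    """
    beta = beta0
    for _ in range(round(t_end / dt)):
        delta = 1.0 / beta if delta_f is None else delta_f
        beta += dt * eta * delta * STD_NORMAL.pdf(-beta * delta)
    return beta


for t_end in (10, 100, 1000):
    clamped = simulate_precision(t_end)
    fixed = simulate_precision(t_end, delta_f=1.0)
    print(f"t={t_end}: clamped beta={clamped:.2f}, fixed-difficulty beta={fixed:.2f}")
```

Both runs start at the same difficulty (Δ = 1 with β0 = 1), so any gap between them comes solely from adapting the difficulty as precision improves.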

Simulations

To demonstrate the applicability of the Eighty Five Percent Rule we simulated the effect of training accuracy on learning in three cases, two from AI and one from computational neuroscience. From AI we consider how training at 85% accuracy impacts learning in the simple case of a one-layer Perceptron14 with artificial stimuli, and in the more complex case of a two-layer neural network9 with stimuli drawn from the MNIST (Modified National Institute of Standards and Technology) dataset of handwritten digits18. From computational neuroscience we consider the model of Law and Gold11, that accounts for both the behavior and neural firing properties of monkeys learning the Random Dot Motion task. In all cases we see that learning is maximized when training occurs at 85% accuracy.

Perceptron with artificial stimuli

The Perceptron is a classic one-layer neural network model that learns to map multidimensional stimuli x onto binary labels, y via a linear threshold process14. To implement this mapping, the Perceptron first computes the decision variable h as

$$h = {\mathbf{w}}\cdot {\mathbf{x}}$$
(11)

where w are the weights of the network, and then assigns the label according to

$$y = \left\{ {\begin{array}{*{20}{c}} 1 & {h \, > \, 0} \\ 0 & {h \le 0} \end{array}} \right.$$
(12)

The weights, w, which constitute the parameters of the model, are updated based on feedback about the true label t by the learning rule,

$${\mathbf{w}} \leftarrow {\mathbf{w}} + (t - y){\mathbf{x}}$$
(13)

This learning rule implies that the Perceptron only updates its weights when the predicted label y does not match the actual label t—that is, the Perceptron only learns when it makes mistakes. Naively then, one might expect that optimal learning would involve maximizing the error rate. However, because Eq. (13) is closely related (albeit not identical) to a gradient descent based rule (e.g., Chapter 39 in ref. 19), the analysis of the previous sections applies and the optimal error rate for training is 15.87%.
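Eqs. (11)–(13) can be sketched in a few lines. This is a generic illustration of the Perceptron rule, not the simulation code used for Fig. 2: the teacher weights, Gaussian stimuli, trial count, and function names are assumptions, and no attempt is made here to clamp the training error rate.

```python
import random


def predict(w, x):
    """Eqs. (11) and (12): threshold the weighted sum h = w . x at zero."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if h > 0 else 0


def update(w, x, t):
    """Eq. (13): w <- w + (t - y) x, which changes w only when y != t."""
    y = predict(w, x)
    return [wi + (t - y) * xi for wi, xi in zip(w, x)]


def train(teacher, n_trials=2000, seed=0):
    """Train on random stimuli labelled by a hypothetical teacher weight vector."""
    rng = random.Random(seed)
    w = [0.0] * len(teacher)
    n_errors = 0
    for _ in range(n_trials):
        x = [rng.gauss(0.0, 1.0) for _ in teacher]
        t = predict(teacher, x)
        n_errors += int(predict(w, x) != t)
        w = update(w, x, t)
    return w, n_errors


weights, n_errors = train([1.0, -2.0, 0.5])
print("learned weights:", [round(wi, 2) for wi in weights], "errors:", n_errors)
```

Because the weights move only on error trials, the learned vector drifts toward the teacher's direction while its overall scale remains arbitrary.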

To test this prediction we simulated the Perceptron learning rule for a range of training error rates between 0.01 and 0.5 in steps of 0.01 (1000 simulations per error rate, 1000 trials per simulation). Error rate was kept constant by varying the difficulty, and the degree of learning was captured by the precision β (see Methods). As predicted by the theory, the network learns most effectively when trained at the optimal error rate (Fig. 2a) and the dynamics of learning are well described, up to a scale factor, by Eq. (38) (Fig. 2b).

Two-layer network with MNIST stimuli

As a more demanding test of the Eighty Five Percent Rule, we consider the case of a two-layer neural network applied to more realistic stimuli from the Modified National Institute of Standards and Technology (MNIST) dataset of handwritten digits18. The MNIST dataset is a labeled dataset of 70,000 images of handwritten digits (0 through 9) that has been widely used as a test of image classification algorithms (see ref. 20 for a list). The dataset is broken down into a training set consisting of 60,000 images and a test set of 10,000 images. To create binary classification tasks based on these images, we trained the network to classify the images according to either the parity (odd or even) or magnitude (less than 5 or not) of the number.

The network itself consisted of 1 input layer, with 400 units corresponding to the pixel values in the images, 1 hidden layer, with 50 neurons, and one output unit. Unlike the Perceptron, activity of the output unit was graded and was determined by a sigmoid function of the decision variable, h

$$y = \frac{1}{{1 + \exp \left( h \right)}} = S(h)$$
(14)

where the decision variable was given by

$$h = {\mathbf{w}}_2\cdot {\mathbf{a}}$$
(15)

where w2 were the weights connecting the hidden layer to the output units and a was the activity in the hidden layer. This hidden-layer activity was also determined by a sigmoidal function

$${\mathbf{a}} = S({\mathbf{w}}_1\cdot {\mathbf{x}})$$
(16)

where the inputs, x, correspond to the pixel values in the image and w1 were the weights from the input layer to the hidden layer.

All weights were trained using the Backpropagation algorithm9 which takes the error,

$$e = t - y$$
(17)

and propagates it backwards through the network, from output to input stage, as a teaching signal for the weights. This algorithm implements stochastic gradient descent and, if our assumptions are met, should optimize learning at a training accuracy of 85%.

To test this prediction we trained the two-layer network for 5000 trials to perform either the Parity or the Magnitude Task while clamping the training error rate between 5 and 30% (Fig. 3). After training, performance was assessed on the entire test set and the whole process was repeated 1000 times for each task. As shown in Fig. 3, training error rate has a relatively large effect on test accuracy, around 10% between the best and worst training accuracies. Moreover, for both tasks, the optimal training occurs at 85% training accuracy. This suggests that the 85% rule holds even for learning of more realistic stimuli, by more complex multi-layered networks.

Biologically plausible model of perceptual learning

To demonstrate how the Eighty Five Percent Rule might apply to learning in biological systems, we simulated the Law and Gold model of perceptual learning11. This model has been shown to capture the long term changes in behavior, neural firing and synaptic weights as monkeys learn to perform the Random Dot Motion task.

Specifically, the model assumes that monkeys make the perceptual decision between left and right on the basis of neural activity in area MT—an area in the dorsal visual stream that is known to represent motion information15. In the Random Dot Motion task, neurons in MT have been found to respond to both the direction θ and coherence COH of the dot motion stimulus such that each neuron responds most strongly to a particular ‘preferred’ direction and that the magnitude of this response increases with coherence. This pattern of firing is well described by a simple set of equations (see “Methods”) and thus the noisy population response, x, to a stimulus of arbitrary direction and coherence is easily simulated.

From this MT population response, Law and Gold proposed that animals construct a decision variable in a separate area of the brain (lateral interparietal area, LIP) as the weighted sum of activity in MT; i.e.,

$$h = {\mathbf{w}}\cdot {\mathbf{x}} + \epsilon$$
(18)

where w are the weights between MT and LIP neurons and ϵ is random neuronal noise that cannot be reduced by learning. The presence of this irreducible neural noise is a key difference between the Law and Gold model (Eq. 18) and the Perceptron (Eq. 11) as it means that no amount of learning can lead to perfect performance. However, as shown in the Methods section, the presence of irreducible noise does not change the optimal accuracy for learning which is still 85%.

Another difference between the Perceptron and the Law and Gold model is the form of the learning rule. In particular, weights are updated according to a reinforcement learning rule based on a reward prediction error

$$\delta = r - E[r]$$
(19)

where r is the reward presented on the current trial (1 for a correct answer, 0 for an incorrect answer) and E[r] is the predicted reward

$$E[r] = \frac{1}{{1 + \exp ( - B|h|)}}$$
(20)

where B is a proportionality constant that is estimated online by the model (see “Methods”). Given the prediction error, the model updates its weights according to

$${\mathbf{w}} \leftarrow {\mathbf{w}} + \eta C\delta {\mathbf{x}}$$
(21)

where C is the choice (−1 for left, +1 for right) and η is the learning rate. Despite the superficial differences with the Perceptron learning rule (Eq. (13)) the Law and Gold model still implements stochastic gradient descent on the error rate13 and learning should be optimized at 85%.
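A single trial of this update (Eqs. (18)–(21)) might be sketched as follows. This is an illustration only: the constant B is passed in as a fixed value rather than estimated online, ties at h = 0 are resolved to the left choice, and the function and argument names are hypothetical.

```python
import math
import random


def expected_reward(h: float, b: float) -> float:
    """Eq. (20): predicted reward as a logistic function of |h|."""
    return 1.0 / (1.0 + math.exp(-b * abs(h)))


def law_gold_trial(w, x, correct_choice, eta, b, noise_sd, rng):
    """Run one trial and return (new weights, choice, prediction error).

    correct_choice is -1 (left) or +1 (right); b is held fixed here (an assumption).
    """
    # Eq. (18): weighted sum of MT activity plus irreducible noise.
    h = sum(wi * xi for wi, xi in zip(w, x)) + rng.gauss(0.0, noise_sd)
    choice = 1 if h > 0 else -1
    reward = 1.0 if choice == correct_choice else 0.0
    delta = reward - expected_reward(h, b)                          # Eq. (19)
    new_w = [wi + eta * choice * delta * xi for wi, xi in zip(w, x)]  # Eq. (21)
    return new_w, choice, delta


rng = random.Random(0)
w, choice, delta = law_gold_trial([0.1, -0.1], [1.0, 0.2], +1,
                                  eta=0.05, b=1.0, noise_sd=0.5, rng=rng)
print("choice:", choice, "prediction error:", round(delta, 3))
```

A correct choice yields a positive prediction error and strengthens the weights in the direction of that choice; an error yields a negative prediction error and weakens them.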

To test this prediction we simulated the model at a variety of different target training error rates. Each target training rate was simulated 100 times with different parameters for the MT neurons (see “Methods”). The precision, β, of the trained network was estimated by fitting simulated behavior of the network on a set of test coherences that varied logarithmically between 1 and 100%. As shown in Fig. 4a the precision after training is well described (up to a scale factor) by the theory. In addition, in Fig. 4b, we show the expected difference in behavior—in terms of psychometric choice curves—for three different training error rates. While these differences are small, they are large enough that they could be distinguished experimentally.

Discussion

In this article we considered the effect of training accuracy on learning in the case of binary classification tasks and stochastic gradient-descent-based learning rules. We found that the rate of learning is maximized when the difficulty of training is adjusted to keep the training accuracy at around 85%. We showed that training at the optimal accuracy proceeds exponentially faster than training at a fixed difficulty. Finally we demonstrated the efficacy of the Eighty Five Percent Rule in the case of artificial and biologically plausible neural networks.

Our results have implications for a number of fields. Perhaps most directly, our findings move towards a theory for identifying the optimal environmental settings in order to maximize the rate of gradient-based learning. Thus the Eighty Five Percent Rule should hold for a wide range of machine learning algorithms including multilayered feedforward and recurrent neural networks (e.g., ‘deep learning’ networks using backpropagation9 and reservoir computing networks21,22, as well as Perceptrons). Of course, in these more complex situations, our assumptions may not always be met. For example, as shown in the Methods, relaxing the assumption that the noise is Gaussian leads to changes in the optimal training accuracy: from 85% for Gaussian, to 82% for Laplacian noise, to 75% for Cauchy noise (Eq. (31) in the “Methods”).

More generally, extensions to this work should consider how batch-based training changes the optimal accuracy, and how the Eighty Five Percent Rule changes when there are more than two categories. In batch learning, the optimal difficulty to select for the examples in each batch will likely depend on the rate of learning relative to the size of the batch. If learning is slow, then selecting examples in a batch that satisfy the 85% rule may work, but if learning is fast, then mixing in more difficult examples may be best. For multiple categories, it is likely possible to perform similar analyses, although the mapping between decision variable and categories will be more complex as will be the error rates which could be category specific (e.g., misclassifying category 1 as category 2 instead of category 3).

In Psychology and Cognitive Science, the Eighty Five Percent Rule accords with the informal intuition of many experimentalists that participant engagement is often maximized when tasks are neither too easy nor too hard. Indeed it is notable that staircasing procedures (that aim to titrate task difficulty so that error rate is fixed during learning) are commonly designed to produce about 80–85% accuracy17. Similarly, when given a free choice about the difficulty of the task they can perform, participants will spontaneously choose tasks of intermediate difficulty levels as they learn23. Despite the prevalence of this intuition, to the best of our knowledge no formal theoretical work has addressed the effect of training accuracy on learning, a test of which is an important direction for future work.

More generally, our work closely relates to the Region of Proximal Learning and Desirable Difficulty frameworks in education24,25,26 and Curriculum Learning and Self-Paced Learning7,8 in computer science. These related, but distinct, frameworks propose that people and machines should learn best when training tasks involve just the right amount of difficulty. In the Desirable Difficulties framework, the difficulty in the task must be of a ‘desirable’ kind, such as spacing practice over time, that promotes learning as opposed to an undesirable kind that does not. In the Region of Proximal Learning framework, which builds on early work by Piaget27 and Vygotsky28, this optimal difficulty is in a region of difficulty just beyond the person’s current ability. Curriculum and Self-Paced Learning in computer science build on similar intuitions, that machines should learn best when training examples are presented in order from easy to hard. In practice, the optimal difficulty in all of these domains is determined empirically and is often dependent on many factors29. In this context, our work offers a way of deriving the desired difficulty and the region of proximal learning in the special case of binary classification tasks for which stochastic gradient-descent learning rules apply. As such our work represents the first step towards a more mathematical instantiation of these theories, although it remains to be generalized to a broader class of circumstances, such as multi-choice tasks and different learning algorithms.

With regard to different learning algorithms, it is important to note that not all models will exhibit a sweet spot of difficulty for learning. As an example, consider how a Bayesian learner with a perfect memory would infer parameters ϕ by computing the posterior distribution given past stimuli, x1:t, and labels, y1:t,

$$\begin{array}{r}p({\boldsymbol{\phi }}|{\mathbf{x}}_{1:t},y_{1:t}) \propto p(y_{1:t}|{\boldsymbol{\phi }},{\mathbf{x}}_{1:t})p({\boldsymbol{\phi }})\\ = \mathop {\prod}\limits_{i = 1}^t p (y_i|{\boldsymbol{\phi }},{\mathbf{x}}_i)p({\boldsymbol{\phi }})\end{array}$$
(22)

where the last line holds when the label depends only on the current stimulus. Clearly this posterior distribution over parameters is independent of the ordering of the trials meaning that a Bayesian learner (with perfect memory) would learn equally well if hard or easy examples are presented first. This is not to say that Bayesian learners cannot benefit from carefully constructed training sets, but that for a given set of training items the order of presentation has no bearing on what is ultimately learned. This contrasts markedly with gradient-based algorithms, many of which try to approximate the maximum a posteriori solution of a Bayesian model, whose training is order dependent and whose learning is optimized by maximizing ∂ER/∂β.
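The order independence of Eq. (22) can be checked with a toy grid posterior. In this sketch the only parameter is the precision β, the prior is flat over a grid, and each trial contributes the likelihood of a correct or incorrect response under Gaussian noise. All of these modelling choices, along with the function names, are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def simulate_trials(beta_true, n_trials, seed=0):
    """Generate (difficulty, correct) pairs from a Gaussian-noise observer."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        delta = rng.uniform(0.1, 2.0)
        h = delta + rng.gauss(0.0, 1.0 / beta_true)
        trials.append((delta, h > 0))
    return trials


def posterior(trials, betas):
    """Normalized posterior over a grid of beta values with a flat prior."""
    cdf = NormalDist().cdf
    log_post = []
    for beta in betas:
        total = 0.0
        for delta, correct in trials:
            p_error = cdf(-beta * delta)
            total += math.log(1.0 - p_error if correct else p_error)
        log_post.append(total)
    peak = max(log_post)
    unnorm = [math.exp(lp - peak) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


betas = [0.1 * k for k in range(1, 51)]
trials = simulate_trials(beta_true=2.0, n_trials=200)
easy_first = sorted(trials, key=lambda trial: -trial[0])
hard_first = sorted(trials, key=lambda trial: trial[0])
gap = max(abs(a - b) for a, b in zip(posterior(easy_first, betas),
                                     posterior(hard_first, betas)))
print("largest posterior difference between orderings:", gap)
```

Any residual difference between the two orderings reflects floating-point summation order only, not the inference itself.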

Finally, we note that our analysis for maximizing the gradient, ∂ER/∂β, applies not only to learning but also to any process that affects the precision of neural representations, such as attention, engagement, or more generally cognitive control30,31. For example, attention is known to improve the precision with which sensory stimuli are represented in the brain, e.g., ref. 32. If exerting control leads to a change in precision of δβ, then the change in error rate associated with exerting this control is

$$\delta {\mathrm{ER}} = \frac{{\partial {\mathrm{ER}}}}{{\partial \beta }}\delta \beta$$
(23)

This predicts that the benefits of engaging cognitive control should be maximized when ∂ER/∂β is maximized, that is, at ER*. More generally this relates to the Expected Value of Control theory30,31,33 which suggests that the learning gradient, ∂ER/∂β, is monitored by control-related areas of the brain such as anterior cingulate cortex.

Along similar lines, our work points to a mathematical theory of the state of ‘Flow’34. This state, ‘in which an individual is completely immersed in an activity without reflective self-consciousness but with a deep sense of control’ [ref. 35, p. 1], is thought to occur most often when the demands of the task are well matched to the skills of the participant. This idea of balance between skill and challenge was captured originally with a simple conceptual diagram (Fig. 5) with two other states: ‘anxiety’ when challenge exceeds skill and ‘boredom’ when skill exceeds challenge. These three qualitatively different regions (flow, anxiety, and boredom) arise naturally in our model. Identifying the precision, β, with the level of skill and the level of challenge with the inverse of the true decision variable, 1/Δ, we see that when challenge equals skill, flow is associated with a high learning rate and accuracy, anxiety with low learning rate and accuracy, and boredom with high accuracy but low learning rate (Fig. 5b, c). Intriguingly, recent work by Vuorre and Metcalfe has found that subjective feelings of Flow peak on tasks that are subjectively rated as being of intermediate difficulty36. In addition, work on learning to control brain-computer interfaces finds that subjective, self-reported measures of ‘optimal difficulty’ peak at a difficulty associated with maximal learning, and not at a difficulty associated with optimal decoding of neural activity37. Going forward, it will be interesting to test whether these subjective measures of engagement peak at the point of maximal learning gradient, which for binary classification tasks occurs at 85% accuracy.

Methods

Optimal error rate for learning

In order to compute the optimal difficulty for training, we need to find the value of Δ that maximizes the learning gradient, ∂ER/∂β. From Eq. (3) we have

$$\frac{{\partial {\mathrm{ER}}}}{{\partial \beta }} = - \Delta p( - \beta \Delta )$$
(24)

From here the optimal difficulty, Δ*, can be found by computing the derivative of the gradient with respect to Δ, i.e.,

$$\frac{\partial }{{\partial \Delta }}\frac{{\partial {\mathrm{ER}}}}{{\partial \beta }} = - \frac{\partial }{{\partial \Delta }}\left( {\Delta p( - \beta \Delta )} \right)\\ = - p( - \beta \Delta ) + \beta \Delta \left. {\frac{{\partial p(x)}}{{\partial x}}} \right|_{x = - \beta \Delta }$$
(25)

Setting this derivative equal to zero gives us the following expression for the optimal difficulty, Δ*, and error rate, ER*

$$\beta \Delta ^ \ast = \frac{{p( - \beta \Delta ^ \ast )}}{{p\prime ( - \beta \Delta ^ \ast )}}\quad {\mathrm{and}}\quad {\mathrm{ER}}^ \ast = F( - \beta \Delta ^ \ast )$$
(26)

where p′(x) denotes the derivative of p(x) with respect to x. Because β and Δ* only ever appear together in these expressions, Eq. (26) implies that βΔ* is a constant. Thus, while the optimal difficulty, Δ*, changes as a function of precision (Fig. 1c), the optimal training error rate, ER* does not (Fig. 1d). That is, training with the error rate clamped at ER* is guaranteed to maximize the rate of learning.

The exact value of ER* depends on the distribution of noise, n, in Eq. (2). In the case of Gaussian noise, we have

$$p(x) = \frac{1}{{\sqrt {2\pi } }}\exp \left( { - \frac{{x^2}}{2}} \right)$$
(27)

which implies that

$$\frac{{p(x)}}{{p\prime (x)}} = - \frac{1}{x}$$
(28)

and that the optimal difficulty is

$$\Delta ^ \ast = \beta ^{ - 1}$$
(29)

Consequently the optimal error rate for Gaussian noise is

$${\mathrm{ER}}^ \ast = \frac{1}{2}\left( {1 - {\mathrm{erf}}\left( {\frac{1}{{\sqrt 2 }}} \right)} \right) \approx 0.1587$$
(30)

Similarly for Laplacian noise ($$p(x) = \frac{1}{2}\exp ( - |x|)$$) and Cauchy noise ($$p(x) = (\pi (1 + x^2))^{ - 1}$$) we have optimal error rates of

$$\begin{array}{l}{\mathrm{ER}}_{{\mathrm{Laplace}}}^ \ast = \frac{1}{2}\exp ( - 1) \approx 0.1839\\ {\mathrm{ER}}_{{\mathrm{Cauchy}}}^ \ast = \frac{1}{\pi }\arctan ( - 1) + \frac{1}{2} = 0.25\end{array}$$
(31)
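These three values can be reproduced numerically. The sketch below uses the standard normalized Gaussian, Laplacian, and Cauchy densities, grid-searches over u = βΔ for the maximum of u·p(−u), and evaluates the corresponding cumulative distribution function at −u*. The function names, dictionary layout, and grid settings are illustrative assumptions.

```python
import math


def gaussian_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def gaussian_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def laplace_pdf(x):
    return 0.5 * math.exp(-abs(x))


def laplace_cdf(x):
    return 0.5 * math.exp(x) if x < 0 else 1.0 - 0.5 * math.exp(-x)


def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))


def cauchy_cdf(x):
    return math.atan(x) / math.pi + 0.5


NOISE = {
    "gaussian": (gaussian_pdf, gaussian_cdf),
    "laplace": (laplace_pdf, laplace_cdf),
    "cauchy": (cauchy_pdf, cauchy_cdf),
}


def optimal_error_rate(name, n_grid=10_000, u_max=10.0):
    """Maximize u * p(-u) over u = beta * delta, then return ER* = F(-u*)."""
    pdf, cdf = NOISE[name]
    grid = (u_max * i / n_grid for i in range(1, n_grid + 1))
    u_star = max(grid, key=lambda u: u * pdf(-u))
    return cdf(-u_star)


for name in NOISE:
    print(f"{name}: ER* = {optimal_error_rate(name):.4f}")
```

For all three distributions the maximum falls at βΔ* = 1, so the optimal error rates differ only because the tails of the distributions differ.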

Optimal learning with endogenous noise

The above analysis for optimal training accuracy also applies in the case where the decision variable, h, is corrupted by endogenous, irreducible noise, ϵ, in addition to representation noise, n, that can be reduced by learning; i.e.,

$$h = \Delta + n + \epsilon$$
(32)

In this case we can split the overall precision, β, into two components, one based on representational uncertainty that can be reduced, βn, and another based on endogenous uncertainty that cannot, βϵ. For Gaussian noise, these precisions are related to each other by

$$\frac{1}{{\beta ^2}} = \frac{1}{{\beta _n^2}} + \frac{1}{{\beta _\epsilon ^2}}$$
(33)

More generally, the precisions are related by some function, G, such that β = G(βn, βϵ). Since only n can be reduced by learning, it makes sense to perform stochastic gradient descent on βn such that the learning rule should be

$$\frac{{d\beta _n}}{{dt}} = - \eta \frac{{\partial {\mathrm{ER}}}}{{\partial \beta _n}}\\ = - \eta \frac{{\partial {\mathrm{ER}}}}{{\partial \beta }}\frac{{\partial \beta }}{{\partial \beta _n}}$$
(34)

Note that ∂β/∂βn is independent of Δ so maximizing learning rate w.r.t. Δ means maximizing ∂ER/∂β as before. This implies that the optimal training difficulty will be the same, e.g., 85% for Gaussian noise, regardless of whether endogenous noise is present or not.
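This invariance can be checked numerically for the Gaussian case. The sketch combines the two precisions via Eq. (33), estimates the gradient of the error rate with respect to βn by central differences, and finds the difficulty that maximizes its magnitude. The function names, finite-difference step, and grid are assumptions of this illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def total_precision(beta_n: float, beta_eps: float) -> float:
    """Eq. (33): combine reducible and irreducible Gaussian precisions."""
    return 1.0 / math.sqrt(1.0 / beta_n ** 2 + 1.0 / beta_eps ** 2)


def error_rate(delta: float, beta_n: float, beta_eps: float) -> float:
    """Error rate F(-beta * delta) with the combined precision."""
    return STD_NORMAL.cdf(-total_precision(beta_n, beta_eps) * delta)


def optimal_error_rate(beta_n, beta_eps, step=1e-5, n_grid=5000, delta_max=20.0):
    """Error rate at the delta that maximizes |dER/d(beta_n)|."""
    def gradient_magnitude(delta):
        upper = error_rate(delta, beta_n + step, beta_eps)
        lower = error_rate(delta, beta_n - step, beta_eps)
        return abs(upper - lower) / (2.0 * step)

    grid = [delta_max * i / n_grid for i in range(1, n_grid + 1)]
    delta_star = max(grid, key=gradient_magnitude)
    return error_rate(delta_star, beta_n, beta_eps)


for beta_n, beta_eps in ((2.0, 1.0), (0.5, 3.0), (1.0, 1.0)):
    print(f"beta_n={beta_n}, beta_eps={beta_eps}: "
          f"ER*={optimal_error_rate(beta_n, beta_eps):.4f}")
```

The optimal difficulty itself shifts with the amount of irreducible noise, but the error rate at that difficulty stays near 0.1587.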

Dynamics of learning

To calculate the dynamics of learning we need to integrate Eq. (7) over time. This, of course, depends on the learning gradient, ∂ER/∂β, which varies depending on the noise and whether the error rate or the true decision variable is fixed during training.

In the fixed error rate case, we fix the error rate during training to ERf. This implies that the difficulty should change over time according to

$$\Delta (t) = - \frac{1}{{\beta (t)}}F^{ - 1}({\mathrm{ER}}_{\mathrm{f}})$$
(35)

where F−1() is the inverse cdf. This implies that β evolves over time according to

$$\frac{{d\beta }}{{dt}} = - \eta \frac{{\partial {\mathrm{ER}}}}{{\partial \beta }}\\ = \, \eta \Delta (t)p( - \beta \Delta (t))\\ = - \frac{\eta }{{\beta (t)}}F^{ - 1}({\mathrm{ER}}_{\mathrm{f}})p(F^{ - 1}({\mathrm{ER}}_{\mathrm{f}}))\\ = \frac{{\eta K_{\mathrm{f}}}}{{\beta (t)}}$$
(36)

where we have introduced Kf as

$$K_{\mathrm{f}} = - F^{ - 1}({\mathrm{ER}}_{\mathrm{f}})p(F^{ - 1}({\mathrm{ER}}_{\mathrm{f}}))$$
(37)

Integrating Eq. (36) and solving for β(t) we get

$$\beta (t) = \sqrt {\beta _0^2 + 2\eta K_{\mathrm{f}}(t - t_0)}$$
(38)

where t0 is the initial time point, and β0 is the initial value of β. Thus, for fixed error rate the precision grows as the square root of time with the rate determined by Kf which depends on both the training error rate and the noise distribution. For the optimal error rate we have Kf = p(−1).
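For reference, the step from Eq. (36) to Eq. (38) is a separation of variables. Written out, using only Eq. (36) and the initial condition β(t0) = β0 (the primed integration variables are introduced here for clarity):

$$\begin{array}{l}\beta \,\mathrm{d}\beta = \eta K_{\mathrm{f}}\,\mathrm{d}t\\ {\int_{\beta _0}^{\beta (t)}} \beta \prime \,\mathrm{d}\beta \prime = \eta K_{\mathrm{f}}{\int_{t_0}^t} \mathrm{d}t\prime \\ \frac{1}{2}\left( {\beta (t)^2 - \beta _0^2} \right) = \eta K_{\mathrm{f}}(t - t_0)\end{array}$$

Solving the last line for β(t) and taking the positive root gives Eq. (38).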

In the fixed decision variable case, the true decision variable is fixed at Δf and the error rate varies as a function of time. In this case we have

$$\frac{{d\beta }}{{dt}} = - \eta \frac{{\partial {\mathrm{ER}}}}{{\partial \beta }} = \eta \Delta _{\mathrm{f}}p( - \beta \Delta _{\mathrm{f}})$$
(39)

Formally, this can be solved as

$${\int_{\beta _0}^\beta} {\frac{1}{{p( - \beta \Delta _{\mathrm{f}})}}} d\beta = \eta \Delta _{\mathrm{f}}(t - t_0)$$
(40)

However, the exact form for β(t) will depend on p(x).

In the Gaussian case we cannot derive a closed form expression for β(t). The closest we can get is to write

$${\int_0^{\frac{{\beta {\mathrm{\Delta }}_{\mathrm{f}}}}{{\sqrt 2 }}}} {\exp } (x^2)\mathrm{d}x = {\int_0}^{\frac{{\beta _0{\mathrm{\Delta }}_{\mathrm{f}}}}{{\sqrt 2 }}} {\exp } (x^2)\mathrm{d}x + \frac{{{\mathrm{\Delta }}^2}}{{2\sqrt \pi }}(t - t_0)$$
(41)

For long times, and large β, we can write

$${\int_0}^{\frac{{\beta {\mathrm{\Delta }}_{\mathrm{f}}}}{{\sqrt 2 }}} {\exp } (x^2)\mathrm{d}x < \exp \left( {\frac{{\beta ^2{\mathrm{\Delta }}_{\mathrm{f}}^2}}{2}} \right)$$
(42)

which implies that for long times β grows slower than $$\sqrt {\log t}$$, which is exponentially slower than the fixed error rate case.

In contrast to the Gaussian case, the Laplacian case lends itself to closed form analysis and we can derive the following expression for β

$$\beta = \frac{1}{{{\mathrm{\Delta }}_{\mathrm{f}}}}\log \left( {\exp (\beta _0{\mathrm{\Delta }}_{\mathrm{f}}) + \frac{1}{2}\eta {\mathrm{\Delta }}_{\mathrm{f}}^2(t - t_0)} \right)$$
(43)

Again this shows logarithmic dependence on t indicating that learning is much slower with a fixed difficulty.
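As a sanity check on the Laplacian result, the sketch below integrates the fixed-difficulty dynamics numerically and compares the outcome with Eq. (43). It assumes the gradient-descent form dβ/dt = ηΔf p(−βΔf) with the unit-scale Laplacian density p(x) = exp(−|x|)/2; the function names, the Euler scheme, and the parameter values are illustrative assumptions.

```python
import math


def laplace_pdf(x: float) -> float:
    """Unit-scale Laplacian density, p(x) = exp(-|x|) / 2 (assumed noise model)."""
    return 0.5 * math.exp(-abs(x))


def beta_laplace_closed_form(t, beta0, delta_f, eta, t0=0.0):
    """Closed-form precision from Eq. (43)."""
    inner = math.exp(beta0 * delta_f) + 0.5 * eta * delta_f ** 2 * (t - t0)
    return math.log(inner) / delta_f


def beta_laplace_euler(t, beta0, delta_f, eta, n_steps=100_000):
    """Forward-Euler integration of d(beta)/dt = eta * delta_f * p(-beta * delta_f)."""
    dt = t / n_steps
    beta = beta0
    for _ in range(n_steps):
        beta += dt * eta * delta_f * laplace_pdf(-beta * delta_f)
    return beta


if __name__ == "__main__":
    for t in (10.0, 100.0, 1000.0):
        print(
            f"t={t:7.1f}  "
            f"closed={beta_laplace_closed_form(t, 1.0, 1.0, 0.1):.4f}  "
            f"euler={beta_laplace_euler(t, 1.0, 1.0, 0.1):.4f}"
        )
```

Each tenfold increase in t adds a roughly constant increment to β, which is the logarithmic dependence described above.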

In the case of Cauchy noise we can compute the integral in Eq. (40) and find that β is the root of the following equation

$$\frac{{{\mathrm{\Delta }}_{\mathrm{f}}^2}}{3}\beta ^3 + \beta = \frac{{{\mathrm{\Delta }}_{\mathrm{f}}^2}}{3}\beta _0^3 + \beta _0 + \frac{{\eta {\mathrm{\Delta }}_{\mathrm{f}}}}{\pi }(t - t_0)$$
(44)

For long training times this implies that β grows as the cube root of t. Thus in the Cauchy case, while the rate of learning is still greatest at the optimal difficulty, the improvement is not as dramatic as in the other cases.

Application to the perceptron

To implement the Perceptron example, we assumed that true labels t were generated by a ‘Teacher Perceptron’38 with normalized weight vector, e. Learning was quantified by decomposing the learned weights w into two components: one proportional to e and a second orthogonal to e, i.e.,

$${\mathbf{w}} = |{\mathbf{w}}|\left( {{\mathbf{e}}\cos \theta + {\mathbf{e}}_ \bot \sin \theta } \right)$$
(45)

where θ is the angle between w and e, and e⊥ is the unit vector perpendicular to e in the plane defined by e and w. This allows us to write the decision variable h in terms of signal and noise components as

$$h = |{\mathbf{w}}|\left( {({\mathbf{e}}\cdot {\mathbf{x}})\cos \theta + ({\mathbf{e}}_ \bot \cdot {\mathbf{x}})\sin \theta } \right)\\ = \underbrace {|{\mathbf{w}}|(2t - 1){\mathrm{\Delta }}\cos \theta }_{{\mathrm{signal}}} + \underbrace {|{\mathbf{w}}|({\mathbf{e}}_ \bot \cdot {\mathbf{x}})\sin \theta }_{{\mathrm{noise}}}$$
(46)

where the difficulty Δ = |e · x| is the distance between x and the decision boundary, and the (2t − 1) term simply controls which side of the boundary x is on. This implies that the precision β is proportional to cot θ, with a constant of proportionality determined by the dimensionality of x.
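The decomposition in Eqs. (45) and (46) can be illustrated with a short sketch. The code below is not the simulation code used for the results; the helper names are hypothetical, and it only assumes that e is a unit vector and that w is not exactly parallel to e. It recovers θ and e⊥ from w and e, then splits h = w · x into the signal and noise terms, using the fact that (2t − 1)Δ equals the signed projection e · x.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def decompose(w, e):
    """Return (|w|, theta, e_perp) with w = |w| (e cos(theta) + e_perp sin(theta)), Eq. (45).

    Assumes e has unit length and w is not parallel to e.
    """
    w_norm = norm(w)
    cos_theta = max(-1.0, min(1.0, dot(w, e) / w_norm))
    theta = math.acos(cos_theta)
    residual = [wi - dot(w, e) * ei for wi, ei in zip(w, e)]  # part of w orthogonal to e
    r_norm = norm(residual)
    e_perp = [r / r_norm for r in residual]
    return w_norm, theta, e_perp


def signal_and_noise(w, e, x):
    """Signal and noise terms of the decision variable h, Eq. (46)."""
    w_norm, theta, e_perp = decompose(w, e)
    signal = w_norm * dot(e, x) * math.cos(theta)  # e.x = (2t - 1) * Delta
    noise = w_norm * dot(e_perp, x) * math.sin(theta)
    return signal, noise


if __name__ == "__main__":
    rng = random.Random(0)
    e = [rng.gauss(0.0, 1.0) for _ in range(100)]
    e = [v / norm(e) for v in e]  # normalized teacher weights
    w = [rng.gauss(0.0, 1.0) for _ in range(100)]
    x = [rng.gauss(0.0, 1.0) for _ in range(100)]
    s, n = signal_and_noise(w, e, x)
    print(f"signal={s:.4f}  noise={n:.4f}  h={dot(w, x):.4f}")
```

The two terms sum to w · x, and the ratio of their scales is governed by cot θ, in line with the statement that β is proportional to cot θ.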

In the case where the observations x are sampled from distributions that obey the central limit theorem, the noise term is approximately Gaussian, implying that the optimal error rate for training the Perceptron is ER* = 15.87%.

To test this prediction we simulated the Perceptron learning rule for a range of training error rates between 0.01 and 0.5 in steps of 0.01 (1000 simulations per error rate). Stimuli, x, were 100 dimensional and independently sampled from a Gaussian distribution with mean 0 and variance 1. Similarly, the true weights e were sampled from a mean 0, variance 1 Gaussian. To mimic the effect of a modest degree of initial training, we initialized the weight vector w randomly with the constraint that |θ| < 1.6π. The difficulty Δ was adjusted on a trial-by-trial basis according to

$${\mathrm{\Delta }} = F^{ - 1}({\mathrm{ER}})\lambda \tan \theta$$
(47)

which ensures that the training error rate is clamped at ER. The degree of learning was captured by the precision β.

Application to the two-layer neural network

To implement the two-layer network, we built a sigmoidal neural network with one hidden layer (of 50 neurons) and one output neuron. The weights between the input layer and the hidden layer and between the hidden layer and output layer were trained using the standard Backpropagation algorithm.

In order to clamp the error rate during training we first had to rate the images according to their ‘difficulty’. To this end, we trained a teacher network with the same basic architecture (i.e., 50 hidden units and 1 output unit) until its performance was near perfect (training accuracy = 99.6% for the Parity Task and 99.4% for the Magnitude Task; test accuracy = 97% for the Magnitude Task and 95.6% for the Parity Task). We then used the absolute value of the decision variable from this network, |hteacher|, as a proxy for the true difficulty, Δ, with larger values of |hteacher| indicating stimuli that are easier to classify.

Weights in the network were initialized randomly from a Gaussian distribution (mean 0, variance 1). To achieve a fixed error rate during training, on each trial, we selected a stimulus that was closest to a target difficulty, htarget. This target difficulty was adjusted based on the performance of the network during training—increasing if the network classified the stimulus incorrectly, and decreasing if the network classified the stimulus correctly. More specifically, the target difficulty was adjusted as

$$h^{\mathrm{target}} \leftarrow h^{{\mathrm{target}}} + D\left( {A^{{\mathrm{target}}} - A^{{\mathrm{av}}}} \right)$$
(48)

where D is the step size (=1), Atarget is the target training accuracy and Aav is the running average of the accuracy from the last 50 trials.

On each trial we selected the ‘eligible’ stimulus whose value of hteacher was closest to htarget. To ensure that a given stimulus was not selected too often during training, stimuli were only eligible to be chosen if they had not been used in the last 50 trials.
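The selection procedure described above can be sketched as follows. This is an illustrative reading of the text, not the original implementation: the function names, the `respond` callback (a stand-in for the network's response to stimulus i), and the starting value of htarget are assumptions. The update follows Eq. (48) with D = 1 and a 50-trial window for both the running accuracy and the eligibility rule.

```python
from collections import deque


def update_target(h_target, recent_correct, a_target, step=1.0):
    """Eq. (48): move h_target by step * (A_target - A_av)."""
    a_av = sum(recent_correct) / len(recent_correct)
    return h_target + step * (a_target - a_av)


def select_stimulus(h_teacher_abs, h_target, recently_used):
    """Index of the eligible stimulus whose |h_teacher| is closest to h_target."""
    eligible = (i for i in range(len(h_teacher_abs)) if i not in recently_used)
    return min(eligible, key=lambda i: abs(h_teacher_abs[i] - h_target))


def run_staircase(h_teacher_abs, respond, a_target, n_trials, h_target=1.0, window=50):
    """Select stimuli trial by trial; respond(i) returns True if stimulus i is classified correctly.

    Requires more than `window` stimuli so that an eligible stimulus always exists.
    """
    outcomes = deque(maxlen=window)       # accuracy over the last `window` trials
    recently_used = deque(maxlen=window)  # stimuli that are not yet eligible again
    chosen = []
    for _ in range(n_trials):
        i = select_stimulus(h_teacher_abs, h_target, recently_used)
        outcomes.append(1 if respond(i) else 0)
        recently_used.append(i)
        chosen.append(i)
        h_target = update_target(h_target, outcomes, a_target)
    return chosen, h_target


if __name__ == "__main__":
    difficulties = [k / 100 for k in range(200)]  # hypothetical |h_teacher| values
    chosen, final_target = run_staircase(difficulties, lambda i: i >= 100, 0.85, 300)
    print(chosen[:10], round(final_target, 3))
```

Because larger |hteacher| means an easier stimulus, accuracy below target raises htarget and accuracy above target lowers it.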

Each initial state of the network was trained on either the Parity or Magnitude Task at a fixed training error rate between 5 and 30% in steps of 5%. At the end of training, performance was assessed on the whole test set. This process was repeated 1000 times, with a new set of initial random weights each time.

Application to Law and Gold model

The model of perceptual learning follows the exposition in Law and Gold11. To aid comparison with that paper we retain almost all of their notation, with the three exceptions being their β parameter, which we rename as B to avoid confusion with the precision, their ϕi parameter which we rename as Fi to avoid confusion with the parameters of the learner, and their learning rate parameter α which we write as η.

Following Law and Gold11, the average firing rate of an MT neuron, i, in response to a moving dot stimulus with direction θ and coherence COH is

$$m_i = T(k_i^0 + {\mathrm{COH}}(k_i^n + (k_i^p - k_i^n)f(\theta |\Theta _i)))$$
(49)

where T is the duration of the stimulus, $$k_i^0$$ is the response of neuron i to a zero-motion coherence stimulus, $$k_i^p$$ is the response to a stimulus moving in the preferred direction and $$k_i^n$$ is the response to a stimulus in the null direction. f(θ|Θi) is the tuning curve of the neuron around its preferred direction Θi

$$f(\theta |\Theta _i) = \exp \left( { - \frac{{(\theta - \Theta _i)^2}}{{2\sigma _\theta ^2}}} \right)$$
(50)

where σθ (=30 degrees) is the width of the tuning curve which is assumed to be identical for all neurons.

Neural activity on each trial was assumed to be noisily distributed around this mean firing rate. Specifically the activity, xi, of each neuron is given by a rectified (to ensure xi > 0) sample from a Gaussian with mean mi and variance vi

$$v_i = F_im_i$$
(51)

where Fi is the Fano factor of the neuron.

Thus each MT neuron was characterized by five free parameters. These free parameters were sampled randomly for each neuron such that $$\Theta _i\sim U( - 180,180)$$, $$k_i^0\sim U(0,20)$$, $$k_i^p\sim U(0,50)$$, $$k_i^n\sim U( - k_i^0,0)$$ and $$F_i\sim U(1,5)$$. Note that $$k_i^n$$ is set between −$$k_i^0$$ and 0 to ensure that the minimum average firing rate never dips below zero. Each trial was defined by three task parameters: T = 1 s, θ = ±90 degrees and COH, which was adjusted based on performance to achieve a fixed error rate during training (see below). As in the original paper, the number of neurons was set to 7200 and the learning rate, η, was $$10^{-7}$$.
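The MT population model in Eqs. (49) to (51) can be sketched in a few lines. The code below is a minimal illustration and not the original simulation: the dictionary layout and function names are assumptions, and the tuning curve uses the direction difference exactly as written in Eq. (50), without circular wrapping.

```python
import math
import random

SIGMA_THETA = 30.0  # tuning width in degrees, identical for all neurons


def tuning(theta, pref):
    """Gaussian tuning curve f(theta | Theta_i), Eq. (50)."""
    return math.exp(-((theta - pref) ** 2) / (2.0 * SIGMA_THETA ** 2))


def sample_neuron(rng):
    """Draw the five free parameters of one MT neuron from the stated uniform ranges."""
    k0 = rng.uniform(0.0, 20.0)
    return {
        "pref": rng.uniform(-180.0, 180.0),  # preferred direction Theta_i
        "k0": k0,                            # response at zero coherence
        "kp": rng.uniform(0.0, 50.0),        # preferred-direction response
        "kn": rng.uniform(-k0, 0.0),         # null-direction response
        "fano": rng.uniform(1.0, 5.0),       # Fano factor F_i
    }


def mean_rate(theta, coh, neuron, duration=1.0):
    """Mean firing rate m_i, Eq. (49)."""
    gain = neuron["kn"] + (neuron["kp"] - neuron["kn"]) * tuning(theta, neuron["pref"])
    return duration * (neuron["k0"] + coh * gain)


def sample_activity(theta, coh, neuron, rng, duration=1.0):
    """Rectified Gaussian sample with mean m_i and variance F_i * m_i, Eq. (51)."""
    m = mean_rate(theta, coh, neuron, duration)
    sd = math.sqrt(neuron["fano"] * m)
    return max(0.0, rng.gauss(m, sd))


if __name__ == "__main__":
    rng = random.Random(0)
    population = [sample_neuron(rng) for _ in range(5)]
    for neuron in population:
        print(round(sample_activity(90.0, 0.5, neuron, rng), 2))
```

Since the null-direction response is bounded below by the negative of the zero-coherence response, the mean rate stays non-negative for any coherence between 0 and 1, so the square root in the noise term is always defined.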

The predicted reward E[r] was computed according to Eq. (20). In line with Law and Gold (Supplementary Fig. 2 in ref. 11), the proportionality constant B was computed using logistic regression on the accuracy and absolute value of the decision variable, |h|, from the last L trials, where L = min(300, t).

In addition to the weight update rule (Eq. (21)), weights were normalized after each update to keep the sum of the squared weights, $$\mathop {\sum}\limits_i {w_i^2} = w_{\mathrm{amp}}$$, constant (=0.02). While this normalization has only a small overall effect (see Supplementary Material in ref. 11), we replicate this weight normalization here for consistency with the original model.

To initialize the network, the first 50 trials of the simulation had a fixed coherence COH = 0.9. After this initialization period, the coherence was adjusted according to the difference between the target accuracy, Atarget, and actual accuracy in the last L trials, AL, where L = min(300, t). Specifically, the coherence on trial t was set as

$${\mathrm{COH}}_t = \frac{1}{{1 + \exp ( - \Gamma _t)}}$$
(52)

where Γt was adjusted according to

$$\Gamma _{t + 1} = \Gamma _t + \mathrm{d}\Gamma (A_{{\mathrm{target}}} - A_L)$$
(53)

and dΓ was 0.1.
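The coherence staircase in Eqs. (52) and (53) can be sketched as below. This is an illustrative reading and not the original code: the `respond` callback stands in for the model's trial outcome, and the starting value of Γ after the initialization period (chosen here so that the coherence continues from 0.9) is an assumption, since the text does not specify it.

```python
import math
from collections import deque


def coherence(gamma):
    """Map the unbounded variable Gamma_t to a coherence in (0, 1), Eq. (52)."""
    return 1.0 / (1.0 + math.exp(-gamma))


def update_gamma(gamma, recent_correct, a_target, d_gamma=0.1):
    """Eq. (53): Gamma_{t+1} = Gamma_t + dGamma * (A_target - A_L)."""
    a_l = sum(recent_correct) / len(recent_correct)
    return gamma + d_gamma * (a_target - a_l)


def run_coherence_staircase(respond, a_target, n_trials, n_init=50, coh_init=0.9, max_window=300):
    """Return the coherence used on each trial; respond(coh) is True for a correct trial."""
    outcomes = deque(maxlen=max_window)             # last L = min(300, t) trials
    gamma = math.log(coh_init / (1.0 - coh_init))   # assumption: start Gamma at logit(0.9)
    cohs = []
    for t in range(n_trials):
        if t < n_init:
            coh = coh_init                          # fixed coherence during initialization
        else:
            gamma = update_gamma(gamma, outcomes, a_target)
            coh = coherence(gamma)
        outcomes.append(1 if respond(coh) else 0)
        cohs.append(coh)
    return cohs


if __name__ == "__main__":
    cohs = run_coherence_staircase(lambda coh: True, 0.85, 100)
    print([round(c, 4) for c in cohs[48:54]])
```

Passing Γ through the logistic function keeps the coherence strictly between 0 and 1 however far the staircase moves.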

To estimate the post-training precision parameter, β, we simulated behavior of the trained network on a set of 20 logarithmically spaced coherences between $$10^{-3}$$ and 1. Behavior at each coherence was simulated 100 times and learning was disabled during this testing phase. The precision parameter, β, was estimated using logistic regression between accuracy on each trial (0 or 1) and coherence; i.e.,

$${\mathrm{ACC}}\sim \frac{1}{{1 + \exp ( - \beta \times {\mathrm{COH}})}}$$
(54)
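A minimal sketch of this estimation step is given below. It fits the single parameter β of Eq. (54) by maximum likelihood, solving for the zero of the score function by bisection; this solver, the function names, and the simulated observer in the demo are assumptions made for illustration, and the original analysis may have used a standard logistic regression routine instead.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_spaced(lo_exp, hi_exp, n):
    """n values spaced evenly in log10 between 10**lo_exp and 10**hi_exp."""
    return [10.0 ** (lo_exp + (hi_exp - lo_exp) * k / (n - 1)) for k in range(n)]


def fit_beta(cohs, accs, beta_max=1000.0, n_iter=60):
    """Maximum-likelihood beta for ACC ~ sigmoid(beta * COH), with no intercept term.

    The score sum((acc - sigmoid(beta * coh)) * coh) decreases monotonically in
    beta, so its zero is located by bisection on [0, beta_max].
    """
    def score(beta):
        return sum((a - sigmoid(beta * c)) * c for c, a in zip(cohs, accs))

    if score(0.0) <= 0.0:
        return 0.0
    if score(beta_max) >= 0.0:
        return beta_max
    lo, hi = 0.0, beta_max
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = random.Random(0)
    true_beta = 5.0  # hypothetical precision of a simulated observer
    cohs, accs = [], []
    for c in log_spaced(-3.0, 0.0, 20):   # 20 coherences between 1e-3 and 1
        for _ in range(100):              # 100 simulated trials per coherence
            cohs.append(c)
            accs.append(1 if rng.random() < sigmoid(true_beta * c) else 0)
    print(f"estimated beta = {fit_beta(cohs, accs):.3f}")
```

Omitting the intercept matches the form of Eq. (54), under which accuracy is at chance (0.5) when the coherence is zero.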