## Main

Humans routinely infer relational structure from local comparisons. For instance, learning that boxer Muhammad Ali defeated George Foreman can let us infer that Ali would probably win against other boxers that Foreman had defeated. More formally, generalizing from relational observations to new, unobserved relations (for example, knowing that A > B and B > C leads to A > C) is commonly referred to as transitive inference1,2,3,4. Transitive inference is not a uniquely human capacity5 but can also be observed in non-human primates6,7,8, rats9 and birds10,11,12.

In the laboratory, transitive inference can be observed after teaching participants the relations between neighbouring elements from an ordered set of arbitrary stimuli (Fig. 1a). The neighbour relations are typically taught through pairwise choice feedback (Fig. 1b) where the relational information is deterministic (that is, if A > B, in our sporting analogy, A would never lose a match against B). Various theories have been proposed to describe how observers accomplish transitive inferences of non-neighbour relations (for example, A > D) in such settings. One class of models posits that observers learn implicit value representations for each individual element (A, B, C and so on), which then enables judgements of arbitrary pairings3,13,14. Alternatively, transitive inference could be accomplished through more explicit, hippocampus-based memory processes15,16,17,18, which we will return to below.

Before turning to transitive inference, we consider relational learning in a full-feedback scenario (Fig. 1b) where choice feedback is provided for every possible pairing of items, such that no transitive inference is required. We model implicit value learning in this setting through a simple reinforcement learning (RL) mechanism (Q-learning; Methods) by which relational feedback (for example, ‘correct’ when selecting A over B) may increase the perceived value (Q) of item A and decrease that of item B (Model Q1, Fig. 2a). In this simple RL model, relational feedback symmetrically updates (with opposite signs) the value estimates for both items in a pair. For instance, if Muhammad Ali beat George Foreman, it seems rational to attribute this outcome to Ali’s greater skill as much as to Foreman’s deficit. We show in simulations that symmetric value updating is in fact optimal in the full-feedback setting. An alternative model with asymmetric learning rates (α+ ≠ α) applied to the winner and loser in a pair (Model Q2; ‘2’ denotes dual learning rates) learns worse than the symmetric model (Q1) where α+ = α (Fig. 2b,c). Implicit value learning generally gives rise to a ‘symbolic distance effect’1,19,20, where nearby elements are less discriminable (due to more similar value estimates) than elements with greater ordinal distance14,21.

Next, we turn to a partial-feedback setting, which is the typical transitive inference scenario, with feedback being provided only for pairs of items with neighbouring values (Fig. 1b). Here, the simple RL models (Q1 and Q2) effectively learn about stimuli only at the extremes of the ordered set (for example, A and H; Extended Data Fig. 1a), since these are statistically more likely to be winners or losers than their neighbours (under uniform sampling). No value learning occurs for intermediate items (stimuli B to G), since these are equally likely to be paired with lower- and higher-valued stimuli3. However, the model can easily be adapted to performing transitive inference when extending it with a simple assumption: value updates should scale with the difference between the estimated item values, Q(A) − Q(B) (for similar approaches, see refs. 14,21,22). More specifically, to the extent that A is already higher valued than B, observing the expected outcome A > B should induce weaker value updates, whereas the unexpected outcome A < B should induce stronger updates. To illustrate, observing an unknown amateur boxer win against a world champion should induce stronger changes in belief than the opposite, less surprising result (champion > amateur). When we incorporate this simple assumption into our model (Model Q1*), it learns orderly structured values, Q(A) > Q(B) > … > Q(H), and thus accomplishes transitive inferences for all pairs of items (Fig. 2d; see also Supplementary Video for an illustration of how our Q-learning models accomplish transitive learning). We also observe a symbolic distance effect with this type of learning under partial feedback, similar to what we observed with simple RL under full feedback (Fig. 2a,d).

Notably, the effect of asymmetric learning rates (α+ ≠ α, Model Q2*) under partial feedback is strikingly different from what we observed with full feedback. Under partial feedback, optimal performance is achieved with a strongly asymmetric learning policy (α+α or α+α), where only the winner (or loser) in a pair is updated (Fig. 2e,f and Supplementary Video). In other words, in a setting where hidden relational structure is inferred from only local comparisons, it is surprisingly beneficial to ignore losers (or winners) in outcome attribution. Of note, the winner/loser asymmetry outlined here differs from, and is orthogonal to, previously described asymmetries in learning from positive/negative23,24,25,26 or (dis-)confirmatory outcomes27,28. A noteworthy aspect of our Model Q2* is that the superior, asymmetric learning policy results in a compression of the observer’s latent value structure (Fig. 2f). Selective updating therefore naturally gives rise to diminishing sensitivity towards larger values, as is universally observed in psychophysics29, numerical cognition30,31 and behavioural economics32.

Going beyond typical studies of transitive inference with deterministic outcomes, we examined whether our simulation results generalize to scenarios where relational outcomes can be variable, as is the case in many real-world domains such as sports, stock markets and social hierarchies. To this end, we added random variance to the comparison outcomes such that, for example, an item won over its lower-valued neighbour in approximately 80% of cases but lost in the other 20% (see Methods for the details). Intuitively, we allowed for the possibility that competitor A may sometimes lose against B, even if A is generally stronger. We found that our simulation results held for such probabilistic environments, just as they did for deterministic scenarios (Extended Data Fig. 2).

In the models discussed so far, the observer associates each individual item (A, B, C and so on) with an implicit value (Q(A), Q(B), Q(C), … ; item-level learning). An alternative strategy is to more directly learn response preferences for each individual item pairing (pA>B, pB>C, … ; pair-level learning; Methods). For instance, in our partial-feedback setting (Fig. 1), observers might learn to choose A when comparing A and B, to choose B when comparing B and C, and so forth, even without relying on value estimates for the individual items. In its simplest form, such memory for pairwise preferences (Model P) only allows learning of pair relations that have been directly experienced (that is, only neighbouring pairs in our partial-feedback setting; Extended Data Fig. 1b, left). However, the pair-level memory can also be extended to allow for transitive inference of more distant, never experienced relations8,33,34: when asked to judge, for example, A versus C, observers might ‘chain together’ memories of the linking neighbour preferences (pA>B and pB>C) through associative recall17,35 or spreading activation36 to infer a transitive preference (pA>C; Model Pi; Extended Data Fig. 1b, right). Transitive inference based on such pair-level learning gives rise to an inverse symbolic distance effect (Extended Data Fig. 1b, right), where nearby pairs are more discriminable than more distant pairs, reflecting the high dimensionality of the underlying associative memory structure. In modelling our empirical data, we allow for item-level value learning (models denoted by a Q), pair-level learning (models denoted by a P) and a combination of both, in explaining human transitive inference.

## Results

We report the results of four experiments (n = 145) where we varied whether feedback was full or partial and whether it was probabilistic or deterministic (Methods). In all experiments, the participants were shown a pair of items (drawn from a set of eight) on each trial and were asked to make a relational choice (Fig. 1). The participants were given no prior knowledge about item values and could learn only through trial-and-error feedback.

### Full feedback

In Experiment 1 (Exp. 1; n = 17), probabilistic choice feedback (Methods) was provided after each of 448 sequential pair comparisons (‘full feedback’). Figure 3a shows the mean proportions of correctly choosing the higher-valued item, averaged over all trials in Exp. 1. Descriptively, the choice matrix is dominated by a symbolic distance effect, as predicted by implicit value learning. Fitting our item-level learning models (Q1, Q2, Q1* and Q2*), the best fit to the data is provided by the simplest model (Q1), with a single learning rate for winners and losers (Fig. 3c,e; protected exceedance probability, pxp(Q1) > 0.99; mean Bayesian information criterion (BIC), 361.79 ± 24.68 s.e.m.). In other words, participant behaviour was consistent with a symmetrical updating policy, which our simulations showed to be optimal in the full-feedback setting.

### Partial feedback

In Exps. 2–4, choice feedback was provided only on neighbour pairs (‘partial feedback’) to study transitive inference. In these experiments, we increased the frequency at which the participants were shown neighbouring pairs relative to non-neighbouring pairs to provide more learning opportunities, since the task is inherently harder. We verified that our simulation results were invariant to this modification (Extended Data Fig. 3). Otherwise, the design of Exp. 2 (n = 31) was identical to that of Exp. 1. Experiment 3 (n = 48) was an online replication of Exp. 2, where the pair items on each trial were shown side by side instead of sequentially. Experiment 4 (n = 49) was similar to Exp. 3, but the feedback was made deterministic (100% truthful), as in previous studies of transitive inference (see Methods for details on the individual experiments).

The choice data from each of the partial-feedback experiments (Exps. 2–4, Fig. 3b) showed clear evidence for transitive inference, with above-chance performance for non-neighbouring pairs that never received feedback (mean accuracy averaged over non-neighbour trials: Exp. 2, 0.714 ± 0.028; Exp. 3, 0.698 ± 0.018; Exp 4, 0.709 ± 0.019; Wilcoxon signed-rank tests against chance level (0.5): all P < 0.001, all r > 0.84; see Supplementary Table 1 for the details). Furthermore, the grand mean choice matrices showed the following descriptive characteristics: (1) a symbolic distance effect similar to that observed with full feedback, (2) an asymmetry with greater discriminability of lower-valued items and (3) relatively increased discriminability of neighbour pairs.

The modelling results for the partial-feedback experiments are summarized in Fig. 3d,f (see also Extended Data Figs. 48). We highlight two main findings. First, the partial-feedback data were better described by asymmetric models with different learning rates for winners and losers. This held true at every level of model complexity, with our asymmetric models (Q2, Q2*, Q2* + P and Q2* + Pi) always performing better than their symmetric counterparts (Q1, Q1*, Q1* + P and Q1* + Pi; Wilcoxon signed-rank tests comparing BICs, Exps. 2–4 combined: all P < 0.001, all r > 0.35; see Supplementary Table 2 for details), and regardless of whether the partial feedback was probabilistic (Exps. 2 and 3) or deterministic (Exp. 4; comparison of mean BICs between asymmetric and symmetric models: all P < 0.001, all r > 0.67; see Supplementary Table 3 for the details). In other words, the participants adopted an asymmetric learning policy, which proved superior in our model simulations (Fig. 2e).

Second, behaviour in the partial-feedback scenario was not fully described by item-level value learning alone. The winning model in Exps. 2 and 4 (Q2* + P; pxp, 0.81 and 0.39; mean BIC, 609.15 ± 12.77 and 434.27 ± 6.29) incorporated pair-level learning in addition to the value estimates of the individual items. This pair-level memory (+P; ‘Models’) accounts for the increased performance for neighbouring pairs (Fig. 3b, first off-diagonals; see also Extended Data Fig. 1b). In Exp. 3, the model comparison was less clear, with model Q2* showing the highest pxp (0.41) but model Q2* + P providing a better average fit in terms of BIC (676.86 ± 19.40 versus 692.09 ± 8.61; Wilcoxon signed-rank test: P < 0.001; z = −4.53; n = 48; r = 0.48; 95% confidence interval (CI), 0.29 to 0.63). However, we found no evidence that pair-level memory contributed to transitive inference in our experiments. Incorporating associative recall of ‘linking’ neighbour pairs (+Pi) worsened the model fits, in terms of both pxp (all pxp < 0.07) and BIC (Exps. 2–4 combined; Q2* + Pi, 570.42 ± 15.72; compared with Q2* + P, 567.60 ± 15.38; Wilcoxon signed-rank test: P < 0.001; z = −6.02; n = 128; r = 0.53; 95% CI, 0.37 to 0.69), which is in line with the absence of an ‘inverse’ symbolic distance effect (Extended Data Fig. 1b, right) in the empirical choice data (Fig. 3b).

Figure 4a illustrates how learning of non-neighbour comparisons in our experiments evolved over time. The value compression implied by asymmetric learning of winners (Fig. 2e) predicts relatively better performance for lower-valued pairs (for example, F–H) than for higher-valued pairs (for example, A–C; Fig. 4b, right). We observed no such pattern in Exp. 1 with full feedback (Fig. 4a–c, left). In contrast, participants in Exps. 2–4 with partial feedback showed the critical pattern early on (Fig. 4a–c, right), as predicted by our asymmetric learning models (Fig. 4b, right). Turning to neighbouring pairs (Fig. 4d), which could additionally benefit from pair-level learning (+P; see above), our asymmetric model (Q2* + P) predicts only a modest decline in accuracy for higher-valued pairs (see also Fig. 3d), which also matched the empirical data (Fig. 4d, right).

The winner/loser asymmetries described so far might be explained by alternative learning biases, such as asymmetric learning weights for chosen versus unchosen items37. Given above-chance performance, the chosen item will statistically be more likely to be the winning item. To test this alternative explanation, we repeated our modelling analyses using separate learning rates (α+/α) for chosen/unchosen items instead of for the winning/losing item (Methods and equation (4)). This alternative model fit our partial-feedback data significantly worse (mean BIC collapsed across Exps. 2–4, 587.21 ± 15.91 versus 567.60 ± 15.38; Wilcoxon signed-rank test: P < 0.001; z = −6.09; n = 128; r = 0.54; 95% CI, 0.40 to 0.65), corroborating our interpretation that transitive inference learning was better characterized by asymmetries between winners and losers. Previous RL studies have also highlighted potential differences in learning from positive (confirmatory) as opposed to negative (disconfirmatory) feedback24,25,28. Extending our winning model to incorporate such confirmation bias (Extended Data Fig. 9) improved the overall model fit (mean BIC collapsed across Exps. 2–4, 537.91 ± 14.45 versus 567.60 ± 15.38; Wilcoxon signed-rank test: P < 0.001; z = −8.32; n = 128; r = 0.72; 95% CI, 0.67 to 0.81), which is consistent with previous findings in other learning contexts24,25,28. However, the addition of confirmation bias left the finding of winner/loser asymmetries unchanged (Extended Data Fig. 9 and Fig. 3f), thus illustrating the robustness of our results.

We also compared our model family against two previous models of transitive inference (Supplementary Methods): a classic value-transfer model (VAT)21 and a more recent model based on ranking algorithms used in competitive sports such as chess (RL-ELO)22. Both VAT and RL-ELO were outperformed by our winning model Q2* + P when fitted to our partial-feedback data (Exps. 2–4 combined; mean BIC VAT, 606.59 ± 14.04; RL-ELO, 617.96 ± 15.07; Q2* + P, 567.60 ± 15.38; Wilcoxon signed-rank tests versus Q2* + P: both P < 0.001, both r > 0.65). This held true even when we modified VAT and RL-ELO to include pair-level learning (+P) and separate learning rates for winners and losers (mean BIC, 572.95 ± 15.17 and 576.34 ± 15.94, respectively; both P < 0.02, both r > 0.21; see Supplementary Table 4 for detailed model comparison results). Our asymmetric Q-learning process thus explains the experimental data better than these earlier models of transitive inference.

Our model simulations (Fig. 2e) indicate two aspects of asymmetric learning that are not directly evident from the group-level results shown in Figs. 3 and 4. First, performance benefits under partial feedback emerged not only for selective updating of winners but likewise for selective updating of losers. Second, performance was highest for extreme asymmetries where the loser (or winner) in a pair was not updated at all. We examined these aspects more closely on the individual participant level (Fig. 5). Half of the participants in Exps. 2–4 (n = 64) were indeed characterized by extreme asymmetry towards winners (with α near zero). However, another subgroup (n = 15) showed the opposite, an extreme asymmetry towards losers (with α+ near zero). In other words, in the partial-feedback setting, most individuals showed an extreme bias towards winners or losers, either of which proved to be an optimal policy in our model simulations (Fig. 2e and Extended Data Fig. 2, right). In contrast, we found less substantial asymmetries under full feedback (Exp. 1) when allowing the learning rates for winners and losers to vary freely (that is, using model Q2 instead of the winning model Q1). Statistical analysis confirmed that the asymmetries under full feedback (Exp. 1) were significantly lower than under partial feedback (Mann–Whitney U-test of absolute asymmetry indices collapsed over Exps. 2–4 (n = 128) compared with Exp. 1 (n = 17): P = 0.007; z = −2.69; r = 0.22; 95% CI, 0.04 to 0.39; Methods).

A potentially surprising observation in the subgroup of participants in Exps. 2–4 who selectively updated winners (Fig. 5, right) is a tendency for below-chance performance for relatively high-valued non-neighbours (for example, A–D) despite each individual performing robustly above chance overall (‘Participants’). A potential explanation is that participants may sometimes have confused the two pair items in working memory at the time of feedback (Fig. 1b, right). Such memory confusions would result in the items occasionally being updated with the incorrect learning rate and the incorrect sign (‘Models’, equation (4)). Under an updating policy that ignores losers, the losing items would then be updated only in error (and always incorrectly), resulting in a negative net learning rate for losers. Indeed, repeating our analysis while allowing for negative values of α+ and/or α yielded a small but significant improvement in model fit (mean BIC, 562.52 ± 15.23 compared with 567.60 ± 15.38; Wilcoxon signed-rank test: P < 0.001; z = −5.58; n = 128; r = 0.50; 95% CI, 0.36 to 0.62). More specifically, in the n = 64 participants who selectively updated winners (that is, with a positive α+; mean = 0.063 ± 0.007), α estimates were weakly negative (mean = −0.009 ± 0.0016; Wilcoxon signed-rank test against zero: P < 0.001; z = 4.95; n = 64; r = 0.64; 95% CI, 0.47 to 0.78). Memory confusions may thus explain the systematically false inferences about certain item pairs (Fig. 5, right). Together, these findings are consistent with a strongly asymmetric learning mechanism that is also prone to occasional memory errors.

To summarize our empirical findings, when transitive relations could be inferred only from local comparisons (Exps. 2–4), human learning was characterized by an asymmetric outcome attribution to either winners or losers, which proved to be surprisingly optimal in model simulations. In contrast, a symmetric attribution of relational outcomes emerged in a setting where all pair relations could be directly experienced (Exp. 1), and for which our simulations identified symmetric updating to be the most efficient.

## Discussion

Reasoning about the relationships between arbitrary pairings of items is a key component of human intelligence. Through simulations, we showed how different learning regimes perform better in full- and partial-feedback contexts. Under full feedback, the best learning model used symmetric learning to update the value estimates for the winning and losing items in opposite directions, with the same magnitude. However, under partial feedback (only for neighbouring items), the best learning model used asymmetric learning to only update the value representations for either the winner or the loser. Across four experiments, we found robust evidence that human learners used the best learning rule to match their feedback context. The participants used symmetric learning under full feedback (Exp. 1) and asymmetric learning under partial feedback (Exps. 2–4). While our asymmetric models allowed for a wide range of possible learning rate combinations, a majority of the participants showed one-sided learning, where value representations were only updated for either winners or losers.

An important feature found both in our model simulations and participant behaviour is a compression of the emerging implicit value structure, which results in a systematic decrease in discriminability of higher-valued items (Fig. 2f and Fig. 4). This resembles the Weber–Fechner law in psychophysics29, where sensitivity to stimulus differences diminishes with increasing magnitude (see also refs. 32,38). While there exist alternative theoretical accounts for this ubiquitous phenomenon39, our findings add a new perspective: compressed representations of magnitude emerge naturally from a learning policy that is optimized for inferring global relationships from local comparisons. From this perspective, subjective compression might not only reflect an efficient adaptation to the distribution of stimuli in the environment40,41,42 but could also result from learning policies that enhance transfer to novel relationships.

In other contexts, previous RL studies have discovered different types of learning asymmetries, such as between positive and negative24,25 or confirmatory and disconfirmatory outcomes28. The one-sided learning policy highlighted here in the context of transitive inference is orthogonal to these other asymmetries but may play a similar role in leveraging a biased but advantageous learning strategy (see also refs. 27,43,44). Unlike with ‘optimal’ cognitive biases reported previously45,46,47,48,49, we did not find the benefit of the present learning asymmetries to emerge from general limitations (noise) in decision-making (Extended Data Fig. 10). We speculate that human learners may adopt the present biases more strategically, in settings where the availability of only sparse feedback presages the requirement of future inferential judgements.

Previous theories have proposed richer and more complex cognitive mechanisms for transitive inference, often with an emphasis on the key role of the hippocampus in representing relational knowledge15,50. Early research appealed to the idea that individuals used spatial representations to learn ordered value sequences1,8,51. More recently, various models have been proposed that use associative learning mechanisms to describe how interactions between episodic memories in the hippocampus can generalize relational knowledge from local to distant comparisons17,52. In our present experiments, we found no evidence for transitive inference through such ‘associative linking’ and failed to observe its key empirical prediction (an inverse symbolic distance effect; Fig. 3b and Extended Data Fig. 1b, right). We show instead that simpler mechanisms of value learning21,53,54 combined with clever biases (that is, asymmetric learning rates) can be sufficient for performing transitive inference and for accurately describing human learners.

While our model explains transitive inference via learning of individual item values, our findings do not preclude the emergence of a more explicit (for example, map-like) mental model of the items’ relational structure55 or the possibility that participants develop direct action policies for each pairing56 after learning progress. A related question for future work is to what extent learning from relational feedback may transfer to comparisons with new items that were not contained in the learning set.

To summarize, we report evidence for pronounced asymmetries in transitive relational learning, where observers selectively update their beliefs only about the winner (or the loser) in a pair. Although asymmetric learning yields distorted value representations, it proves beneficial for generalization to new, more distant relationships. This biased learning regime thus seems well adapted for navigating environments with relational structure on the basis of only sparse and local feedback.

## Methods

### Participants

The participants in Exps. 1 and 2 were recruited from a participant pool at the Max Planck Institute for Human Development. Of these, n = 20 participated in Exp. 1 (13 female, mean age 27.15 ± 3.91 years), and n = 35 participated in Exp. 2 (14 female, 27 ± 3.80 years). The participants in Exps. 3 and 4 were recruited online via Prolific Academic (www.prolific.co), with n = 76 completing Exp. 3 (23 female, 24.73 ± 5.40 years) and n = 60 completing Exp. 4 (23 female, 25.92 ± 4.54 years). The participants in Exps. 1 and 2 received compensation of €10 per hour and a bonus of €5 depending on performance. Payment in Exps. 3 and 4 was £4.87 (£1.46 bonus) and £3.75 (£1.12 bonus), respectively. We obtained written informed consent from all participants, and all experiments were approved by the ethics committee of the Max Planck Institute for Human Development.

Participants who did not reach above-chance learning levels were excluded from analysis. The threshold for inclusion was set to 60% correct judgements in the last two blocks of the experiment, which corresponds to a binomial test probability of P < 0.01 compared with chance level (50%). After exclusion, n = 17 (Exp. 1), n = 31 (Exp. 2), n = 48 (Exp. 3) and n = 49 (Exp. 4) participants remained for analysis.

### Stimuli, task and procedure

In Exps. 1 and 2, eight pictures of everyday objects and common animals were used as stimuli (Fig. 1a). In Exps. 3 and 4, we included 12 additional pictures of objects and animals and selected for each participant a new subset of 8 images as stimuli. An additional set of 8 pictures was used for instructions and practice purposes in each experiment. All images were from the BOSS database57, with the original white background removed.

All experiments involved learning the latent relations between the eight stimuli (A > B > C > D > E > F > G > H) through pairwise choice feedback, where the latent value structure was pseudo-randomly assigned to the pictures for each participant. On each trial, a pair of pictures was presented, and the observers were asked to choose the higher-valued stimulus (two-alternative choice with time-out). All possible stimulus pairings (7 neighbouring and 21 non-neighbouring) were randomly intermixed across trials, with randomized ordering of the elements in a pair (for example, A–B or B–A). Prior to all experiments, the participants were given written instructions and were asked to complete two brief practice blocks to become familiar with the task.

#### Experiment 1 (full feedback, n = 17)

On each trial in Exp. 1, two items were presented one after the other at fixation (0.5 s per item) with an inter-stimulus interval of 2–3 s (randomized). After the second item, Arabic digits ‘1’ and ‘2’ were displayed to the left and right of fixation (the positions were randomized across trials), and the participants were asked to choose the higher-valued item by pressing the corresponding arrow key (left or right) within 2 s. A written feedback message (‘great’ for correct responses, ‘incorrect’ for errors) was shown after each choice (neighbouring and non-neighbouring pairs). The items’ latent values in Exp. 1 were probabilistic (with a Gaussian distribution) and designed such that feedback was truthful on approximately 80% of neighbour trials (probabilistic feedback). Each participant performed 448 learning trials with all possible stimulus pairings (n = 56) presented in each of eight consecutive blocks. Experiments 1 and 2 were conducted in lab, using Psychophysics Toolbox Version 3 (ref. 58) running in MATLAB 2017a (MathWorks).

#### Experiment 2 (partial feedback, n = 31)

The design of Exp. 2 was nearly identical to that of Exp. 1, but choice feedback was given only after neighbouring pairs. After non-neighbouring pairs, a neutral ‘thank you’ message was displayed instead. Neighbouring pairs were presented more often (2.5 times as often as non-neighbouring pairs), resulting in 616 trials (presented in 8 blocks of 77). In Exp. 2, we additionally recorded EEG, and the participants performed a brief picture-viewing task before the experiment. These data were collected for the purpose of a different research question and are not reported here.

#### Experiment 3 (partial feedback, n = 48)

The basic design of Exp. 3 was identical to that of Exp. 2, except for the following changes. Both pair items were displayed simultaneously on the screen for 2.5 s, one to the left and the other to the right of a centred fixation cross. The participants were instructed to quickly select the higher-valued item using the left or right arrow key. After neighbouring pairs, a feedback message (‘win’ or ‘loss’) was presented. After non-neighbouring pairs, no feedback message was shown. Experiments 3 and 4 were programmed in PsychoPy v.2020.1.3 (ref. 59) and conducted online (pavlovia.org), with intermittent attention checks.

#### Experiment 4 (partial feedback, deterministic, n = 49)

The design of Exp. 4 was identical to that of Exp. 3, but feedback was always truthful (deterministic feedback). As learning expectedly proceeds faster with deterministic feedback, neighbouring pairs were presented only two times as often as non-neighbours, and we reduced the number of trials to 420 (presented in 6 blocks of 70 trials).

### Models

#### Item-level learning

To model how observers update their value estimates about the winning item i and the losing item j after relational feedback, we assume a simple delta rule60 (Model Q1):

$$Q_{t + 1}(i) = Q_t(i) + \alpha \left[ {1 - Q_t(i)} \right]$$
(1a)
$$Q_{t + 1}(j) = Q_t(j) + \alpha \left[ { - 1 - Q_t(j)} \right]$$
(1b)

where Qt is the estimated item value at time t, and α is the learning rate.

Transitive inference is enabled by a modified updating rule (similar to refs. 14,22) based on the relative difference dt(i, j) between the value estimates for the winner i and the loser j in a pair:

$$d_t(i,j) = \eta \left[ {Q_t(i) - Q_t(j)} \right]$$
(2)

where η is a scaling factor. Value updating is then moderated by the extent to which feedback is consistent (or inconsistent) with dt(i, j) (Model Q1*):

$$Q_{t + 1}(i) = Q_t(i) + \alpha \left[ {1 - d_t(i,j) - Q_t(i)} \right]$$
(3a)
$$Q_{t + 1}(j) = Q_t(j) + \alpha \left[ { - 1 + d_t(i,j) - Q_t(j)} \right]$$
(3b)

for the winning item i and the losing item j, respectively. Note that equation (1) is a special case of equation (3) when η = 0.

We can allow asymmetric updating of winners and losers by introducing separate learning rates, α+ and α (Models Q2 and Q2*):

$$Q_{t + 1}(i) = Q_t(i) + \alpha ^ + \left[ {1 - d_t(i,j) - Q_t(i)} \right]$$
(4a)
$$Q_{t + 1}(j) = Q_t(j) + \alpha ^ - \left[ { - 1 + d_t(i,j) - Q_t(j)} \right]$$
(4b)

where the winning item i is updated via α+, and the losing item j is updated via α.

To convert the value estimates from item-level learning into pairwise choice probabilities for any two items i and j, we use a logistic choice function to define the probability of choosing i > j on the basis of the difference between the estimated item values:

$${\mathrm{CP}}_{{\mathrm{item}},\,t} = \frac{1}{{1 + {\mathrm{exp}}( - (Q_t(i) - Q_t(j))/\tau _{{\mathrm{item}}})}}$$
(5)

where τitem is the (inverse) temperature parameter controlling the level of decision noise in choices based on item-level learning.

#### Pair-level learning

For the partial-feedback scenario, we also define an alternative learning model (Model P) that learns pairwise preferences between neighbouring items (rather than the individual items’ values). For each neighbouring pair n (1…7), we can describe the preference between its members (for example, pA>B) probabilistically in terms of a beta distribution:

$$p_n \sim {\mathrm{Beta}}(U_n,\,L_n)$$

Following truthful feedback (for example, ‘correct’ when A > B was chosen), the upper value of the beta distribution is updated, increasing the preference in favour of the higher-ranking pair member:

$$U_{n,t + 1} = U_{n,t} + \gamma$$
(6a)

whereas following untruthful feedback (only in experiments with probabilistic feedback; see Exps. 2 and 3), the lower value is updated, reducing the preference for the higher-ranking member:

$$L_{n,t + 1} = L_{n,t} + \gamma$$
(6b)

with γ acting as a learning rate. We can thus define the learned neighbour preference at time t on the basis of the expectation of the beta distribution (Model P):

$$p_{n,t} = \frac{{U_{n,t}}}{{U_{n,t} + L_{n,t}}}$$
(7)

where pn,t = 0.5 reflects indifference, and values of pn,t larger (or smaller) than 0.5 reflect a preference for the higher (or lower) ranking pair member. While this mechanism can learn the relations between neighbouring items under partial feedback, it fails to learn the relations between non-neighbouring items, for which there is no direct feedback signal. However, transitive inference of preferences between non-neighbouring items is possible through associative recall of those neighbour preferences that ‘link’ the two non-neighbour items in question. To allow for this possibility, we define the inferred preference between any two items i and j via the set M of intermediate neighbour preferences $$p_{n,t}$$ separating i and j (Model Pi):

$$p_{i > j,t} = \frac{{\mathop {\sum}\nolimits_{p_{n,t} \in M} {(p_{n,t} - 0.5)} }}{{|i - j|^{\lambda + 1}}} + 0.5$$
(8)

where |i − j| is the rank distance between the items’ true values, and λ is a free parameter reflecting failure to retrieve linking pair preferences in the range [0, ∞]. If λ = 0, preferences between non-neighbours will be a lossless average of all intermediate neighbour preferences (that is, perfect memory). As λ grows, the preference between non-neighbours will shrink to indifference with increasing distance between j and i. In other words, this model performs perfect transitive inference if λ = 0 and no transitive inference as λ → ∞. Note that for neighbour pairs (where |i − j| = 1), equation (8) is equivalent to equation (7).

We again use a logistic choice rule to define the probability of choosing item i over j on the basis of pair preference pi>j,t subject to decision noise τpair:

$${\mathrm{CP}}_{{\mathrm{pair}},\,t} = \frac{1}{{1 + {\mathrm{exp}}( - p_{i > j,t}/\tau _{{\mathrm{pair}}})}}$$
(9)

From equations (6)–(9), we constructed alternative models incorporating basic pair-level learning (Model P, where λ is fixed at a large value) and pair-level transitive inference (Model Pi, where λ is a free parameter).

To combine item-level (equations (1)–(5)) and pair-level (equations (6)–(9)) learning, we assume that choices are triggered by whichever of the two models provides a stronger preference on a given trial. Choices are thus based on item-level learning (CPitem) if:

$$|{\mathrm{CP}}_{{\mathrm{item}},\,t} - 0.5| > |{\mathrm{CP}}_{{\mathrm{pair}},\,t} - 0.5|$$
(10a)

and are based on pair-level learning (CPpair) if:

$$|{\mathrm{CP}}_{{\mathrm{item}},\,t} - 0.5| < |{\mathrm{CP}}_{{\mathrm{pair}},\,t} - 0.5|$$
(10b)

This effectively implements a mixture of item- and pair-level learning.

#### Model space

From equations (1)–(10), we constructed a nested model space (Extended Data Fig. 1c) with either one or two learning rates (1, symmetric; 2, asymmetric updating; equation (4)). One set of models allows for simple item-level RL only (Models Q1 and Q2) or additionally for item-level transitive inference (Models Q1* and Q2*, equations (1)–(5)). Alternative models (equations (6)–(9)) incorporated pair-level learning (Model P) and pair-level inference (Model Pi). Mixture models (equation (10)) combined item-level and pair-level learning (Q1* + P, Q2* + P, Q1* + Pi and Q2* + Pi). Technically, all models under study were derived from the most flexible model, Q2* + Pi, with individual parameter restrictions (for example, γ = 0 yields Model Q2*, or α+ = α yields symmetric updating).

#### Performance simulations

We simulated the performance of our item-level learning models (Q1, Q2, Q1* and Q2*) in tasks akin to those used in the human experiments, with full and partial feedback (Fig. 2 and Extended Data Figs. 1 and 2). The performance simulations were run in MATLAB R2020a (MathWorks). The models were initialized with flat priors about the item values (all Q1(i) = 0—that is, the first choice was always a random guess with CP1 = 0.5). As in the human experiments, choice feedback was provided either for all pairs (full feedback) or only for neighbour pairs (partial feedback). We simulated model performance over a range of learning rates (α+ and α, 0 to 0.1 in increments of 0.001). Relational difference-weighting (η) was set to either 0 (Models Q1 and Q2) or 8 (Models Q1* and Q2*), and decision noise (τitem) was set to 0.2 and 0.04 (full and partial feedback), which resembles the noise levels estimated in our human observers in the respective experiments. Mean choice probabilities (for example, Fig. 2a, bottom) and performance levels (for example, Fig. 2b) were simulated using the same number of trials and replications (with a new trial sequence) as in the respective human experiments. Simulation results under partial feedback (Fig. 2e and Extended Data Figs. 2 and 3) were qualitatively identical when inspecting performance on non-neighbouring pairs only.

#### Parameter estimation and model comparison

Model parameters were estimated by minimizing the negative log-likelihood of the model given each observer’s single-trial responses (from all trials in the experiment) across values of the model’s free parameters (within bounds (lower;upper): α/α+/α(0;0.2), η(0;10), τitem(0;1), γ(0;1), λ(0;100), τpair(0;1), with a uniform prior). The best-fitting parameter estimates are shown in Extended Data Fig. 4. Model fitting was performed in R (ref. 61). Minimization was performed using a differential evolution algorithm62 with 200 iterations. We then computed the BIC of each model for each participant and evaluated the models’ probability of describing the majority of participants best (pxp)63. In Fig. 3e,f, we also provide a pseudo-R2 computed as R2 = 1 − (BICmodel/BICnull), which quantifies goodness of fit relative to a null model of the data, with larger values indicating better fit (similar to ref. 64). Model comparisons for Exp. 1 (full feedback) were restricted to item-level learning models, as the availability of direct feedback for every pairing would equate pair-level learning models (P and Pi) to homogenous learning of all pairs, obviating contributions from transitive inferences.

To quantify model-estimated asymmetry (Fig. 5), we computed an index of the normalized difference in learning rates, A = (α+ − α)/|(α+ + α)|, which ranges from −1 (updating of losers only) to 1 (updating of winners only), with A = 0 indicating symmetric updating. For comparison between full- and partial-feedback experiments, we contrasted the absolute |A| estimated from the winning model in Exps. 2–4 (Q2* + P, see Fig. 3f) with that estimated from Model Q2 in Exp. 1.

#### Model and parameter recovery

To establish whether the individual models can be distinguished in model comparison, we simulated, for each participant and model, 100 experiment runs using the individuals’ empirical parameter estimates under the respective model. We then fitted the generated datasets (binomial choice data) with each model and evaluated how often it provided the best fit (in terms of BIC). This way, we estimated the conditional probability that a model fits best given the true generative model (p(fit|gen)). However, a metric more critical for evaluating our empirical results is p(gen|fit), which is the probability that the data was generated by a specific model, given that the model was observed as providing the best fit to the generated data65. We compute this probability using Bayes’s theorem, with a uniform prior over models (p(gen)):

$$p({\mathrm{gen}}|{\mathrm{fit}}) = \frac{{p({\mathrm{fit}}|{\mathrm{gen}})p({\mathrm{gen}})}}{{\mathop {\sum }\nolimits_{{\mathrm{sim}} = 1}^{n{\mathrm{Models}}} p({\mathrm{fit}}|{\mathrm{gen}})_{{\mathrm{sim}}}p({\mathrm{gen}})_{{\mathrm{sim}}}}}$$

To mimic the level of inference in our human data fitting, we examined mean p(fit|gen) and p(gen|fit) on the experiment level, on the basis of full simulations of all participants in Exp. 1 (full feedback) and Exp. 2 (partial feedback). Critically, under partial feedback (Exps. 2–4), all our models were robustly recovered with this approach (Extended Data Fig. 5).

Under full feedback (Exp. 1), human participant behaviour was best characterized by symmetric learning rates (α+ ≈ α), even when both learning rates were free parameters (Figs. 3e and 4). To test whether we could have detected asymmetric learning had it occurred in Exp. 1, we enforced asymmetry in the simulation by setting α to values near zero (by drawing from a rectified Gaussian with µ = 0 and σ = 0.01). We likewise enforced difference-weighted updating (η > 0) when simulating Model Q2*, by setting η to similar levels as empirically observed in the partial-feedback experiments (µ = 3 and σ = 0.5). With this, the model recovery for Exp. 1 successfully distinguished between symmetric (Q1 and Q1*) and asymmetric learning models (Q2 and Q2*; Extended Data Fig. 6). However, models with difference-weighted updating (Q1* and Q2*; equations (2) and (3)) were partly confused with Models Q1 and Q2. In other words, our empirical finding of Q1 as the winning model in Exp. 1 (Fig. 3e) does not rule out the possibility of Q1* as the generative process under full feedback.

To establish whether our inferences about model parameters (for example, Fig. 5) are valid, we simulated choices under partial feedback (Exp. 2) using our winning model (Q2* + P). Choice datasets were simulated using each participant’s empirical parameter estimates and iteratively varying each parameter over 20 evenly spaced values within the boundaries used in ‘Parameter estimation and model comparison’ (see above). We then fit the model to the simulated datasets and examined the correlations between generative and recovered parameters (Extended Data Figs. 7 and 8). All fitted parameters correlated most strongly with their generative counterparts (min 0.59, max 0.93), while correlations with other generative parameters were generally weaker (min −0.44, max 0.43).

### Statistical analyses

The behavioural and modelling results were analysed using non-parametric tests (two-sided), as detailed in the Results. In the case of multiple tests, the maximum P value (uncorrected) is reported in the main text, while the individual test results are detailed in Supplementary Tables 14.

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.