## Introduction

All animals must learn latent features of their environments, those that must be inferred from observations. Many studies in psychology and related disciplines investigate how animals can learn associations between features and rewards. While perceptible features in the environment can be observed and directly reinforced by the outcomes of an individual’s choices, latent features in environments cannot be so easily reinforced—they must first be learned1,2,3,4,5,6,7,8,9,10. These latent features often capture the statistical structure of situations in the environment across stimuli, actions, time, space and a range of variables internal to the organism such as mood or cognitive state11. Many contexts involve learning latent features on the basis of observable ones, ranging from the mundane, such as determining the ripeness of a fruit, the latent feature, from its observable color, smell, or softness12, to the esoteric, such as how to defeat a video game, where players have to learn each level’s structure, the latent features, from different sequences of button presses and observable outcomes13. Despite the relevance to many aspects of cognition, how humans and other animals learn these latent features is only recently a focus of research in psychology and neuroscience.

Latent features are often learned from the outcomes of choices. Outcome-based learning of latent features is typically studied by using rewards such as food or water to provide feedback that can be used to augment behavior14,15,16. In the real world, however, many behaviors go momentarily unrewarded and yet learning still occurs17. Focusing only on recent rewards, then, fails to fully capture learning. Besides reward, the outcomes of choices also provide information, here understood as changes in the probabilities of observable features or causes appearing in the environment. This information can also be used for learning; specifically, the acquisition of information can help identify a latent feature. Since knowledge of latent features can be critical for developing and applying efficient strategies to obtain reward and avoid aversive outcomes in the long-term, a quest for information can itself be a motivating force for behavior.

Information seeking must often be balanced against reward seeking. Examples of such reward-information trade-offs are rampant in human and animal environments. For example, long-tailed macaques will sometimes forego eating observable rewards to uncover and consume hidden ones first before returning to eat the observable ones18. Humans will change their choices to reflect their epistemic uncertainty about the environment, selecting options that will provide information when they know they will have the opportunity to use that information to make further choices, even if prospective reward is less19. Understanding how outcomes from rewards or from information impact learning is consequently important for understanding how learning proceeds in the messy complexity of many environments.

Learning about a latent feature can involve gathering reward or information from many decisions across different timescales. On a short timescale, such as individual choices, reward or information can reinforce actions and contingencies that have recently occurred. On a longer timescale, such as sequences of choices or even multiple episodes, linking together multiple outcomes can reinforce actions and contingencies that are statistically related. In general, recent attempts to study information gathering for learning latent features have suffered from designs that either cannot fully distinguish between learning from recent reward and expected information outcomes despite requiring many decisions20,21,22,23, fail to investigate gathering information that can be used to learn latent features24,25,26,27,28,29, or fail to investigate how sequences of decisions are used to gather information30,31.

Foraging theory provides a conceptual framework used to understand how animals gather resources32. Foragers can search for different types of resources, including external ones like rewards32, internal ones like memories33 or topics to study34, or abstract ones like information35. Foraging theory outlines a set of basic principles that describe optimal search36,37,38. While reward foraging is well-validated across the phylogenetic spectrum39, this optimal search often assumes knowledge of properties of environments, such as latent features like optimal resource intake rates, that must first be learned40. A possible strategy for learning such latent features is to forage first for information. While information foraging, such as on the internet41, is well-documented, how information foraging impacts learning of latent features is unknown.

To understand the role of reward and information outcomes and foraging in learning latent features over multiple decisions, we developed a novel behavioral paradigm based on the board game Battleship. On each trial, participants started with a grid of unchosen tiles, selected tiles to reveal whether a piece of a shape is hidden beneath the tile, and ended trials when all of a hidden shape’s tiles had been revealed. Revealing a filled or empty square provided information about both the current trial’s shape as well as about the set of possible shapes that could occur across trials. On the current trial, revealing a filled or empty square provides evidence for the hidden shape. Over many trials, participants learn which shapes out of thousands of possibilities could occur. In our task, information (defined terms of changes in the probabilities of different shapes) expected to be gleaned during trials is partly decorrelated from the rewards recently earned from turning over tiles (points for humans or squirts of juice for monkeys). Hence, the effect of expected information and recent reward outcomes on patterns of choices can be disambiguated.

To investigate the pursuit of information in the interest of learning, we had both humans and monkeys perform our task. Here we report on behavior observed in our task; a modeling study will be published separately. We observed that participants from both species were able to learn to reveal shapes. Analysis of choice behavior suggests that tiles that were expected to be informative about the hidden shape tended to be selected earlier in a trial, whereas tiles that had been rewarding in the recent past tended to be selected later. Importantly, over multiple choices, exploratory analysis showed that both humans and monkeys tended to abandon local searches in one part of the grid to jump to a new part when their most recent outcome provided less information than the average for the environment, a type of foraging behavior previously observed across species for gathering rewards. Furthermore, the degree to which the patterns of choices of human participants matched a foraging pattern predicted how quickly shapes were learned. After learning, humans abandoned this pattern of information foraging in favor of gathering rewards. In contrast, monkeys continued to forage for information late in sessions. In sum, evidence from behavior on our task suggests both humans and monkeys searched for information to learn shapes and used foraging computations to decide where to sample spatially on the grid to gather information.

## Results

To study how animals learn latent features over multiple decisions, we designed a shape search task. The task contained thousands of possible latent features—the hidden shapes, sets of connected filled tiles at a location—and identification of these features could facilitate more rapid successful search. The task bears similarity to the board game ‘Battleship’. On every trial, participants searched for one of five shapes (Fig. 1A) hidden on a 5 × 5 tile grid, and locations on the grid could be chosen to reveal either a filled or empty tile. The shapes partly overlapped; for example, the ‘H’ shape overlapped with the backwards-‘L’ shape at the bottom row, middle tile (Fig. 1A, bottom). Hence, uncovering a piece of the shape did not always reveal the identity of the shape for that trial.

Participants (humans: n = 42, who were not instructed as to the number or location of shapes before the task; monkeys: n = 2) uncovered the shape over multiple choices by selecting tiles (Fig. 1A). At trial start, participants made a movement to a target at the center of the screen (humans: mouse-over; monkeys: saccade) until it disappeared after a variable delay. Participants then had unlimited time to choose a target. After a choice, if participants uncovered a filled tile (‘hit’), then they received a reward (points for humans, juice for monkeys) before proceeding to the next choice. If participants failed to uncover a filled tile (‘miss’), then they proceeded to the next choice with no reward. After each choice outcome and an inter-choice interval, the fixation point reappeared, and the sequence repeated until the shape was fully revealed (see “Methods” section). Trials ended once all filled tiles that were part of the shape were uncovered. A trial, then, is the set of choices and outcomes from initial fixation with a fully occluded shape to the last choice that finished revealing the shape.

To visualize performance, we plotted the proportion of choices that maximized expected rewards as a function of trial number (humans: 45 min session with as many trials performed as possible, mean ± s.e.m. = 67.69 ± 3.06 trials; M1 and M2: untimed sessions concatenated across 7 days, M1 = 1015 trials, M2 = 1288 trials). At each choice, some of the five shapes remain and, since the shapes overlapped, participants that choose tiles with the most overlap among the remaining shapes will maximize expected rewards. This strategy corresponds to the Bellman optimal policy for selecting squares14. The proportion of choices that maximized reward increased over sessions for both species, with human participants faster than monkeys in stabilizing the proportion of choices that maximized reward per trial around 70% (Fig. 1C). The remainder of this paper will focus on analyses of this learning behavior.

To understand this change in performance, we needed to define the learning period. Learning starts with the first trial. We reasoned two main effects should be evident to mark the end of learning. First, the mean number of choices to finish revealing a shape should diminish during learning. Second, the variance in the number of choices should also diminish as participants learned the shapes [cf.42]. To detect changes in these values, a changepoint detection test was run on the mean and variance of participants’ choices across all shapes and trials (see “Methods” section;43,44. The end of learning was set to the last changepoint (whether due to changes in mean or variance) that was detected across all shapes.

All participants learned to reveal the shapes. Participants learned to select one of the four highest expected reward tiles on the first choice in a trial (Fig. 1B). The proportion of such choices during learning (0.1615 ± 0.0268) was significant smaller than the proportion after learning (0.3519 ± 0.0541; paired t-test, t(df = 38) = − 4.1787, p < 0.0005). In addition, all participants outperformed both an algorithm that randomly selected tiles on every choice (Fig. 1C, green lines below data) and an algorithm that randomly selected tiles until a hit and searched locally thereafter (Fig. 1C, blue lines below data). This suggests that random choices and simple local search were not used to reveal the shapes. To confirm that participants learned to reveal shapes, we examined the learning curves for different shapes on the task (Fig. 2). A sample learning curve for one shape for M1 is shown in Fig. 2A, which illustrates how the total number of choices to finish revealing a shape decreased over the duration of the task (OLS, p < 1 × 10–10, β = − 0.0141 ± 0.0019), approaching optimal (thick blue line; see “Methods” section for how we computed the optimal number of choices). This pattern was evident across shapes (Fig. 2B, top; OLS, all β’s < 0, all p’s < 0.05, βmean = − 0.0129 ± 0.0023, Student’s t-test: t(df = 4) = − 5.5145, p < 0.01) and in four of five shapes in M2 (Fig. 2B, bottom; OLS, 4 β’s < 0, 1 β > 0, all p’s < 1 × 10–6; βmean = − 0.0071 ± 0.0044, t(df = 4) = − 1.6173, p > 0.1). We view this variability in monkey behavior as a boon for future investigation of the neural circuits underlying this learning. Humans showed quicker convergence to optimal numbers of choices by at least an order of magnitude (Fig. 2C; mean across shapes and subjects: β = − 0.2673 ± 0.0722, Student’s t-test: t(df = 4) = − 3.7025, p < 0.05).

### Single choices

Outcomes from each choice deliver both rewards and information about the hidden shape. To disentangle the relative contribution of reward and information on choice, we computed the expected reward outcome and expected information outcome for each tile, updated after each choice made by participants. For each tile, the expected reward is defined by the number of hits in the past from the choice of that tile divided by the number of times the tile was previously chosen. This definition was intended to capture the impact of short-sighted reward maximization based on recent reward history on selecting tiles to learn to reveal shapes. By contrast, expected information is computed from the expected change from a hit or miss in the entropy of the probability distribution over the possible shapes (see “Methods” section). Informally, one can think of these distributions as a set of probabilities, one for each shape. This distribution is updated during trials based on getting hits or misses and updated after the end of the trial based on which shape was observed. To illustrate, consider selecting tiles after shapes are learned. In that context, a high expected information choice would be one that ruled in or out about half of the remaining shapes, such as choosing one of the three tiles at the start of the trial where at least two shapes overlapped (see Fig. 1B), whereas a low expected information choice would be one that failed to rule in or out any shapes, such as choosing a tile that is never part of a shape. Unlike recent reward maximization, maximizing expected information is more helpful for learning shapes because it rules in or out shapes altogether and so includes tiles that are rarely or never selected. Expected reward and expected information across choices were weakly uncorrelated (humans: ρ = − 0.15; both monkeys: ρ = − 0.09).

To assess the impact of reward and information on choices, a multinomial logistic regression was performed to regress choice number in trial against trial number in session, the expected information for the chosen tile, and expected reward for the chosen tile (Fig. 3). A positive regression coefficient implies that the effect of the variable was to select a tile earlier in a trial, whereas a negative coefficient implies the effect was to select a tile later in a trial. For human participants, the mean coefficient across participants and choice numbers in a trial for expected information was positive and significantly different from the negative mean coefficient for expected reward (t(df = 14) = 6.18, p < 1 × 10–4; mean βinfo = 1.37 ± 0.41; mean βreward = − 0.19 ± 0.14). In humans, then, greater expected information correlated with higher probability of earlier choices of tiles whereas greater expected reward correlated with higher probability of later choices of tiles. Paired t-tests of expected information and expected reward coefficients for each choice number revealed that information influenced choice significantly more than reward (t(df = 14) = 6.21, p < 5 × 10–5).

The same outcomes correlated with the choices of monkeys. As in humans, expected information more positively influenced choice than expected reward in both monkeys; that is, in monkeys, higher expected information predicted earlier selection of a tile in a trial whereas higher expected reward predicted later selection. Across choice numbers in a trial, M2 was only marginally more driven by information (t(df = 7) = − 2.03, p = 0.0822) but significantly more driven by reward than M1 (t(df = 7) = − 2.96, p < 0.05). The influence of information on choice was greater than reward in M1 (t(df = 7) = 3.66, p < 0.01) and M2 (t(df = 7) = 3.46, p < 0.05).

### Pairs of choices

After considering whether expected outcomes drove participants' single choices, we explored how participants made sequences of choices, starting with pairs. We wondered if participants explored the grid to learn the hidden shapes by choosing neighboring tiles. Humans and to a lesser extent monkeys increased their probability of choosing a neighboring tile after a hit over time (Fig. 4A, left column; OLS; humans, first row: β = 0.0037 ± 0.0002 fraction of choices of neighboring tile after hit/trial, p < 1 × 10–33; M1, second row: β = 0.00013 ± 0.000018, p < 1 × 10–10; M2, third row: β = 0.000055 ± 0.000013, p < 1 × 10–4). In contrast, the probability of choosing a neighboring tile after a miss decreased over time for humans and for monkeys, albeit more weakly (Fig. 4A, right column; OLS; humans, first row: β = − 0.0014 ± 0.0002, p < 1 × 10–14; M1, second row: β = − 0.000024 ± 0.000013, p = 0.0695; M2, third row: − 0.000091 ± 0.0000093, p < 1 × 10–20). In addition, following hits, humans showed a greater increase in their tendency to select neighboring tiles that were also hits than monkeys (Fig. 4B; OLS; Humans: β = 0.0046 ± 0.00024 fraction chose neighboring hit after hit/trial, p < 1 × 10–26; M1: β = 0.00018 ± 0.000020 fraction of choices neighboring hit after hit/trial, p < 1 × 10–16; M2: β = 0.00014 ± 0.000015 fraction chose neighboring hit after hit/trial, p < 1 × 10–9). In sum, while members of both species tended to more frequently choose hits following hits over time, humans increased their probability of doing so more quickly than monkeys.

To explore which variables influenced these choices, we ran a mixed-effects binomial regression on human participants’ choices. The dependent variable was choice of neighboring tile (= 1) or not (= 0), and the independent fixed-effect variables were trial number in session, choice number in trial, information outcome from the previous choice, expected information for the current choice, reward outcome from the previous choice, and expected reward for the current choice and independent random-effect variable of subject identity. All main effects were significant (p < 0.05, Bonferroni corrected). The effect of information and reward outcome was positive (βinfo = 1.75 ± 0.069; βreward = 1.34 ± 0.056), indicating that larger information or reward outcomes predicted choice of a neighboring tile. Larger information or reward outcomes tend to result from hits; since shapes were connected filled tiles, after a hit, some number of the neighboring tiles will typically also be parts of shapes. The effect of expected information and expected reward were negative (βexp_info = − 0.87 ± 0.091; βexp_reward = − 0.51 ± 0.061), indicating that larger expected information or reward predicted choice of a non-neighboring tile. Larger expected information or reward tend to result from misses; if a choice is a miss, then the probability that a shape is nearby is low, and so the next choice should be in a different part of the grid. The same regression was run on monkeys, which revealed significant main effects of trial number, expected information, and information and reward outcomes but not choice number or expected reward (p < 0.05, Bonferroni corrected). The sign of the significant main effects matched the human participants (βinfo = 0.86 ± 0.0403; βreward = 1.41 ± 0.048; βexp_info = − 0.33 ± 0.045), consistent with similar motivations for choosing a neighboring tile or not. Participants were driven to search neighboring tiles by good outcomes and to search further away by bad ones.

### Sequences of choices

We were primarily interested in how subjects explored the grid to learn to reveal the shapes (Fig. 1B,C). Looking at participants’ sequence of choices when they did not choose neighboring tiles can reveal patterns of outcomes that aid in learning. Participants may search in a restricted area45, hunting for shapes in a part of the grid. We operationalized such an area restricted search as the choice of at least three neighboring tiles in sequence. We considered choices when participants decided to ‘jump’, a choice of a tile more than one tile away from the previous choice (i.e., the choice of a non-neighboring tile), after such area restricted search. A significant number of trials contain these sequences (humans: 0.28 ± 0.012 fraction of trials; M1: 0.27; M2: 0.29). These jumps are the result of decisions to sample a new area of the grid.

Both human and monkey participants tended to jump after these sequences of choices (three neighboring tiles) when the information intake dropped below the average information intake, computed from the information outcomes from all choices across all subjects but only during learning (Fig. 5A). This drop is the result of learning less about the current hidden shape, for example by getting misses that rule out fewer shapes. In humans, this pattern was not observed when these choices were plotted using reward outcomes and compared to the average reward outcome from all choices across all subjects during learning (Fig. 5B, left panel). This finding was confirmed using an alternative operationalization for the end of learning (see “Methods” section and supplement). Monkeys’ jumps were equally well-predicted by recent outcomes falling below the average information (Fig. 5A, middle and right panels) or below the average reward (Fig. 5B, middle and right panels). To explore the outcomes that drove this pattern of choice in humans, we separated the eight distinct sequences of hits and misses for three outcomes in sequence by the last outcome in the sequence, either a hit or a miss, and plotted the information outcomes before jumping based only on last choice misses and last choice hits (Fig. 5C, left panel). Information foraging was driven primarily by misses: the pattern disappears if misses are left out but not if hits are left out. This pattern was observed in M1 but not M2 (Fig. 5C, middle and right panels respectively).

The unexpected observation, of jumps to a different part of the grid when recent outcomes fall below the average, qualitatively matches foraging patterns of choices32. Animals can forage for a range of resources, including rewards32, information46, social interactions47 and more. When foraging, animals seek to maximize the rate of intake of these resources. A simple rule for maximizing intake rates is to compare the current resource intake rate to either the past average36 or expected37,38 intake. When the current rate drops below that value, the forager should switch from exploiting the current location to exploring for a new one. On our task, participants’ behavior matched this rule.

Does foraging predict how well humans or monkeys learn to reveal shapes? Answering this question required quantifying forager performance and comparing that performance to a measure such as how quickly the shapes were learned. To quantify forager performance, a forager score was computed for each subject for two periods: during and after learning (see “Methods” section). To compute this forager score, we calculated for each subject and each sequence of three neighboring choices followed by a jump whether the information outcomes from the two outcomes before the jump were above the subject’s average information outcome from all choices during learning (above: + 1 point; below: 0 points), above the jump outcome (above: + 1 point; below: 0 points), and whether the pre-jump outcome was above (above: + 1 point; below: 0 points) the average (these analyses were confirmed using a running average instead of the average from all choices as well). Importantly, this score is independent of the changepoint calculation (see “Methods” section). We next took the average across these scores by subject and compared them to the last detected changepoint. Better information foraging scores for humans during learning predicted earlier final changepoints and hence faster learning (Fig. 6; ordinary least squares (OLS), βslope = − 112.54 ± 26.78 trials/a.u. forager score, p < 0.0005, ρ = − 0.57). This finding was confirmed using an alternative operationalization for the end of learning (see “Methods” section and supplement). No such relationship was found for reward foraging scores (OLS, βslope = 41.85 ± 43.99, p > 0.3, ρ = 0.1545, not shown). During learning, human information forager scores (mean forager score FS = 0.67 ± 0.016) were not significantly different from M1 (one-sample t-test; M1 = 0.67; t(df = 38) = 0.36, p > 0.7) and marginally different from M2 (one-sample t-test; M2 = 0.70; t(df = 38) = − 1.90, p > 0.05). After learning, humans (mean FS = 0.56 ± 0.042) had significantly lower information foraging scores than both monkeys (one-sample t-tests; M1 = 0.72, t(df = 37) = − 3.82, p < 5 × 10–4; M2 = 0.72, t(df = 37) = − 3.84, p < 5 × 10–4). Further, humans foraged for information significantly more during learning compared to after shapes had been learned (Student’s t-test, t(df = 75) = 2.63, p < 0.05) whereas monkeys marginally increased (M1: 0.67 during learning, 0.72 after; M2: 0.70 during, 0.72 after). Human reward forager scores during learning (mean forager score FS = 0.3233 ± 0.0115) were significantly lower than after learning (mean forager score FS = 0.3879 ± 0.0223; Student’s t-test, t(df = 75) = − 2.5901, p < 0.05). Humans also showed significantly lower reward forager scores during and after learning than M1 (during learning FS = 0.4082, one-sample t-test, t(df = 38) = − 7.3662, p < 1 × 10–8; after learning FS = 0.4387, one-sample t-test, t(df = 38) = − 2.2276, p < 0.05), and no significant difference during learning but significantly higher forager score after learning than M2 (during learning FS = 0.3063, one-sample t-test, t(df = 38) = 1.4780, p > 0.1; after learning FS = 0.2765, one-sample t-test, t(df = 38) = 4.9921, p < 0.0001). In sum, better information foraging in humans predicted faster learning (as judged by their last changepoint) and humans ceased information foraging after learning.

## Discussion

The challenge of learning environments with many latent features is difficult for any organism. To investigate how humans and monkeys learn these environments, we implemented a search task with many hidden shapes. We discovered that (1) for single choices, informative tiles were preferred earlier in trials than rewarding ones; (2) for pairs of choices, information outcomes predicted decisions to choose a neighboring tile; (3) for sequences of choices, decreases in information outcomes below the average of the environment predicted decisions to sample new areas of the grid, a signature of information foraging; and (4) the degree to which humans’ choice sequences matched foraging patterns predicted learning.

Adaptive decision making in the real world requires learning latent features of the environment. Learning such features requires participants to make multiple decisions over extended periods of time in constantly changing environments and to keep track of outcomes across different timescales. At shorter timescales, outcomes from individual choices provide momentary evidence about the current environment, such as revealing a hit or a miss. Series of outcomes from sequences of choices are at longer timescales and can fully reveal environmental features, such as a shape. At even longer timescales, multiple latent environmental features might be learned, such as many different shapes. Despite the importance of this critical cognitive skill, scant studies have investigated this type of learning at multiple timescales. Here, we probed for the first time in both humans and monkeys how latent features are learned in a temporally extended serial decision-making task where the environment constantly changes and outcomes must be tracked both during trials and across trials.

The drive to learn environments over many decisions can be motivated by the search for reward or for information. We uncovered general preferences for informative options over recently rewarded ones across species at different timescales. On individual trials, tiles that were expected to be more informative tended to be selected sooner than rewarding ones for both humans and monkeys. Classic48 and computational14 reinforcement learning theories use rewards to assign credit to features in the environment. Many improvements to reinforcement learning have been proposed to deal with complexity in the environment, such as the introduction of options that group together many actions49, the use of successor representations50, or the use of belief distributions in partially observable environments to infer latent features from perceptible ones51. However, reward-driven reinforcement learning initially requires many samples over long periods of time to learn reward contingencies52, especially if rewards are rarely delivered and the environment is constantly changing. Instead of reinforcement through the use of rewards, information in the form of changes in the identity and frequency of environmental features can be used to speed up learning by reducing uncertainty—formalizable in terms of information theory53,54—about which features tend to occur together.

Humans are excellent information seekers55,56, efficiently using information to learn their environments57,58 and make inferences about the world59,60,61,62,63,64. Information-based approaches to learning either use that information directly in reinforcement-like processes57,58,65,66,67,68,69 or indirectly in causal or structural inference8,70,71,72. Since animals including humans often do not know the most informative choices, they must rely instead on hypothesis testing73,74,75,76,77,78,79,80,81,82, information foraging41,46,83,84,85,86,87, or maximizing expected information gain57,58,59,88,89,90,91,92,93 when making choices. Past studies have applied information-based approaches to a wide range of tasks, all with single choices and few latent features23,59,61,63,64,69,84,94,95,96,97,98,99,100,101,102,103,104 [for review see56]. Sequential information search is just beginning to be explored in relatively simple environments with one or two features23,64,101,105,106,107. The recent focus in cognitive psychology on learning of latent features use tasks with environments that also often contain one feature and, further, either fail to investigate learning over different timescales (within and across trials)108,109,110, lack a decision-making component111, utilize single choices during trials112,113,114,115,116, or lack the choice complexity of the real world117, though humans do adaptively explore for information in the hope of maximizing long-term rewards19. In our study, we extend this information seeking competence to temporally extended learning of many latent features by uncovering evidence for information-based foraging algorithms.

An unexpected and novel finding of our study was a signature of information foraging behavior that predicted the speed of learning latent features. Many trials in both humans and monkeys showed sequences of choices that reflected an area restricted search (ARS), persistent searching through a limited spatiotemporal region that is a hallmark of foraging behavior32,45,118,119,120. During learning, both humans and monkeys engaged in ARS for information. Decisions to search a different part of the grid were predicted by drops in information intake below the average across all choice outcomes. This average threshold rule is a signature of a standard computation during foraging: using the average outcome from choices across the environment to guide decisions to continue foraging locally or to leave the local area to look for new resources36,37,38. Importantly, humans that better matched the pattern of such foraging learned shapes more quickly. Finally, in humans, this local search strategy was abandoned in favor of a reward-driven strategy once the shapes were learned. While exploratory, this finding can be used to ground predictions about information foraging in our and similar serial decision-making tasks.

Information search and foraging is important for theoretical approaches to complex cognition46,83,118,121. The search for information forms the basis for understanding visual search84,122,123 or chemotaxis85,124. Information search also plays a role in explaining behavior on numerous tasks, including past studies on Battleship-like search tasks57,58,125. We extend these studies by reporting on the properties of sequences of choices to reveal evidence for a foraging information-intake threshold rule. Previous research41,126 has revealed indirect evidence for information foraging for humans surfing the internet. However, unlike our work reported here, these studies measured information in terms of associations between linguistic concepts127. In contrast, our findings are based on non-verbal search and are more generalizable to other effects reported in the literature. The foraging framework more generally has been extended to numerous cognitive processes, including task-switching128, internal word search129, study time allocation34, problem solving130, memory search33, semantic search131, and even social interactions132. The foraging effects in these studies address reward-driven or performance-driven shifts between options. Our study reveals, for the first time, direct evidence for visuospatial information foraging in a cognitive task to learn latent features of the environment, extending the foraging framework to a new class of tasks.

Foraging of all kinds involve costs, such as time and energy spent handling resources or pursuing opportunities. In our task, human participants made mouse movements, which involve costs due to clicking133, movements134, or even due to interactions with executive control135. Monkeys responded with eye movements which are also costly, such as due to planning136 or saccade inaccuracies like longer paths137 or more variable endpoints138, and that are best described by models that include costs139,140,141. A consideration of the role of such costs in decisions for our task should inform any future modeling effort. However, our task lacks other costs such as dangers due to predation or opportunity costs that are omnipresent in foraging generally. A possible way to explore the mechanisms of choice behaviors on our task would be to add explicit costs, such as a penalty for making a miss. However, despite this possibility, we hypothesize that learning would not be any faster. First, getting a miss is effectively a kind of penalty in the sense of a waste of time and energy. Second, the central problem facing the organism is the huge number of possible shapes. Including penalties in the reward function does not reduce the complexity of the state space, which is the set of all possible shapes, that participants needed to navigate. The only plausible solution on the task, in the sense of learning the shapes more quickly than the inevitable heat death of the universe, is to use information to guide one’s choices. Penalties do not negate the advantages of that strategy.

The dictates of optimal foraging theory assume that foragers have knowledge of foraging-relevant parameters, such as average intake rates, frequency of patch types, and so forth [see, e.g.,142]. That animals might learn these parameters has been proposed for as long as the theory has been in existence40,143. We extend these discussions in two fundamental ways, one conceptual and the other empirical. Conceptually, we extend foraging for learning by finding evidence that animals forage for information in order to learn about states of their environment, not just for information about the value of foraging-relevant parameters. Empirically, we uncovered a startling foraging effect: the better that a basic pattern of foraging choices describes a participant’s behavior, the quicker the shapes were learned. Both are novel contributions to the literature on foraging. We also found that humans seemed to shift to reward foraging after learning in two senses. First, their information foraging score significantly decreased after learning. Second, their reward foraging score significantly increased after learning. These observed changes in the forager score suggest that there was a shift in humans’ behavior from information gathering to reward gathering after learning. This finding might be a feature of search behavior during foraging more generally. At the start of foraging in a novel environment, demands on organisms to gather information might prevail over motivations for gathering rewards. Once information has been gathered, however, organisms might shift to an emphasis on gathering rewards. While numerous studies have investigated how organisms learn foraging-relevant parameters, to our knowledge none have explicitly explored whether such information search can be described as a temporal sequence of information foraging first and then reward foraging second. We submit that our findings inspire the general prediction that shifts to searching for reward will come after a period of foraging best-described as adherence to an average information threshold rule to guide exploration.

We uncovered similar preferences for information in humans and both of our monkeys, though the extent to which animals other than humans search for and use information is less well characterized than it is for humans. Many animals including humans show a preference for advanced information about outcomes that cannot be used to adapt behavior (so-called ‘observing responses’144,145,146,147, including pigeons148, starlings149, rats150, monkeys24,25,26,28, and humans27,151,152). However, these studies tend to use environments with very few features and in the absence of information that can be used to help make later decisions. More recently, the preference for useful information, information that can be used to attain future rewards, has been studied in monkeys29,30,31. However, these studies focus on information gleaned on single trials and do not probe the cognitive capacity to learn multiple features over multiple timescales. Extending previous studies that show that monkeys can recognize and recall simple shapes153, we report the discovery that like humans, monkeys prefer to seek out useful information in learning shapes. Monkeys also ‘over-explored’ for information, persevering in information foraging even after shapes were learned. This difference may be due to humans’ familiarity with game playing and shapes formed from blocks.

Our study has several limitations. First and foremost, we used a small number of shapes and tested a single shape set. While the shape set was selected because of the opportunity to compare learning in humans to monkeys, these findings remain to be generalized to other shape sets and other contexts, such as bigger grids. Second, our results crucially rely on various assumptions that can be challenged. For example, we used inferred changepoints as a measure of learning. Changepoints in the mean or variance of choices, however, might instead be the result of other processes such as attention, arousal, or boredom. We also assumed that participants’ choice behavior can be understood as search in a restricted area (some contiguous set of tiles on the grid) and decisions to choose a non-neighboring tile reflect decisions to move to a different restricted area. Future experiments will need to manipulate confounding factors like attention, arousal, or boredom and test assumptions about restricted areas. Third, our study focuses on information in light of the pursuit of reward. While our findings suggest that information foraging can result in faster learning and higher long-term reward rates, our study does not probe the search for information for information’s sake154. Such intrinsically motivated information search remains difficult to probe, especially in nonhuman animals that require special alimentary motivation. Consequently, our results may not stand in contexts where the search for information is its own reward. In addition, fourth, we defined expected reward and expected information in terms of the outcomes from just the next choice; since information is used to maximize long-term reward rates, these two variables are confounded on our task in the long run. Finally fifth, we did not model the choice algorithms underlying decisions on the task. Our results, however, can be used to construct such algorithms; indeed the emphasis on near-term reward over information in our analyses implies that model-free reinforcement algorithms14, which myopically focus on near-term rewards, will poorly describe the behavior. We intend to use our findings as a guide to future modeling.

While many studies have investigated how animals learn their environments, few have explored this learning in environments with multiple latent features using sequences of choices and across timescales. We discovered that, in a complex sequential decision-making task with many possible shapes, humans and monkeys were both driven more by expected information than recent reward. Evidence also showed that humans and monkeys foraged for information by sampling different areas of the grid in line with predictions from foraging theory. Finally and unexpectedly, the degree to which sequences of choices in humans matched foraging choice sequences predicted the speed at which the environment was learned. This finding suggests that decision making circuitry that evolved for searching for nutrients and other environmental resources may also be used to learn about the environment using more abstract resources like information.

## Methods

Our shape search task required participants to uncover shapes (composed of multiple tiles) in a 5 × 5 grid by selecting tiles that hide parts of the shapes. Herein, ‘hits’ refers to choices that revealed part of a shape and were rewarded, and ‘misses’ refers to choices that did not reveal part of a shape and were not rewarded. Rewards were points for humans and squirts of juice for monkeys. We used five distinct shapes (Fig. 1A; 1 shape (‘H’) occupied 7 tiles, and the others occupied 4 tiles) in one set. A trial refers to the sequences of choices required to finish revealing the shape. Humans and monkeys performed as many trials as possible per session. There was one shape to uncover per trial, pseudorandomly drawn from the set of five. The color of shapes pseudorandomly varied from trial to trial. Shapes occurred at the same location across trials and participants, with a different location for each, and shapes overlapped at certain tiles. The shapes were selected on the basis of pre-existing data for the two monkeys that was collected while training them for a distinct electrophysiology study and were arbitrarily chosen by the experimenter (DLB). The set of five shapes did not change during the task, though participants were not instructed with regard to either the number of shapes or locations.

At trial start, participants made a movement to a target at the center of the screen (humans: mouse-over; monkeys: saccade), maintaining position on the fixation point (humans: 500 ms; monkeys: 500–750 ms) after which targets appeared in the middle of each remaining tile. After a second delay (humans: 500 ms; monkeys: 500–750 ms), the fixation point disappeared and participants had unlimited time to choose a target. To select a tile, participants made a movement (humans: mouse-over; monkeys: saccade) to a target at the center of the tile and held their position (humans: 250 ms; monkeys: 250–500 ms). A hit or miss was then revealed at the chosen tile location. After an inter-choice interval of 1 s, the fixation point reappeared. Participants had to reacquire fixation between every choice. This sequence repeated until the shape was fully revealed, which was followed by a 2 s free viewing period and then a 1 s inter-trial interval.

We report all methods consistent with ARRIVE guidelines. Two monkeys (M. mulatta; male; aged 7y and 9y) performed the task sitting in a primate chair (Crist) with their heads immobilized using a custom implant while eye movements were tracked (EyeLink; SR Research). There were no exclusion criteria used and we report the behavior from the first two monkeys trained on the task. All surgeries to implant head restraint devices were carried out in accordance with all rules and regulations and performed under strict Institutional Animal Care and Use Committee approved protocols (Columbia University) in fully sterile surgical settings under isoflurane anesthesia. Monkeys received analgesics and antibiotics both before and after surgeries. After recovery from surgery, animals were first trained to look at targets on a computer screen for squirts of juice. They were then trained to make delayed saccades, maintaining fixation on a centrally presented square while a target in the periphery appears. Once the central square extinguished, animals could then make an eye movement to the target. Next, they were trained on the shape search task, starting with a 3 × 3 grid, then a 3 × 4, and so on until a 5 × 5 grid. The data presented herein are from the first set of shapes both monkeys learned on a 5 × 5 grid.

Humans (H. sapiens) performed the task using a mouse on a computer. The task was programmed in javascript. They received the following instructions before the first trial:

Hello! Welcome to the battleship task!

———————————————————

The goal of this task is to identify the shapes

with the fewest number of searches. To select a shape

please hover over the central fixation point for 0.5 s

After the targets appear, remain fixed on the central point for

another 0.5 s until the blue central fixation point disappears.

Then you are free to select a target by hovering over the target in the

desired square. Once you have found a shape, the task will start over.

Thank you, and you will be debriefed at the end. Good luck!

As indicated by the instructions, human participants chose tiles by mousing over the target until the choice registered and that tile was revealed. There were no training trials and we imposed no exclusion criteria on participants based on the number of trials completed. We collected data from 42 human participants (16 m, 26f, average age 22y ± 4.9), recruited from the New York City community around Columbia University (most were Columbia University students). All procedures were approved by the Columbia University IRB and performed in accordance with all relevant guidelines and regulations. All participants gave their informed consent for the experiment.

Performance on the task was assessed by calculating the proportion of reward maximizing choices on each trial. A reward maximizing choice is a choice of a tile that maximized the expected reward given what has already been revealed on the grid. For example, suppose that no tiles have been chosen yet at the start of the trial. The reward maximizing choice, then, is to select one of the four tiles at which two shapes overlap, because given what has been revealed so far (i.e., nothing), those tiles maximize the expected reward. For a second example, suppose that the first choice in a trial eliminates all but two of the possible shapes. Then, if the two remaining shapes do not overlap at any further tiles, the choice of any tile that is part of either of the two shapes, but no other tiles, would be reward maximizing: given what has been revealed so far, only those tiles have any associated reward and they all have the same associated reward. For each choice in each trial, the associated rewards for each remaining tile were calculated, and then the participants' actual responses compared to these calculations. The proportion of these reward maximizing choices was computed for every trial and plotted in Fig. 1C.

To compare this behavior to some baseline choice strategy, we simulated two agents. The first agent chose randomly, selecting any one of the remaining tiles with equal probability after each outcome. The second agent chose semi-randomly: if a hit had not yet been made on a trial, the tile selection was one of the remaining tiles with equal probability after each outcome; otherwise, if a hit had been made, any one of the adjoining remaining tiles was selected with equal probability. This semi-random strategy performs a local area search after getting the first hit. Because of the random nature of the choice, the algorithm can sometimes result in a dead-end, in which case it once again selects any of the remaining tiles with equal probability. A similar algorithm restricted to randomly choosing a city-block neighbor also performed poorly (not depicted). For humans, simulations were performed for the maximum number of trials across all participants (that is, the participant who performed the most trials was determined, and then the simulation performed for that number of trials; maximum number of trials = 109) (Fig. 1C, top panel), and for monkeys, for the same number of trials performed by each subject (Fig. 1C, middle and bottom panels). For each simulated trial, a shape was randomly drawn from the set of five, and the simulated random chooser selected tiles until the shape was completed. The simulation was iterated 100 times, the proportion of expected reward maximizing choices computed for each trial, and then these performances were averaged and the standard error of the mean (s.e.m.) computed and plotted (Fig. 1C, first agent: blue points and error bars; second agent: green points and error bars).

We assessed participants’ learning curves for each shape by plotting the total number of choices required to finish revealing a shape across all trials for that shape to the optimal number of choices for that shape. For most choices in a trial, there was more than one optimal choice, the choice that maximized the expected reward. To determine the optimal number of choices, we averaged for each shape 1000 simulated trials of an optimal chooser, which always chose one of the expected reward maximizing tiles. These are plotted as the thick blue lines in Fig. 2 and the values were: upper-left ‘L’: 5.4400; square: 5.0690; middle-right tetris: 5.4490; lower ‘L’: 5.0480; ‘H’: 8.0400. To statistically assess changes in these learning curves, we used ordinary least squares to regress number of choices to finish revealing a shape against the trial number for that shape.

To quantify how quickly participants learned, we used a changepoint detection test on the mean and variance of choices for each separately43,44. The changepoint detection test (cf. Gallistel 2001) takes the cumulative sum of the number of choices used to finish a shape and looks for changes in the rate of accumulation. During learning, participants would take many choices to finish revealing a shape. The cumulative sum over these trials would rise correspondingly quickly. Once shapes were learned, however, the cumulative sum would rise more slowly as fewer choices are required to finish revealing a shape. At each successive trial, this cumulative sum is calculated, and the log-odds of a change in slope are computed and tested against a changepoint detection threshold (set to 4, corresponding to p < 0.001). The changepoint detection test was run on the set of trials for each shape separately. In addition to changes in the mean number of trials to finish revealing a shape, a change in the variance of the number of choices over some window may also signal learning; as participants learn, they will become less variable in the number of choices needed to finish a shape. After running the changepoint detection test on the cumulative sum of choices, the value of the best-fit line through each detected changepoint interval was subtracted from the number of choices on each trial. The result is a vector of residuals for the number of choices to finish revealing a shape. The variance over a moving window of 5 trials was then computed for these vectors, and the changepoint detection test performed on the cumulative sum of that variance (cf. Inclan and Tiao 1994). Monkeys but not humans learned the shapes over multiple days. This variance was not calculated for those sets of 5 trials that spanned days. However, changes in biological and psychological processes irrelevant to learning, such as arousal, wakefulness, and so forth, can spuriously contribute to the calculated variances. A day-to-day variance correction was performed to control for this: the variance in the 5-trial-wide window was divided by the global variance across all trials and shapes for the respective day. The cumulative variance changepoint detection test was then performed on the variance-normalized-by-day data. The end of learning was set to the last detected changepoint trial across all shapes for both mean and variance changepoint detection tests (mean last changepoint trial for human: trial 45.72 ± 3.56; M1: trial 909; M2: trial 1191). This test failed to detect a changepoint for 3 of 42 human participants, who were removed from the learning analyses as a result. All analyses using the end of learning as determined by the changepoint detection test were validated using an exponential decay threshold to determine the end of learning instead (see supplement).

The shapes on our task were formed by possible combinations of connected, filled tiles on a 3 × 3 grid that were then placed on the 5 × 5 grid. The shapes had to be at least 3 tiles large, no bigger than 8 tiles, and connections had to be in the vertical or horizontal directions (i.e., no diagonal-only connections permitted). To simplify information calculations, the set of states (nstate = 3904) was assumed to be known to the participants. While both monkeys had been trained on smaller grids using these possible states, the humans were naïve to the task and to the distribution of states. Consequently, this assumption is strictly speaking false for humans, but we adopt it for numerical reasons.

A distribution of probabilities for each shape across trials was used to compute information outcomes. A Dirichlet distribution, which possesses a conjugate prior when using a multinomial likelihood, was used to model those probabilities. The Dirichlet distribution sits on the K-1-simplex such that it can be conceptualized as a distribution of K-dimensional distributions. The dimensionality K in the shape search task refers to the number of shapes (3904). The probability of shapes at the start of trials was calculated from a vector that stored the count for each shape. At the end of each trial, the Dirichlet distribution was updated by adding one count to the vector element corresponding to that trial’s shape, and the maximum a posteriori estimate of the distribution recalculated. To ensure numerical stability and proper updating using the maximum a posteriori estimate of the Dirichlet distribution, some initial number of samples for each shape is required. The Dirichlet distribution is characterized in part using so-called inertial priors described by a series of parameters αi, …, αn in a multivariate Beta distribution. These priors are akin to the assumption that each shape has been seen some αi number of times. Here we assume all the αi are equal. For each shape, the larger these values, the larger the number of new samples needed to shift the probability mass of the distribution away from them. The maximum a posteriori estimate subtracts one from the count for each shape, and conceptually, counts cannot be less than 0, suggesting an initial count of 1 for each shape. However, an initial count of 1 would yield a division by 0, so for numerical stability some value above 1 is needed. Since shapes are either seen or not seen, conceptual considerations suggest an integer value, and 2 was chosen as the simplest, smallest, model-free numerically stable initial count for each shape.

We calculated information outcomes during a trial using a second probability distribution, which we label ‘$${\mathfrak{B}}$$’. At the start of the trial, $${\mathfrak{B}}$$ was set to the values in the Dirichlet distribution. After each choice in the trial, $${\mathfrak{B}}$$ was updated using a vector of 1’s and 0’s for the likelihoods of each shape given the outcome of that choice. If a given shape was consistent with the outcome, then the likelihood was unity; otherwise the likelihood was zero. The prior probability was multiplied by the likelihood of the outcome to yield a posterior for each state. The resulting probabilities were then re-normalized to attain the new $${\mathfrak{B}}$$, which was then used as the prior for the next choice in the trial. Information was defined as the difference in the Shannon entropy H of $${\mathfrak{B}}$$ before and after a choice outcome:

$$H_{{\mathfrak{B}}} = - \mathop \sum \limits_{i = 1}^{3904} p\left( {x_{i} } \right)*log(p\left( {x_{i} } \right)$$

for probability of ith shape p(xi). This assessment of information intake intuitively reflects how much a hit or a miss from a choice changes the participant's uncertainty about the current trial's shape.

To investigate the influence of expected rewards and expected information outcomes on choice, we performed a multinomial regression (mnrfit in MATLAB). A multinomial regression simultaneously fits choice curves relative to a reference option in the choice set. The dependent variable was the choice number in the trial (first choice, second choice, etc.). The z-scored independent covariates included trial number in session, expected rewards for the selected tile, and expected information for the selected tile. Expected reward for a tile was defined as the number of times the tile proved rewarding divided by the number of times the tile was chosen. This one-step time horizon was selected to investigate the impact of near-term pursuit of reward on learning latent features. Expected information for a tile was defined as the mean change in entropy of $${\mathfrak{B}}$$ if chosen and a hit was revealed or a miss was revealed. No interactions were included in the regression. The regression fits the following model to the data:

$$\log \, ({\text{p}}_{{{\text{i }} < {\text{ j}}}} /{\text{ p}}_{{{\text{i }} \ge {\text{ j}}}} ) \, = \,\upbeta _{0} + \,\upbeta _{1} *{\text{t}}_{{\text{n}}} + \,\upbeta _{2} *{\text{EI }} + \,\upbeta _{3} *{\text{ER}}$$

for choice numbers i and j, trial number tn, expected information EI, and expected reward ER. Choice numbers refers to the choice number in a trial, where the tile selected first by a participant is choice number 1, the tile selected second by a participant is choice number 2, and so on. In addition, we truncated the data to consider only those choices before the 10th choice in a trial; this truncation was performed in order to exclude choices that were very late in trials due to inattention, fatigue, or indolent choice strategies and that occurred very infrequently, as well as to focus on those tile choices that were motivated by the participant instead of forced by the low number of options remaining on the grid. The results of this regression are plotted in Fig. 3.

We analyzed pairs of choices by performing a mixed-effects binomial regression (fitglme in MATLAB). The dependent variable was choice of neighboring tile (1 = chose a neighbor, 0 = did not choose a neighbor). The first choice on each trial was removed for this regression (because there is no previous choice for comparison). The fixed-effect independent variables included trial number in session, choice number in trial, last choice information outcome, current choice expected information, last choice reward outcome, and current choice expected reward. The random-effect independent variable was subject identity. Fixed-effect independent variables were checked for correlation; pairwise R2 values were all at or below 0.1385 except for the correlation between information outcomes and reward outcomes (R2 = 0.5593). Significance was assessed against Bonferroni corrected p-values at 0.05. A similar binomial regression was run in the monkeys. Pairwise R2 values were all at or below 0.0950 except for the correlation between choice number and expected information (R2 = 0.2998) and information outcomes and reward outcomes (R2 = 0.4006). All variance inflation factors for humans and monkeys for the main effect covariates were less than 2.51.

To plot sequences of choices, each choice on each trial was sorted according to whether the previous choice had been a neighboring tile. Those choices that were non-neighboring were labeled 'jumps’. For each jump, the three previously chosen tiles were examined to see if they were neighbors. If so, the sequence was included in the analysis; otherwise the sequence was left out. Such a ‘lookback’ of 3 choices before a jump to characterize sequences was selected because more than 3 resulted in very few choice sequences, less power, and many fewer participants, whereas fewer than 3 included many incidental two-choice sequences. Next, every information or reward outcome from every choice (whether part of the sequence or not) was computed to find the average across all choices. Reward outcomes were determined by whether a reward was received or not. Information outcomes were determined by taking the difference between Shannon entropies of the distribution before and after a choice outcome. Finally, the average information (Fig. 5A) or reward (Fig. 5B) for each choice in the sequences as well as the global average across all choices and subjects during learning was plotted, revealing the evidence for an average intake threshold rule.

Participants’ ability to forage was assessed using a custom ‘forager score’. Foraging refers to decisions made in a sequential, non-exclusive, accept-or-reject context where options occur one at a time, foragers can accept or reject them, and rejected options can be returned to155,156. A classic foraging decision is ‘patch leaving’ where foragers must decide whether to continue foraging at a resource patch or to leave that patch to search for a new one. In the shape search task, we operationally defined a resource as a connected set of tiles, and the decision to leave a resource was defined as a jump. We considered three features of choice sequences that reflect foraging [inspired by32,36,41]. First, while participants decide to stay at a resource, the information or reward gained from choice outcomes should be above the average for the environment. Second, choice outcomes prior to deciding to leave a resource should be below the average. Finally third (and as a consequence of the first two), outcomes preceding stay decisions should be above those preceding leave decisions. We constructed a forager score on the basis of these three features. For the kth information outcome ki ranging from three choice outcomes before a jump to the outcome just before a jump, pre-jump information outcome ji, average information outcome I, and subject s, let the forager score Fs be

$$F_{s} = \, \left( {\sum \, \left( {k_{i} > I} \right) \, + \, \left( {j\hbox{-}_{I} < I} \right) \, + \, \sum \, \left( {k_{i} > \, j_{i} } \right)} \right) \, / \, 5,$$

where x > y is 1 if true and 0 if false. I was calculated separately for each participant by taking the average of all information outcomes during learning. We verified this finding by using a running average as well, where each of the outcomes in a sequence was compared to the average information outcome up to and including that sequence. A score of 5 perfectly matches a foraging pattern of choices (two choice outcomes prior to pre-jump above average + pre-jump outcome below average + two choice outcomes prior to pre-jump above pre-jump outcome). The forager score was computed for every sequence of three choices of neighboring tiles in sequence followed by a jump and averaged by subject. The final changepoint, which we used to quantify the end of learning, was then regressed against the average Fs (Fig. 6). As a comparison for this analysis, a similar score was constructed for rewards that used the reward outcomes following each choice in the sequences of three choices and the average reward outcomes across all choices.