## Introduction

In recent years, non-human primates (NHPs) have seen increased interest as animal models for human diseases due to the advent of transgenic primates and genome-editing technologies1,2. As NHPs are closer to humans than rodents with respect to, e.g., physiology, cognition, genetics, and immunology3,4,5,6,7,8,9,10, results from NHP studies investigating cognition are likely more representative of the situation in humans.

In visual neuroscience, attention, object formation, categorization, and other aspects of cognition are extensively studied. In auditory neuroscience, several studies have also used different tasks (e.g., 2-alternative forced choice, go/no-go) to probe different cognitive functions (such as memory, categorization, and reward processing11,12,13,14). In general, though, studies of auditory cognition lag behind those of visual cognition with respect to the overall sophistication of methods, experiments, and task complexity. One factor is the common observation that monkeys are notoriously difficult to train in the auditory domain and generally display a bias towards vision. For example, it has been shown that baboons can easily learn to locate food items based on visual but not auditory cues15. Among other results, this surprising failure at such a seemingly simple auditory task led the authors to suggest that inferential reasoning might be modality specific.

However, investigations into auditory capabilities and cognition are increasing in scope as NHPs have become genetically tractable organisms1,2,16,17,18. Notably, the common marmoset (Callithrix jacchus) has become a valuable model for biomedical research in general and the neurosciences in particular19,20,21. Factors such as the relative ease of breeding, early sexual maturation, and short life span22,23 have contributed to the rapid generation of genetic models of human mental and neurological diseases in marmosets1,24,25,26. While marmoset training generally lags behind the sophistication of cognitive NHP experiments traditionally performed with macaques, the auditory capabilities of marmosets have been investigated extensively27,28,29,30,31,32. Furthermore, marmosets have now also become the go-to NHP model for hearing loss and cochlear implant research33,34,35,36. In the near future, many more transgenic primate models will be developed, which will require extensive phenotyping, as is standard for rodent models37. Phenotyping will need to investigate large numbers of subjects in a standardized and experimenter/observer-independent manner38,39,40,41,42,43,44. In addition, increased awareness of species-specific ethical demands calls for refining experimentation techniques as much as possible45,46. This has led to efforts to develop home-cage, computer-based cognitive training of NHPs focusing on the visual domain47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63.

To achieve comparable efforts in the auditory domain, there is a need for automatic, unsupervised cage-based training and testing of auditory tasks. Towards this goal, we built a standalone wireless device for auditory training and testing of common marmosets, directly in their own cage. The system, termed marmoset experimental behavioral instrument (MXBI), is mostly comprised of off-the-shelf or 3D-printed components, is entirely programmed in Python, and is based on the Raspberry Pi platform, for maximum flexibility of use, openness, and easy adaptation by others. The MXBI is set up with a server/client configuration in mind and is capable of animal tagging by means of radio-frequency identification (as in rodent systems64), which ultimately allows scalable, standardized, automated, and unsupervised training and testing protocols (AUT in short, from ref. 48) in socially housed animals. Moreover, the MXBI and the procedures we describe contribute to the efforts of refining cognitive and environmental enrichment of NHPs in human care. Further, we report results from a set of four experiments: (1) an algorithm-based procedure for gradually and autonomously training naïve animals on the basics of a 2-alternative-choice task (2AC visual task); (2) an audio-visual association experiment in which a conspecific call is contrasted with an artificial acoustic stimulus; (3) a generalization experiment assessing the flexibility of the acquired discrimination behavior with other stimuli; and (4) a psychoacoustic detection experiment for quantifying hearing thresholds in a cage-based setting. We show that marmosets can be trained to flexibly perform psychoacoustic experiments on a cage-based touchscreen device via an automated, unsupervised training procedure that requires no human supervision and relies neither on fluid or food control nor on social separation.

## Results

In this study, 14 adult common marmosets (Callithrix jacchus) of either sex, housed in pairs, participated in one initial training phase and four autonomous cage-based experiments. Animals were generally trained in pairs on auditory tasks with a single MXBI attached to the animals’ home cage and without fluid or social restrictions (Fig. 1a). Aside from the initial training (see below), all sessions ran autonomously, while an RFID module identified the animals and an algorithm controlled the individualized, performance-based progression in difficulty (see methods: Automated unsupervised training (AUT)).

### Initial training

The goal of the initial training was to instruct naïve animals to interact with the touchscreen to receive liquid reward (Arabic gum or marshmallow solution) from the device’s mouthpiece. The training was divided into three sequential steps: first, habituation to the device (supplementary video 1); second, forming a mouthpiece-reward association (supplementary video 2); and finally, a touch-to-drink phase (supplementary videos 3.1 and 3.2). All animals started exploring the device from the very first session. During the touch-to-drink phase, a mesh tunnel was introduced inside the device (Fig. 1a) to allow only one animal at a time inside the MXBI. Animals were encouraged to enter the tunnel and reach the touchscreen by placing small pieces of marshmallow or Arabic gum along the tunnel, on the mouthpiece, and on the screen. After the initial training was concluded (mean = 6 ± 1.4 sessions, Table 1), animals were introduced to the automated procedure that gradually brought them from naïve to experienced in discrimination- as well as detection-based psychoacoustic tasks.

### General engagement on the MXBI across all autonomous experiments

Individual animals engaged with the MXBI to different degrees, with the median number of trials per session varying between 31 and 223. On average, 116 trials per session (IQR = Q3-Q1 = 192) were performed (Fig. 1b, Table 1). While half of the animals had fewer than 10% of sessions without a single trial (median = 10.7%, IQR = 16.8%), two animals had more than 30% of sessions without performing a trial. On average, 100 sessions were conducted per animal, and 14 of those sessions had 0 trials (Fig. 1b). Controlling for session duration, we found no significant correlation between the total number of trials performed by each animal and session number (partial Pearson correlation controlling for session duration; adjusted r2 = 0.05, p-value: 0.1, N = 802; CI = −0.01, 0.13, Fig. 1c), suggesting that the level of engagement remained consistent across sessions. Qualitatively, animals tended to engage consistently throughout a session, as indicated by the distribution of trial onset times (Fig. 1e). Consequently, the median time point at which half of the trials were performed was 0.52 of the session’s duration (Fig. 1d).
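For illustration, such a partial correlation can be computed by residualizing both variables against the covariate (session duration) and correlating the residuals. The following is a minimal sketch with synthetic data, not the analysis code used for the published statistics:

```python
import numpy as np
from scipy import stats

def partial_pearson(x, y, covar):
    """Pearson correlation between x and y after regressing out covar.

    A minimal residual-based implementation of a partial correlation,
    as used here to relate trial counts to session number while
    controlling for session duration.
    """
    x, y, covar = map(np.asarray, (x, y, covar))
    # Residualize x and y against the covariate with a linear fit.
    res_x = x - np.polyval(np.polyfit(covar, x, 1), covar)
    res_y = y - np.polyval(np.polyfit(covar, y, 1), covar)
    return stats.pearsonr(res_x, res_y)

# Synthetic per-session data: session index, duration (min), trials performed.
rng = np.random.default_rng(0)
duration = rng.uniform(60, 240, 200)
session = np.arange(200)
n_trials = 0.5 * duration + rng.normal(0, 20, 200)  # engagement scales with duration only

r, p = partial_pearson(session, n_trials, duration)
print(f"partial r = {r:.3f}, p = {p:.3f}")
```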

### Automated unsupervised training (AUT)

An automated and unsupervised training protocol (AUT48) was implemented to train naïve marmosets at their own pace on the basics of a 2AC visually guided task. In order to identify the appropriate parameters upon which to build such an autonomous procedure, we first designed and tested multiple AUT versions with a subset of 9 animals (described in supplementary tables S1 and S2). The resulting final versions of the protocols (AUTs 8, 9, and 10) were then tested with 4 naïve animals (animals f, k, c, and d). The AUT procedure comprised 4 milestones—(1) decrease of the size of a visual stimulus (trigger) to be touched for reward, (2) change of position of a visual stimulus, (3) introduction of sound and delayed presentation of a visual target, (4) introduction of a second visual target as a distractor—that unfolded through a total of 48 dynamic steps (Fig. 2; Fig. S4C). During each session, the transitions between steps and milestones were based on the animal’s performance in a sliding window of 10 trials (hit rate of >80% to advance, <=20% to retreat; Fig. S4D). Figure 2c shows the hit rate across individual steps and milestones for the 4 naïve animals that only performed the final versions of the AUT. While the procedure was designed to encourage a smooth transition from step to step, certain steps (and thus milestones) required more trials to be accomplished. As a consequence, the hit rate calculated across animals varies substantially as a function of AUT step (Fig. 2c). Because animals learned at different paces and performed different numbers of trials, we quantified the progression through the AUT as a function of the percentage of total trials completed by each animal (Fig. 2d). This allowed us to visualize and compare learning progress across animals with inherently different working paces on a common frame of reference. Both the total number of trials (expressed by line thickness in Fig. 2d) needed to complete the AUT and the learning curves throughout the AUT vary substantially across animals (Fig. 2e) in the middle portion of the AUT, during which the stimulus changed position on the screen and an acoustic stimulus was introduced. Starting from the introduction of sound (milestone 3), we introduced timeouts (gray screen) to provide further feedback on wrong trials. Analysis of inter-trial intervals (ITIs) revealed shorter average ITIs after correct vs. wrong trials, suggesting an effect of timeouts on animal behavior (Fig. S3 and Table S4).
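The step-transition rule lends itself to a compact implementation. Below is a minimal sketch in Python under stated assumptions (e.g., that the evaluation window is cleared after each transition; this detail is not specified above):

```python
from collections import deque

WINDOW = 10          # sliding window of most recent trials
ADVANCE_AT = 0.8     # hit rate above this moves one step up
RETREAT_AT = 0.2     # hit rate at or below this moves one step down
N_STEPS = 48

def update_step(step, outcomes):
    """Move through the AUT based on the hit rate in the last 10 trials.

    `outcomes` is a deque of booleans (True = hit). Transitions mirror
    the rule described in the text: >80% to advance, <=20% to retreat.
    """
    if len(outcomes) < WINDOW:
        return step
    hit_rate = sum(outcomes) / len(outcomes)
    if hit_rate > ADVANCE_AT:
        outcomes.clear()                 # assumption: window reset on transition
        return min(step + 1, N_STEPS)
    if hit_rate <= RETREAT_AT:
        outcomes.clear()
        return max(step - 1, 1)
    return step

# Example: feed trial outcomes one by one.
outcomes = deque(maxlen=WINDOW)
step = 1
for hit in [True] * 10:
    outcomes.append(hit)
    step = update_step(step, outcomes)
print(step)  # 2 after ten consecutive hits
```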

### Audio-visual association

Next, we tested whether animals would generalize from the visually guided 2AC task introduced via the AUT procedure to an acoustically guided 2AC discrimination. In this experiment, animals were required to discriminate between a conspecific juvenile call (in the following referred to as voc) and a pure tone (simple train—sTr—chosen for individual animals from a range between 1.5 and 3.5 kHz) by selecting one of two visual stimuli permanently associated with each sound (supplementary video 4). 5 out of 9 animals successfully learned to discriminate between the sTr and the voc by selecting a geometric pattern or a conspecific’s face, respectively (Fig. 3a, c). The remaining 4 animals performed at chance level. To disentangle whether these animals were unable to solve the task or merely unwilling to perform above chance, we devised a 3-alternative-choice (3AC; upon sound presentation animals had to choose between 3 visual symbols, see methods) version of the same task (Fig. 3b, c) and tested 2 of these animals and 2 additional animals that had failed a different control condition (see supplementary material: Artificial Discrimination, Figs. S1, S2). In the 3AC task, all 4 animals performed significantly above chance (binomial test, post-hoc corrected for multiple comparisons; Table 2). Taken together, these results demonstrate that 9 out of 11 animals learned the audio-visual association. The remaining two animals that did not learn the 2AC discrimination were assigned to a different project and were not tested on the 3AC version. Additionally, 7 out of 9 animals that accomplished the discrimination task exhibited significantly longer reaction times in responding to the target in voc vs. sTr trials (Fig. 3d; Table 2), indicating that the animals behaved differently for different acoustic stimuli.
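For illustration, the comparison against chance amounts to a binomial test at the chance level of the respective task variant (1/2 for the 2AC, 1/3 for the 3AC). A sketch with hypothetical counts, assuming a one-sided alternative:

```python
from scipy.stats import binomtest

# Hypothetical counts: hits out of completed trials for one animal.
hits, n = 620, 1000

# 2AC: chance level 1/2; 3AC: chance level 1/3.
for label, chance in [("2AC", 1 / 2), ("3AC", 1 / 3)]:
    res = binomtest(hits, n, p=chance, alternative="greater")
    print(f"{label}: p = {res.pvalue:.2e}")
```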

### Psychoacoustic assessment of stimulus thresholds

Last, we addressed whether the MXBI can be employed for psychoacoustics. We chose to investigate hearing thresholds in a vocalization-detection task and towards this goal trained three animals (animals a, b, and d). In this experiment, animals that already knew the association between the acoustic and corresponding visual stimuli (see above: section “audio-visual association”) were now trained to associate the absence of the vocalization with the visual stimulus for the sTr (Fig. 5). The method of constant stimuli was employed by randomly selecting the sound level from a set of values between 0 and 80 dB SPL. The animals were required to report the presence or absence of the vocalization by touching the marmoset face (visual stimulus coupled with the voc) or the triangles (visual stimulus coupled with silence), respectively. Note that, due to the nature of the task, reward for stimuli in the range between 15 and 45 dB SPL was provided regardless of the animal’s choice. This was instrumental to prevent frustration, and thus disengagement from the task, when the acoustic stimulus was presented at amplitudes presumably close to the animal’s hearing threshold. In contrast, reward was dependent on the animals’ choice for stimuli at and above 60 dB SPL and at 0 dB SPL. The aim of this reward scheme, illustrated in Fig. 5a, was to encourage the animals to use the triangles and the marmoset face as yes/no options for the presence/absence of the acoustic stimulation. After two to three sessions with only high-amplitude stimuli (70 dB SPL) to stabilize the animals’ discrimination performance at 75% or above, test sessions commenced (three for animal d and four for animals a and b—Fig. 5b). The estimated hearing threshold for the vocalization stimulus (mean 37.3 dB SPL; 36 for animal a, 49 for animal b, 27 for animal d) was below the background noise of the facility of 60 dB SPL (measured inside the MXBI with a measurement microphone and amplifier, see methods; spectrograms of 3 representative 1-min-long recordings are shown in Fig. 5c).

## Discussion

In this study, we report results from four sequential experiments conducted with a stand-alone, touchscreen-based system—termed MXBI—tailored to perform training as well as psychophysical testing of common marmosets in auditory tasks. Animals involved in this study operated the device with a consistent level of engagement and for a prolonged time, directly in their own housing environment, without dietary restriction or social separation. All animals navigated an automated, unsupervised training procedure with ease and at their own pace, going from naïve to experienced in a visually guided discrimination task. In a subsequent audio-visual association experiment, nine out of eleven animals further acquired proficiency in an acoustically guided 2AC or 3AC discrimination task. In a generalization experiment, animals also quickly learned to flexibly discriminate three novel sounds they had never encountered before. Finally, we assessed the hearing thresholds of 3 animals with a spectro-temporally complex sound under potentially distracting auditory conditions. Our results indicate that marmoset monkeys: (1) consistently engage in various psychoacoustic experiments; (2) perform enough trials, and at sufficiently high performance, to allow psychometric evaluations; (3) work in a self-paced manner; (4) require neither dietary restriction nor separation from their peers; and (5) display a high degree of training flexibility.

### Home-cage training of naïve animals

Finally, because we trained tasks that are typical in cognitive neuroscience and animal cognition (namely, two- or three-alternative-choice and detection tasks), we believe that similar results would be achieved when training and testing in other sensory or cognitive domains.

### Training flexibility of marmosets

With the exception of two animals that were assigned to a different project and could not be trained further, all animals were successfully trained and tested in the audio-visual association experiments reported here. It is important to note that while two animals—a and b—readily transferred the knowledge acquired in the visually guided discrimination (automated unsupervised training) to quickly learn the acoustically guided discrimination (audio-visual association), the remaining seven animals required a substantial number of trials to reach the same level of proficiency. Three of the remaining seven animals also rapidly generalized the acquired discrimination to novel acoustic stimuli, at a rate comparable to animals a and b. Therefore, while the initial transition from the visual to the acoustic domain occurred at variable speed, all tested animals showed a comparable level of flexibility in generalizing to novel stimuli. Finally, all three animals tested in the psychoacoustic assessment quickly learned to reinterpret the discrimination as a detection task as soon as the reward scheme was adjusted. This allowed for a systematic psychoacoustic assessment of the sound intensity required to detect a vocalization under conditions with background noise.

Together, our results suggest a high degree of training flexibility of common marmosets in general and in the auditory modality in particular. Specifically, marmosets can: (1) transfer acquired rules from the visual to the acoustic domain; (2) rapidly learn to discriminate novel acoustic stimuli; and (3) flexibly reinterpret a discrimination task as a detection task.

### Cognitive hearing in marmosets

The success of the acoustic experiments presented in this study could partly be due to intrinsic properties of the stimuli employed, namely the naturalistic connotation, in both the visual and the acoustic domain, of the association between the juvenile vocalization and the juvenile marmoset face. This ‘natural association’ might then also support the association of the respective other stimuli. Our failed attempts, detailed in the supplementary material, indeed demonstrate the difficulty of having marmosets associate artificial stimuli across the auditory and visual modalities. The guiding strategy, which drew on successful concepts from the training of rodents and ferrets73,74, was that additional properties of the stimuli should match across modalities to support crossmodal association. For example, we presented auditory and visual stimuli together with a reward, or a timeout screen, in a temporally overlapping fashion, which leads to strong associations of stimulus components in rodents. Also, the sound was presented from the speaker on the side of the correct visual response indicator, which has been shown to be a strong cue for ferrets to guide choice towards the respective sound direction. In stark contrast, none of these approaches were successful in marmosets.

Results from the generalization experiment indicate that animals could quickly and flexibly learn to discriminate novel auditory stimuli. On the other hand, when two different types of vocalizations were contrasted, only two out of four animals performed above chance. Taken together, these results indicate that (1) vocalizations might carry a distinctive meaning for the animals that can be exploited to train common marmosets on various psychoacoustic tasks; and (2) a combination of naturalistic and artificial sounds is more likely to instruct marmosets to perform psychoacoustic tasks above chance level.

### Psychoacoustic assessment of marmosets in the home enclosure

Performing auditory psychophysics directly in the animals’ colony presents an acoustically challenging environment due to the uncontrolled background noise. The sound pressure level needed to detect a vocalization of a juvenile marmoset in a cage-based setting—37.3 dB SPL—was below the sound level of the facility’s background noise—~60 dB SPL. This might be explained by the adaptation of the auditory system to background sounds, which has been documented along the auditory pathway75,76,77,78,79 and has been suggested to optimize perception to the environment76,77. Additionally, the juvenile vocalization might have been less affected by the background noise (mostly driven by ventilation and marmoset vocalizations), as it minimally overlaps with the sound spectrum typically encountered in our colony of adult animals. Nonetheless, our data show that psychoacoustic training and assessment of NHPs is feasible within the animals’ home enclosure, similar to chair-based psychophysics29. While measurements of hearing thresholds in more classical, controlled settings are essential to understand auditory processing and sensitivity, the investigation of audition in more naturalistic environments could provide a closer estimate of real-world hearing capabilities. This might be particularly relevant for auditory processes and mechanisms that involve higher-level, top-down, cortical influences80,81,82 and are thus more susceptible to the influence of environmental contextual factors. Environmental sounds produced by conspecifics, for example, could affect how task-relevant sounds are encoded, processed, and interpreted by marmosets, which heavily rely on acoustic communication to cooperate, live together, and survive83.

### Towards a high-throughput pipeline for auditory neuroscience

In conclusion, all of these aspects are to be considered when establishing a successful high-throughput pipeline (across various fields of cognitive neuroscience), because together they ultimately add up to create automated high-throughput protocols for integrating advanced cognitive and behavioral assessments with physiological data recordings38.

### Autonomous devices as cognitive enrichment

Throughout our experiments, we found that animals consistently interacted with the device regardless of their performance. On certain occasions, animals performed thousands of trials at chance level, across several weeks, even though no social or fluid restrictions were applied. While this might seem counterintuitive, we argue that from the animals’ perspective our approach, coupled with the appeal of the liquid Arabic gum that the device delivered, represents a form of enrichment68,69,70. From a psychological standpoint, cognitive enrichment strategies exercise what is known as competence, namely the range of species-specific skills animals employ when faced with various challenges. This, in turn, promotes the sense of agency, described as the capacity of an individual to autonomously and freely act in its environment91. Promoting both competence and agency has been proposed to be crucial for the psychological wellbeing of captive animals because: (1) animals can better cope with and thus better tolerate captivity; and (2) animals can exercise species-specific cognitive abilities that otherwise have little opportunity to be expressed in captivity68,92.

### Study limitations and caveats

Several animals in the audio-visual association tasks performed at chance level for several thousand trials. Receiving a reward in half of the trials might be a successful strategy for animals that are not constrained, isolated, or fluid/food restricted. Under these conditions, it is unclear whether animals will attempt to maximize their reward—as has been reported in studies where food or fluid regimes are manipulated93,94, but see ref. 95—or are satisfied with chance performance. An animal that is satisfied performing at chance on a certain task will naturally not ‘learn’, even though it might cognitively be able to. In line with this interpretation, animals that performed at chance level in a 2AC version of an auditory discrimination task successfully performed the auditory discrimination when the overall chance level was reduced from 50% to 33% by employing a 3AC version.

Our data demonstrate the flexibility of auditory training using natural stimuli and lay the groundwork for further investigations, e.g., testing categorical perception of vocalizations by modulating the spectral content of the stimuli used. However, a caveat of our work is that our approaches were not successful in consistently training marmosets to discriminate artificial sounds (see supplementary materials). Among other potential explanations, we attribute this difficulty to the introduction of auditory cues relatively late in training. This might have biased animals to focus on the visual domain—which is considered the dominant sense in primates96,97—while ignoring other cues. Future studies should therefore explore alternative approaches to train arbitrary acoustic discriminations, potentially by introducing reliable auditory cues very early in training.

## Methods

All animal procedures of this study were approved by the responsible regional government office [Niedersächsisches Landesamt für Verbraucherschutz und Lebensmittelsicherheit (LAVES), Permit No. 18/2976], as well as an ethics committee of the German Primate Center (Permit No. E1-20_4_18), and were in accordance with all applicable German and European regulations on husbandry procedures and conditions. It has to be noted, however, that, according to European regulations as implemented in German animal protection law, the procedures described in this study can be considered environmental enrichment.

### Animals

A total of 14 adult common marmosets (Callithrix jacchus) of either sex (see Table 1) were involved in the experiments, carried out in the animal facility of the German Primate Center in Göttingen, Germany. Some of the animals were prepared for neurophysiological and cochlear implant experiments. Animals were pair-housed in wire mesh cages of 160 cm (H) × 65 cm (W) × 80 cm (D) under a 12-h light-dark cycle (06:00 to 18:00). Neighboring pairs were visually separated by opaque plastic dividers, while cloths hung from the ceiling prevented visual contact across the room. Experimental sessions occurred mostly in the afternoon and without controlled food/fluid regimes or social separation from the assigned partner. Liquid Arabic gum (Gummi Arabic Powder E414, 1:5 dissolved in water; Willy Becker GmbH) or dissolved marshmallows (marshmallow juice, 1:4 water dilution) was provided as a reward by the touchscreen device for every correct response in the various experiments. Marshmallow or Arabic gum pieces, stuck to the touchscreen, were used during the initial training phase.

### Apparatus

The marmoset experimental behavioral instrument (MXBI) is directly attached onto the animals’ cage and measures 44 cm (H) × 26 cm (W) × 28 cm (D). The device is internally divided into three sections (Fig. S4A). The electronics compartment on top contains: a Raspberry Pi 3B+ (raspberrypi.org); an RFID module with a serial interface (Euro I.D. LID 665 Board); two peristaltic pumps (Verderflex M025 OEM Pump), one on each side; a camera module (Raspberry Pi wide-angle camera module RB-Camera-WW, Joy-IT); and a power bank (Powerbank XT-20000QC3) through which 5 and 12 V (max 2.1 A) were provided to the whole system. In our setup and with our tests, the power banks last up to 8 h before the battery is depleted, allowing for continuous training or testing during most of the waking hours of the colony. We chose the Raspberry Pi single-board computer instead of the more commonly used tablet PCs88,98 for ease of interfacing with various external devices. To this end, the Raspberry Pi has various general-purpose input/output capabilities that allow integrating a wide variety of external hardware components, such as microcontrollers and touchscreens, via standard communication interfaces (SPI, I2C, I2S). Additionally, new MXBIs can simply be set up by copying the content of the SD card of an existing device onto the SD card of the new device. The behavioral chamber in the middle (internal dimensions: 30 cm (H) × 22 cm (W) × 24 cm (D)) hosts: a 10-inch touchscreen (Waveshare 10.1” HDMI LCD [H]; in later sessions a 10-inch infrared touchscreen, ObeyTec, was attached to the LCD screen); a set of two speakers (Visaton FR58, 8 Ω, 120–20,000 Hz) for binaural acoustic stimulation; a horizontal reward tube with a custom-made mouthpiece (placed at 3 cm from the screen but variable between 2 cm and 5 cm); the coil (or antenna) of the RFID; and a cylindrical mesh to prevent more than one animal from being inside the device at the same time (Fig. 1a). Finally, at the bottom of the device, space is left to accommodate a removable tray to collect and clean waste. Hinges on one side allow the device to be opened from the back if cleaning or troubleshooting is needed (Fig. 1a, left). The MXBI can be anchored to the front panel of the animal’s cage via custom-designed rails welded to the cage. A removable sliding door at the front panel allows animals to access the MXBI when attached. Python 3-based software (Python 3.5.3 with the following modules: tkinter 8.6, numpy 1.12.1, RPi.GPIO 0.6.5, pyaudio 0.2.11) running on the Raspberry Pi records all interaction events (screen touches, RFID tag readings, and video recordings), manages stimulus presentation (acoustic and visual), controls the reward system, and finally backs up the data automatically to a server via a wireless local network connection (Fig. S4B).
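As a minimal illustration of how such hardware is driven from Python, the sketch below pulses a GPIO pin to run one of the peristaltic pumps for a reward drop; the pin number and pulse duration are hypothetical and depend on the actual wiring:

```python
import time
import RPi.GPIO as GPIO

PUMP_LEFT = 17   # hypothetical BCM pin driving the left peristaltic pump
PULSE_S = 0.4    # hypothetical pulse length for one reward drop

GPIO.setmode(GPIO.BCM)
GPIO.setup(PUMP_LEFT, GPIO.OUT, initial=GPIO.LOW)

def deliver_reward(pin=PUMP_LEFT, pulse=PULSE_S):
    """Run the peristaltic pump briefly to deliver one drop of reward."""
    GPIO.output(pin, GPIO.HIGH)
    time.sleep(pulse)
    GPIO.output(pin, GPIO.LOW)

try:
    deliver_reward()
finally:
    GPIO.cleanup()  # release the pins when the session ends
```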

### Procedure

Behavioral training and testing sessions were started by connecting the Raspberry Pi and LCD display to power, which initiates booting. After booting, a custom script with a series of preconfigured commands was automatically executed to: (1) connect the device to a central server for automatic, recursive data logging as well as main database access; (2) start the local camera server for remote monitoring and video recordings (Fig. 2b); and (3) automatically launch the experimental task when needed. The fluid reward was manually loaded into each device and the pump was primed. The device was then attached to the cage, and the sliding door in the front panel was removed for the duration of the session. At the end of the session, the sliding door was placed back between the device and the cage so that the device could be detached, cleaned, and stored. The touchscreen surface and the behavioral compartment were thoroughly cleaned to remove odors and other traces. Hot water was used daily to clean the reward system to prevent dried reward from clogging the silicone tubes and mouthpiece. The entire process takes a single person around 35 min (15 for setting up and 20 for taking down) with six devices.
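A hedged sketch of such a startup routine is shown below; the server address, paths, and script names are hypothetical placeholders, as the actual boot script is not reproduced here:

```python
import subprocess
from pathlib import Path

# Hypothetical host and paths; the real MXBI boot script is configured per device.
SERVER = "user@labserver"
DATA_DIR = Path("/home/pi/mxbi/data")

def start_session(task_script="task_2ac.py"):
    # (1) Sync locally logged data to the central server (recursive backup).
    subprocess.run(["rsync", "-a", str(DATA_DIR) + "/",
                    f"{SERVER}:mxbi_backup/"], check=True)
    # (2) Start the camera server for remote monitoring, in the background.
    camera = subprocess.Popen(["python3", "/home/pi/mxbi/camera_server.py"])
    # (3) Launch the experimental task in the foreground.
    subprocess.run(["python3", f"/home/pi/mxbi/{task_script}"], check=True)
    camera.terminate()

if __name__ == "__main__":
    start_session()
```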

### Sessions

In order to operate the touchscreen at the opposite end from the MXBI’s entrance, the animals are required to pass through the opening in the front panel and the mesh cylinder (Fig. 1a). Crossing the antenna inside the mesh cylinder identifies animals via their RFID transponder (Trovan ID-100A), implanted between the animal’s shoulders for husbandry and identification reasons. Standing up inside the mesh places the animal’s head 3 cm above the mouthpiece and 4–5 cm away from the screen, directly in front of a 3.5 × 8.5 cm (H × W) cut-out in the mesh through which the touchscreen can be operated (Fig. 1a). Throughout each session, animals were regularly monitored by the experimenter from a remote location (approximately every 15 min). Additionally, videos of most sessions were recorded and stored. Fluid (either water or tea) was available ad libitum to the animals within their home cage but outside the MXBI. Solid food was provided to the majority of the animals before, after, and during the session, depending on husbandry and/or veterinary requirements.
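Serial RFID boards such as the one used here can be polled with a few lines of pyserial; the port, baud rate, and message framing below are illustrative assumptions and would need to match the Euro I.D. LID 665 configuration:

```python
import serial  # pyserial

# Hypothetical port and baud rate; consult the RFID board's documentation
# for the actual serial settings and message framing.
PORT, BAUD = "/dev/ttyS0", 9600

def read_tag(timeout=0.5):
    """Return the next transponder ID seen at the antenna, or None."""
    with serial.Serial(PORT, BAUD, timeout=timeout) as ser:
        line = ser.readline().strip()
        return line.decode("ascii", errors="ignore") or None

tag = read_tag()
if tag is not None:
    print(f"animal identified: {tag}")
```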

Throughout the experiments, animals never left their home cage. With the exception of animals a and b, which were pilot subjects and underwent a different initial procedure, all animals were first trained manually to operate the device at a basic level by means of positive reinforcement training and shaping techniques (see methods section: initial training). Afterwards, all animals were guided by an unsupervised algorithm through a series of preconfigured training steps (see section Automated unsupervised training (AUT)) to acquire basic proficiency in a standard 2AC discrimination task. The animals’ discrimination proficiency was then tested and refined in a subsequent experiment with an acoustically guided discrimination task (see section Audio-visual association). In a third experiment, the acoustic stimuli were replaced with novel stimuli and the animals’ ability to generalize was assessed (see section Generalization to novel stimuli). Last, we developed a psychoacoustic detection task to quantify the animals’ hearing thresholds (see section Psychoacoustic assessment). It is important to note that not all animals took part in all experiments, either because some animals were assigned to different projects or because they were not always available due to the requirements of different experiments.

### Initial training

The goal of the initial training procedure was to instruct naïve animals to use the touchscreen. To this end, the training was divided into three sequential steps: first, habituation to the device; second, forming a mouthpiece-reward association; and finally, a touch-to-drink phase. During the first two steps, no wire mesh cylinder was placed inside the MXBI. Unlike the remainder of the training, all initial training required the constant surveillance of the experimenter, who remotely accessed and controlled the screen of the device from another computer to shape the animal’s behavior while monitoring the video feed. The measured round-trip delay between observing the behavior and effectively delivering the reward was approximately 400 ms, plus the additional response latency of the observer. We believe that this delay is short enough for stimulus-response integration and association99. The initial training lasted on average 6 (±2) sessions and was routinely completed within 2 weeks. With the exception of animals a and b, all animals underwent the initial training.

#### Device Habituation

During this first step, the device was attached to the cage without the mesh cylinder to allow the animals to freely explore the behavioral chamber (see supplementary video 1) in sessions lasting on average 40 (±20) minutes. Before switching to the next step, the experimenter ensured that both animals showed interest in, and no aversion towards, the device (e.g., walking towards and not away from the device). The number of sessions needed to observe this behavior varied between 1 and 2.

#### Mouthpiece-reward association

Following habituation, drops of reward of variable magnitude (between 0.3 and 0.5 ml) were remotely triggered by an experimenter in order to direct the interest of the animals towards the mouthpiece (see supplementary video 2). Presumably due to the sudden occurrence of the pump sound during rewarding, the interest of some animals in the MXBI slightly decreased. To overcome this issue and to increase the likelihood of animals interacting with the device, a number of small marshmallow pieces were placed randomly on the mouthpiece. After all pieces were consumed and the animals had left the MXBI, the experimenter closed the sliding door to place new pieces. Once the animals showed interest in the mouthpiece in the absence of reward, the association was considered established and the next phase started. This step required between 1 and 5 sessions, with each session lasting 30–60 min.

#### Touch-to-drink phase

The aim of this step was to teach the animals to actively seek the reward by triggering the touchscreen. In order to achieve this behavior efficiently, and to make sure the animals used their hand and not, e.g., their mouth (which was observed in pilot experiments) to touch the screen, a mesh cylinder was placed inside the device. In turn, this restricted access to one animal at a time and improved the efficiency of the RFID identification. Additionally, small pieces of marshmallow were placed on the screen within the triggering area to encourage the animals to retrieve the marshmallow pieces and thereby touch the screen. When all pieces were consumed and the animal had left the MXBI, the experimenter closed the sliding door to place new pieces on the screen and resumed the session. While the marshmallow pieces were collected, fluid reward was provided, triggered either remotely by the experimenter or by the animals themselves touching the stimulus on the screen. This procedure successfully allowed all animals to switch from reaching to retrieve the marshmallows to simply touching the screen to trigger fluid reward (see supplementary videos 3.1 and 3.2). After 5–10 consecutive reaching movements towards the screen in the absence of marshmallows, the behavior was considered acquired and the initial training concluded. Between 1 and 4 sessions (each lasting on average 60 ± 10 min) were necessary to finish the touch-to-drink phase.

### Audio-visual association

The audio-visual association experiment started when an animal reached step 50 of the autonomous training (Fig. 3). In contrast to the AUT, no visual cue could be used to correctly identify the target of a given trial. Here, animals had to rely solely on auditory cues to obtain reward above chance level. In this experiment, no AUT algorithm was employed, and the trial structure and sequence therefore remained unchanged throughout. This experiment consisted of a two-alternative choice (2AC) task, where only one of the two available options was correct, and the animal’s ability to distinguish the options was assessed from its relative frequency of choice. We implemented two variants of this task, a 2AC and a 3AC, plus a control condition (see supplementary material). Both variants employed the same stimuli as the autonomous training; in the 3AC variant, an added visual distractor had no associated sound and was never presented as a target, but always as a distractor. While touching the target of a given trial was rewarded, touching a distractor resulted in a 5-s timeout (in later sessions), indicated by a gray screen, during which no new trial could be initiated and further touches were ignored. After a correct response, in contrast, a new trial could be started 0.8–2.5 s after reward delivery. A detailed timeline of an example trial from this task is shown in Fig. S4E, and a video of an animal performing a few trials of the 2AC variant is available in the supplementary materials (Video 4). Animals that did not perform above chance in the 2AC variant were assigned to the 3AC variant, which lowered the chance of obtaining a reward randomly on any given trial from 50% to 33%. Two animals that performed at chance level in the 2AC were assigned to a different experiment and could not be tested on the 3AC.
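The trial flow just described can be summarized in the following sketch; the five callables stand in for MXBI routines (sound playback, drawing the visual options, touch detection, reward delivery, and the gray timeout screen) and are hypothetical placeholders:

```python
import random
import time

TIMEOUT_S = 5.0            # gray-screen timeout after touching a distractor
ITI_RANGE_S = (0.8, 2.5)   # delay before the next trial after a correct response

def run_trial(play_sound, show_options, get_touch, deliver_reward, show_gray):
    """One acoustically guided 2AC trial; callables are placeholders."""
    stimulus = random.choice(["voc", "sTr"])     # the sound defines the target
    play_sound(stimulus)
    target, distractor = show_options(stimulus)  # visual symbols tied to each sound
    choice = get_touch()
    if choice == target:
        deliver_reward()
        time.sleep(random.uniform(*ITI_RANGE_S))
        return True
    show_gray()              # touches during the timeout are ignored
    time.sleep(TIMEOUT_S)
    return False

# Minimal dry-run with dummy stand-ins for the device routines:
if __name__ == "__main__":
    correct = run_trial(
        play_sound=lambda s: None,
        show_options=lambda s: ("face" if s == "voc" else "pattern", "other"),
        get_touch=lambda: "face",
        deliver_reward=lambda: None,
        show_gray=lambda: None,
    )
    print("correct" if correct else "wrong")
```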

### Psychoacoustic assessment

In order to assess the animals’ hearing thresholds, we devised a simple detection task based on the discrimination task used before. In this task, animals were trained to choose the gray triangles (the previous visual stimulus of the sTr) to report the absence of the vocalization (i.e., silence). Once the behavior was stable (after two sessions), and based on the measured background noise of the facility (60 ± 5 dB SPL, see below and Fig. 5c), we set the sound intensities for the vocalization to 0, 15, 30, 45, 60, 70, and 80 dB SPL. Given that some of these intensities were below the background noise of the facility, all trials with intensities between 15 and 45 dB SPL were rewarded regardless of the choice of the animal (Fig. 5a). Moreover, vocalization trials at 0 dB SPL were rewarded if the triangles were selected (the visual stimulus for silence). This was instrumental, first, to account for both types of trials (silence and vocalization) presented at 0 dB SPL and, second, to effectively establish the task as a detection rather than a discrimination task. Finally, all sessions were performed in the afternoon, from 1 pm to 4:30 pm, when the colony’s background noise was lowest, as feeding and personnel activity occurred mostly in the morning.
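A compact sketch of this reward contingency, assuming the level set and boundaries given above (names are illustrative):

```python
import random

LEVELS_DB = [0, 15, 30, 45, 60, 70, 80]   # method of constant stimuli
ALWAYS_REWARDED = range(15, 46)            # near-threshold levels: reward either choice

def draw_level():
    """Randomly select the vocalization level for the next trial."""
    return random.choice(LEVELS_DB)

def rewarded(level_db, chose_face):
    """Apply the reward scheme of Fig. 5a.

    `chose_face` is True when the animal touched the marmoset face
    ('vocalization present') and False for the triangles ('silence').
    """
    if level_db in ALWAYS_REWARDED:
        return True                 # 15-45 dB SPL: rewarded regardless of choice
    if level_db == 0:
        return not chose_face       # 0 dB SPL: only 'silence' is correct
    return chose_face               # >=60 dB SPL: only 'present' is correct
```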

In order to measure the background noise level of the facility inside the MXBI, a microphone (Bruel & Kjaer Type 4966, 1/2-inch) was placed at the marmosets' ear level and a measuring amplifier (Bruel & Kjaer Measuring Amplifier Type 2610) visualized the sound pressure level. The sound output of the two devices used to gather hearing thresholds (one for animals a and b, one for animal d) was further calibrated inside an insulated soundproof chamber. An amplifier (HiFiBerry Amp2) coupled to the Raspberry Pi produced the audio signal, while a measuring amplifier (Bruel & Kjaer Measuring Amplifier Type 2610) and a microphone (Bruel & Kjaer Type 4966, 1/2-inch), placed at the marmoset ear level and pointing towards one speaker, acquired the sound output. Additionally, an oscilloscope (Rigol DS1000Z), attached to the output lines of the amplifier, measured the voltage. We were able to corroborate the step size (0.5 dB SPL) of the amplifier by sampling 5 different frequencies (0.875 kHz, 1.75 kHz, 3.5 kHz, 7 kHz, 14 kHz) at 10 different sound pressure levels (100 dB, 95 dB, 90 dB, 85 dB, 80 dB, 75.5 dB, 70 dB, 65.5 dB, 60 dB, 50 dB). We found a stable and accurate correspondence between the values provided to the amplifier, the sound pressure levels measured by the measuring amplifier, and the voltage values measured by the oscilloscope.
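The underlying arithmetic relates a level difference to an amplitude (voltage) ratio via 20·log10, which is what makes the three measurements mutually consistent; a two-line check:

```python
import math

def db_difference(v1, v2):
    """Level difference implied by two amplitude (voltage) readings."""
    return 20 * math.log10(v1 / v2)

# Halving the output voltage corresponds to ~-6.02 dB, and a 0.5 dB
# amplifier step corresponds to a voltage ratio of 10**(0.5/20) ~ 1.059.
print(db_difference(1.0, 0.5))   # ~6.02
print(10 ** (0.5 / 20))          # ~1.059
```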

### Data treatment and Statistics

Data acquisition, processing, analysis, and statistical testing were performed in Python 3.5.3 and 3.9. Statistics and significance tests for Figs. 1–4 were calculated via the packages scipy102,103 and numpy104, co-installed upon installation of the package seaborn. An alpha level of less than 0.05 was considered significant. Data formatting and visualization for the same figures, as well as for Table 1, were achieved with the packages pandas105 and seaborn (seaborn.pydata.org). The hit rate’s significant difference from chance (Fig. 3c) was assessed with a binomial test, while reaction time differences between the two presented auditory stimuli (Fig. 3d) were tested for significance with a Kruskal-Wallis test by ranks. Both tests were adjusted post-hoc for multiple comparisons with Bonferroni correction (corrected alpha = 0.0019, from the python function statsmodels.stats.multitest.multipletests). In Fig. 2d and Fig. 3a, b, the variable “percentage of trials” on the abscissa was used to achieve a shared, standardized axis on which multiple animals could be compared and visualized against each other, irrespective of the total number of trials each individual performed. The assumption behind this choice was that learning occurs through similar mechanisms across individuals but unfolds over a different number of trials, depending on each animal’s engagement level. The resulting standardization attenuated the inter-individual variability between animals for parameters such as steps of the AUT (Fig. 2c) and hit rate (Fig. 3a, b, and 4b).
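For illustration, the Bonferroni adjustment can be reproduced with statsmodels as sketched below (the p-values are hypothetical):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical per-animal p-values from the binomial tests.
pvals = [1e-6, 0.003, 0.04, 0.2, 0.5]

reject, p_adj, _, alpha_bonf = multipletests(pvals, alpha=0.05,
                                             method="bonferroni")
print(reject)      # which tests survive the correction
print(alpha_bonf)  # the corrected per-test alpha
```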

Psychometric function estimation was achieved with the python module psignifit106 set to fit a cumulative normal sigmoid function, with all parameters free and with 95% confidence intervals. The resulting function can be expressed as follows:

$$\psi(x;\, m, w, \lambda, \gamma) = \gamma + (1 - \lambda - \gamma)\, S(x;\, m, w)$$
(1)

where m represents the threshold (the stimulus level at which S reaches 0.5), w the width (the difference between the stimulus levels at which S reaches 0.05 and 0.95), and λ and γ the lapse and guess rates, which determine the upper and lower asymptotes, respectively (Eq. (1) in ref. 106).
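For illustration, Eq. (1) can be fit by ordinary least squares with scipy (note that psignifit itself uses a Bayesian procedure); the detection data below are hypothetical, and the width is mapped to the 0.05–0.95 convention described above:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(x, m, w, lam, gam):
    """Eq. (1) with a cumulative normal core S(x; m, w).

    The width w is mapped so that S rises from 0.05 to 0.95 over w,
    mirroring the parameterization described in the text.
    """
    c = norm.ppf(0.95) - norm.ppf(0.05)   # ~3.29
    S = norm.cdf(x, loc=m, scale=w / c)
    return gam + (1 - lam - gam) * S

# Hypothetical detection data: level (dB SPL) and fraction of 'present' responses.
levels = np.array([0, 15, 30, 45, 60, 70, 80], dtype=float)
p_yes = np.array([0.08, 0.15, 0.40, 0.70, 0.90, 0.95, 0.97])

popt, _ = curve_fit(psychometric, levels, p_yes,
                    p0=[40, 20, 0.05, 0.05],
                    bounds=([0, 1, 0, 0], [80, 80, 0.5, 0.5]))
m, w, lam, gam = popt
print(f"threshold = {m:.1f} dB SPL")
```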

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.