Auditory figure-ground analysis in rostral belt and parabelt of the macaque monkey

Segregating the key features of the natural world within crowded visual or sound scenes is a critical aspect of everyday perception. The neurobiological bases for auditory figure-ground segregation are poorly understood. We demonstrate that macaques perceive an acoustic figure-ground stimulus with comparable performance to humans using a neural system that involves high-level auditory cortex, localised to the rostral belt and parabelt.

Figure-ground analysis is critical to making sense of the natural world. This is a particularly challenging problem in the auditory system where different sound objects emanating from the same spatial location have to be dynamically decoded using spectro-temporal features that are difficult to segregate from noisy backgrounds 1,2. We assessed the perception and neural representation of auditory figure-ground stimuli in the macaque. Macaques have similar audiograms 3, detection of tones in quiet 4, detection of tones in noise 5 and similar pitch perception 6 to humans. Macaques also show homologous organisation of the auditory cortex that allows comparison with that in humans 7,8. The aims of the study were twofold: to establish whether macaques can carry out acoustic figure-ground segregation like humans and to define the areal organisation for analysis in auditory cortex.
We used a stimulus in which a figure emerges from a noisy background 9 (Fig. 1). The paradigm captures a high-level acoustic process that requires grouping over frequency and time in complex sounds devoid of species-specific meaning, such as speech. The stochastic figure-ground (SFG) stimuli consist of multiple randomly generated frequency elements where a foreground object, arising from the grouping of different frequency elements over time, can only occur if coherently repeated elements are present in a number of frequency channels. A series of human behavioural and modelling experiments is consistent with a grouping mechanism based on temporal coherence between the frequencies comprising the figure 9 . Human imaging experiments using fMRI 10 and MEG 11 demonstrate activity in non-primary auditory cortex corresponding to figures that are perceived, but whether the same would hold behaviourally and neurobiologically in an animal model is unknown.
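As a rough illustration of the stimulus logic described above (not the authors' stimulus code), the sketch below builds a sequence of random chords in which a fixed set of frequencies repeats from a chosen onset. The function name `make_sfg_chords`, the frequency pool, the number of chords and the range of ground elements per chord are hypothetical choices made for illustration; only the idea of coherently repeated elements across several frequency channels is taken from the text.

```python
import random


def make_sfg_chords(n_chords=20, figure_onset=10, coherence=4,
                    freq_pool=None, ground_size=(5, 15), seed=0):
    """Return (figure, chords); each chord is a sorted list of frequencies in Hz.

    All default values are illustrative assumptions, not the published parameters.
    """
    rng = random.Random(seed)
    if freq_pool is None:
        # Hypothetical pool: 60 log-spaced frequencies starting at 200 Hz.
        freq_pool = [round(200.0 * 2 ** (i / 12), 1) for i in range(60)]
    # Frequencies that repeat coherently once the figure has started.
    figure = rng.sample(freq_pool, coherence)
    chords = []
    for i in range(n_chords):
        # Ground: a random number of randomly chosen elements in every chord.
        n_ground = rng.randint(*ground_size)
        chord = set(rng.sample(freq_pool, n_ground))
        if i >= figure_onset:
            # Figure: the same elements are added to every chord after onset.
            chord.update(figure)
        chords.append(sorted(chord))
    return figure, chords


figure, chords = make_sfg_chords(coherence=4)
```

Setting `coherence=0` in this sketch yields a ground-only sequence, which corresponds to the control condition in spirit.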

Results
Behavioural experiments tested if macaques can segregate such complex auditory figures. Two monkeys were trained to perform an active figure detection task. Proficiency on the task is indicated by the mean hit rates to the most salient condition with figures comprising 12 coherent frequency elements (M2: 0.86, M3: 0.92). The reaction time (RT) distributions show a clear peak for both subjects (Fig. 2a, (Fig. 2c) increased as a function of figure coherence. False alarm rates were constant across coherence conditions, suggesting that monkeys could competently withhold responses to stimuli without a figure. D-prime values mirror the trend of hit rates with increasing values for more salient figures (Fig. 2d). The effect of figure coherence is significant (Repeated measures ANOVA: F(4, 200) = 266.67, p = 5.84e−79), indicating that the number of coherent elements has an impact on the detection performance throughout sessions. Furthermore, we found decreasing reaction times and response variability with increasing saliency of the figures (Fig. 2e, 9 . Overall, the behavioural performance indicates that macaques can perceive auditory figures in noisy acoustic scenes and that behavioural performance increases with signal to noise ratio, as is the case for human listeners 9.
We acquired fMRI data from two naïve monkeys during passive exposure to the SFG stimuli. Functional imaging data were recorded before the same animals were trained in the active figure detection task.  Table 1) revealed significant results at the convexity of the superior temporal gyrus and at the rostral parts of the superior temporal plane, demonstrating bilateral involvement of higher-level auditory regions rostro-laterally to the auditory core. These results are in line with previous human studies, showing cortical responses in non-primary auditory cortex 10,11. In order to assign a functional area to the peak BOLD response, we illustrate the Figure vs. Control contrast along with probabilistic functional maps of auditory cortical fields, which were derived from tonotopic gradients of six macaques. This comparison reveals that the main activation during a perceived figure is located in the rostral parabelt (RPB) and the rostro-lateral belt (RTL) for both monkeys (Fig. 4). The significant clusters also extend to the rostral superior temporal gyrus (STGr), the rostral core (RT), the anterolateral belt (AL) and the caudal parabelt (CPB). In addition, we find that T-values ramp up towards the rostro-lateral parts of the auditory field. Thus, we conclude that figure-ground processing happens in rostral parts of the auditory ventral stream.

[Figure axis labels: Time [s]; Hit and false alarm rates]

Discussion
This work establishes the ability of macaques to carry out dynamic figure-ground segregation with remarkably similar psychometric functions to humans 9. Neural correlates of auditory scene analysis have previously been found in primary auditory cortex for two-tone paradigms 12,13. For complex figure-ground segregation, however, we demonstrate a system involving circumscribed parts of the rostro-lateral belt and parabelt cortex, at a high level in the cortical hierarchy in macaques 14–16. In line with our results, previous evidence suggests that the most anterior regions of the ventral processing stream represent a complete acoustic signature of auditory objects 17.
Although we demonstrate that macaques are able to carry out the behavioural task, there is a difference in the sensitivity to figures between humans and macaques. Whilst humans are reliably able to segregate figures with a coherence level of two elements 9, our two subjects performed worse even for figures with a higher coherence level and longer duration. However, the overall trend of detection performance as a function of coherence was the same.
We have argued that the detection of the SFG stimulus requires a mechanism that can integrate across different frequency bands in order to detect temporal coherence between these 9. A possible mechanism of figure-ground analysis is based on single neurons in high-level cortex with inputs from combinations of units in primary cortex with narrowband or multi-peaked tuning: neuronal responses to sounds with harmonically related components are described in primate core 18 and belt areas 19. However, a neuronal mechanism for the present results requires neurons to respond to multiple frequencies that do not have a simple mathematical relationship to each other. One imaging study suggests harmonic and non-harmonic multipeak tuning in large parts of the ventral auditory stream 20. fMRI BOLD, however, does not allow disambiguation of such a neuronal mechanism from a population code. The necessary broadband tuning for such units is well described in the belt cortex 19,21. Broadband responses in the parabelt are likely given they occur at a high level in the auditory hierarchy 14–16 but receptive fields of parabelt neurons have not been extensively characterised 22. From first principles such neurons might be expected at a high level in the auditory hierarchy: we predict the existence of such units in the rostro-lateral belt and parabelt. In a later stage, the grouping of repeated elements and detection of the figure could cause top-down modulation in upstream brain areas like A1 in the form of neural entrainment 23.
Previous studies have found an involvement of the intraparietal sulcus (IPS) in stream segregation 24 and figure-ground processing 10,11. Contrary to these studies, we were not able to show a BOLD response modulation in the IPS, which could be due to the cranial implants of the animals that can lead to signal dropouts. Alternatively, species differences in figure-ground processing cannot be ruled out.
In summary, our data suggest that a fundamental form of figure-ground analysis is perceived both by macaques and humans and relies on non-primary auditory cortex in both species. Our approach allows us to investigate grouping over frequency-time space using stimuli that are not species-specific, but which require grouping mechanisms that are relevant to the extraction of species-relevant sounds from noise. This work predicts specific neuronal responses to figure-ground analysis in rostro-lateral auditory areas that can now be investigated systematically in the macaque in a way that is not possible in humans.

Methods
All procedures performed in this study were approved by the UK Home Office
Animals. Three adult macaques (Macaca mulatta, one female) were used in this study. Both males contributed to the imaging data set. One male and one female monkey were included in the behavioural tests (see Table 2). M1 was not available for figure detection training. M3 does not have a cranial implant, which is a necessary prerequisite for awake fMRI scans. Animals were kept under fluid-controlled conditions. Fluid control was within ranges which do not negatively affect the animals' physiological or psychological welfare 25.
Behavioural training. All subjects were naïve to the behavioural detection task. By means of positive reinforcement, we established a bar release-reward relationship. Since subjects needed to be trained in a detection task, a fixed target stimulus was paired via operant conditioning. This target was a plain figure (duration: 1000 ms, coherence: 10) without any distractor elements. After monkeys responded proficiently to the sound, we introduced the SFG background tones. The signal to noise ratio was gradually decreased by increasing the sound level of the ground signal. Subsequent to this introductory phase, the ground sound intensity was set to a fixed level (65 dB) whereas the figure sound level was incremented to give subjects an extra cue to the target. These sound level increments were then gradually decreased until subjects could detect the figures without any intensity cues. As a last step, figure coherence was manipulated in order to assess the animals' performance. The entire training period took around 8 months of daily training.

Experimental design: Behavioural paradigm.
To make inferences about the streaming ability of macaques in crowded acoustic scenes, we designed a figure detection task as a Go/No-Go paradigm. For behavioural testing, macaques sat in a primate chair (Crist Instruments) and initiated trials by touching a touch bar placed in front of them. Two free-field speakers (Yamaha Monitor Speaker MS101 II), located at approximately 45 degrees to the left and right of the animal (distance: ~65 cm from ear), delivered the stimuli at ~65 dB SPL via an Edirol UA-4FX external USB-Soundcard. The experiment was controlled with a custom-made MATLAB (2015b) script, including PsychToolbox 3.0 functions through a LabJack U3-HV interface.
Before each session, a new set of stimuli was created (n = 1000). For each trial, a stimulus file was randomly drawn from this pool of stimuli. If the monkey responded correctly during the figure presentation period ('Hit'), a fluid reward was administered through a gravity-based reward system. The amount of reward was dependent on the reaction time of the respective trial. Faster responses led to higher reward volumes. Inter-trial intervals (ITI) were set to 1 s. If the monkeys failed to respond to a target, no reward was administered and a 3 s penalty time-out was imposed in addition to the ITI. Stimuli were terminated as soon as the subjects responded or after the target sound ended. Trials with stimuli containing a figure comprised 60% of all trials. The remaining 40% were catch trials (control condition) in which only the ground stimulus was presented. In these catch trials, subjects needed to hold the touch bar for the entire length of the stimulus (3 s). In case of a correct rejection of the trial (bar not released), a fixed reward was given. The amount of juice earned on those trials was greater than during detection trials, since monkeys had to hold the bar up to two seconds longer. Similar to the miss of a figure, false alarms resulted in no reward and a 3 s penalty time-out in addition to the ITI. Each behavioural session lasted around two hours (average number of trials per session: M2 = 1000, M3 = 873). Data were acquired, saved and analysed using MATLAB.
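The trial contingencies described above can be summarised in a short sketch. This is an illustration under simplifying assumptions, not the authors' MATLAB control script: the function names are hypothetical, and `released_bar` is assumed to mean a release during the figure presentation period on figure trials, or at any time during the stimulus on catch trials. Reward volumes are not modelled because the text does not give them.

```python
ITI_S = 1.0      # inter-trial interval stated in the text
PENALTY_S = 3.0  # time-out after a miss or a false alarm, stated in the text


def classify_trial(has_figure, released_bar):
    """Map a Go/No-Go trial onto the four signal detection outcomes."""
    if has_figure:
        return "hit" if released_bar else "miss"
    return "false_alarm" if released_bar else "correct_rejection"


def delay_before_next_trial(outcome):
    """ITI plus the penalty time-out that follows misses and false alarms."""
    penalty = PENALTY_S if outcome in ("miss", "false_alarm") else 0.0
    return ITI_S + penalty


def is_rewarded(outcome):
    """Hits and correct rejections are rewarded; errors are not."""
    return outcome in ("hit", "correct_rejection")
```

Under these assumptions a miss or a false alarm delays the next trial by 4 s in total, whereas rewarded trials are followed by the 1 s ITI only.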
Experimental design: Imaging paradigm. For functional imaging scans, macaques were transferred into a custom-made, MRI-compatible scanner chair. During the session, awake animals were head restrained by means of an implanted head post. The details of the surgical procedures are described in Thiele et al. (2006) 26   VAS) equipped with a Bruker BGA-38S gradient system with an inner-bore diameter of 38 cm (BrukerBioSpin GmbH, Ettlingen, Germany). One volume transmit coil and two 4-channel receiver coils were used. A sparse imaging paradigm was applied to avoid the interfering effect of the high-intensity noise generated by the MRI scanner. Shimming was performed with the MAPSHIM algorithm 27, which measures B0 field inhomogeneity to apply first and second order corrections to it. The applied sequence was a GE-EPI with 2x GRAPPA acceleration with the following parameters: TR = 10 s, TA = 2011 ms, TE = 21 ms, flip angle (FA) of 90°, receiver spectral bandwidth of 200 kHz, field of view (FOV) of 9.6 × 9.6 cm², with an acquisition matrix of 96 × 96, an in-plane resolution and slice thickness of 1.2 mm and 32 slices. The TR duration was sufficient to avoid recording the BOLD response to the gradient noise of the previous scan. Per scan, 360 volumes were acquired (of which 90 were baseline/silence volumes). In total, 135 stimuli per condition (control, i.e. ground only, or figure) were created and presented in a pseudo-randomized manner. The same stimuli were used for all scans and all subjects. Sounds were presented using Cortex software (Salk Institute) at an RMS sound pressure level (SPL) of 75 dB via custom adapted electrostatic headphones based on a Nordic NeuroLab system (NordicNeuroLab, Bergen, Norway). These headphones feature a flat frequency response up to 16 kHz and are free from harmonic distortion at the applied SPL. SPL was verified using an MR-compatible condenser microphone B&K Type 4189 (Bruel&Kjaer, Naerum, Denmark) connected by an extension cable to the sound level meter Type 2260 (same company). A structural scan was acquired at the end of each functional scanning session. Anatomical MR images are T1-weighted (T1w) images, consisting of a 2D magnetization-prepared rapid gradient-echo (MPRAGE) sequence with a 180° preparation pulse, TR = 2000 ms, TE = 3.74 ms, TI = 750 ms, 30° flip angle, receiver bandwidth = 50 kHz, an in-plane resolution of 0.67 × 0.67 mm² with a slice thickness of 0.6 mm. Structural scans covered the same field of view as the functional scans.
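The timing of the sparse design follows from the stated sequence parameters, as the arithmetic sketch below shows. The variable names are illustrative, and the reading that each non-baseline volume follows exactly one stimulus is an assumption; it is suggested by the volume counts but not stated explicitly in the text.

```python
TR_S = 10.0                  # repetition time from the text
TA_S = 2.011                 # acquisition time from the text (2011 ms)
VOLUMES_PER_SCAN = 360       # from the text
BASELINE_VOLUMES = 90        # silence volumes, from the text
STIMULI_PER_CONDITION = 135  # from the text
N_CONDITIONS = 2             # figure and control (ground only)

# Scanner-quiet interval in each TR, available for sound presentation.
silent_gap_s = TR_S - TA_S

# Total duration of one scan of 360 volumes.
scan_duration_min = VOLUMES_PER_SCAN * TR_S / 60

# Volumes that are not baseline, compared with the number of stimuli.
sound_volumes = VOLUMES_PER_SCAN - BASELINE_VOLUMES
total_stimuli = STIMULI_PER_CONDITION * N_CONDITIONS
```

With these values the quiet interval is roughly 8 s per TR, one scan lasts 60 minutes, and the 270 non-baseline volumes match the 2 × 135 stimuli.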
Statistical analysis: Behaviour. For data analysis, signal detection theory was applied. In total, data from 52 behavioural sessions were included in this analysis (M2: 23, M3: 29). Performance was evaluated based on hit and false alarm rates, which are the basis for the calculation of d′, a measure of discriminability between responses to different stimuli. Values of d′ were computed using the formula below:

$$d' = Z(\text{hit rate}) - Z(\text{false alarm rate})$$

where Z is the z-transform of the hit or false alarm rate, respectively, defined as the inverse of the cumulative Gaussian distribution (MATLAB: norminv). Since d′ values take hit rates as well as false alarm rates into account, they provide a measure of all possible responses to both detection and catch trials. Mean d′ values across all sessions for each coherence condition were the basis for the assessment of the behavioural performance. Trials with responses below 0.4 s after stimulus onset were excluded from the analysis (M2: 1.67%, M3: 1.38%). Reaction times were corrected for sound output latency of the operating system. 95% confidence intervals were calculated via bootstrapping (MATLAB: bootci, 5000 repetitions). Data were fitted with a second-order polynomial function. For statistical testing, data of both subjects were pooled as we were interested in the overall trend of the responses. Effects of coherence were tested across sessions with a repeated measures ANOVA for d′ values, mean reaction times and response variability, respectively. Normal distribution was evaluated with a one-sample Kolmogorov-Smirnov test. A Mauchly sphericity test assessed if the assumption of sphericity was violated. If that was the case, a conservative lower bound correction was applied to the degrees of freedom and p-values of the repeated measures ANOVA.
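For readers who want to reproduce the sensitivity measure, the sketch below computes d′ as the difference between the z-transformed hit rate and the z-transformed false alarm rate, using the inverse cumulative Gaussian from the Python standard library in place of MATLAB's norminv. The function name and the clipping constant `eps` are assumptions: the text does not say how rates of exactly 0 or 1 were handled, and clipping is just one common workaround.

```python
from statistics import NormalDist


def d_prime(hit_rate, fa_rate, eps=1e-3):
    """d' = Z(hit rate) - Z(false alarm rate).

    Z is the inverse of the standard normal CDF. Rates are clipped to
    [eps, 1 - eps] so that the inverse CDF stays finite; this clipping
    is an assumption and is not described in the text.
    """
    z = NormalDist().inv_cdf
    hit = min(max(hit_rate, eps), 1 - eps)
    fa = min(max(fa_rate, eps), 1 - eps)
    return z(hit) - z(fa)
```

Equal hit and false alarm rates give d′ = 0 (no discriminability), and d′ grows as the hit rate rises while the false alarm rate stays fixed, which is the pattern reported across coherence conditions.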
Statistical analysis: Imaging. MR images were first converted from the scanner's native file format into a common MINC file format using the Perl script pvconv.pl (http://pvconv.sourceforge.net/). From MINC format, they were converted to NIfTI file format using MINC tools. Imaging data were then analysed with SPM12 (http://www.fil.ion.ucl.ac.uk/spm/software/spm12/, Wellcome Trust Centre for Neuroimaging).
In the pre-processing steps, the volumes within a session were realigned and resliced to incorporate the rigid body motion compensation. Next, image volumes from multiple sessions were combined by realigning all volumes to the first volume of the first session. Then, the data were spatially smoothed using a Gaussian kernel with full-width-at-half-maximum (FWHM) of 3 mm. A standard SPM regression model was used to partition components of the BOLD response at each voxel. The two conditions, figure and control, were modelled as effects of interest and convolved with a hemodynamic boxcar response function. Next, the time series was high-pass filtered with a cut-off of 1/120 Hz to remove low-frequency variations in the BOLD signal. Finally, the data were adjusted for global signal fluctuations (global scaling) to account for differences in system responses across multiple sessions. A general linear model (GLM) analysis 28 of the combined sessions included the motion parameters, the voxel-wise response estimates and the regression coefficients. The t-values for two contrasts (Figure vs Control, Sound vs Silence) were calculated. We performed single subject inference in these two subjects. Data were thresholded at p < 0.001 (uncorrected for multiple comparisons across the brain). Results from monkey M2 survived p < 0.05 (family-wise error corrected across the brain) and showed a pattern similar to that presented here. Data were coregistered and displayed in standard space (D99) 29.
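Two of the pre-processing settings above can be put on a more concrete scale with a little arithmetic. The sketch below converts the 3 mm FWHM of the smoothing kernel into the standard deviation of the equivalent Gaussian, and expresses the 1/120 Hz high-pass cut-off in units of the 10 s TR given in the imaging paradigm. The variable names are illustrative and the block is not part of the SPM12 pipeline itself.

```python
import math

FWHM_MM = 3.0        # smoothing kernel width from the text
CUTOFF_HZ = 1 / 120  # high-pass cut-off from the text
TR_S = 10.0          # repetition time from the imaging paradigm

# For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma.
sigma_mm = FWHM_MM / (2 * math.sqrt(2 * math.log(2)))

# Period of the slowest retained fluctuation, in seconds and in volumes.
cutoff_period_s = 1 / CUTOFF_HZ
cutoff_period_volumes = cutoff_period_s / TR_S
```

The kernel therefore has a standard deviation of about 1.27 mm, and the filter removes fluctuations slower than one cycle per 12 volumes.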
The total number of scans for the two monkeys was as follows (M1: 12, M2: 10). Sessions with obvious large imaging artefacts, high signal differences between hemispheres and/or insufficient baseline activity in the sound vs silence contrast were not included in the analyses (M1: 6, M2: 4 sessions).
Probabilistic maps. The applied probabilistic maps are an estimate of functional areas of the auditory field in standard space (D99) 29 based on the tonotopic gradients of six macaques (not included in this study), with the probabilistic map threshold set at 0.5, equivalent to at least 3 animals overlapping in the location of the auditory cortical fields. Isofrequency lines from mirror reversals between core (A1/R) and belt areas (ML/AL) were extended laterally to approximate the border between rostral and caudal parabelt. For each functional area, all voxels have an assigned value, representing the probability that a given voxel fell within this field. By thresholding these maps to 0.5, we made sure that each voxel is in at least 50% of the scanned population within the boundaries of the respective functional field. Data and code available on request from the corresponding authors.
Scientific Reports | (2018) 8:17948 | DOI:10.1038/s41598-018-36903-1
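The 0.5 threshold on the probabilistic maps can be illustrated with a toy example. The sketch below assumes that each voxel's probability is the fraction of the six animals whose field covers that voxel; the function names and the flat list representation of voxels are hypothetical and stand in for the actual volumetric maps.

```python
import math

N_ANIMALS = 6    # number of macaques contributing to the maps (from the text)
THRESHOLD = 0.5  # probabilistic map threshold (from the text)


def probability_map(overlap_counts, n_animals=N_ANIMALS):
    """Fraction of animals whose functional field includes each voxel."""
    return [count / n_animals for count in overlap_counts]


def threshold_map(probabilities, threshold=THRESHOLD):
    """Keep voxels whose probability reaches the threshold."""
    return [p >= threshold for p in probabilities]


# Smallest number of overlapping animals that satisfies the threshold.
min_animals = math.ceil(THRESHOLD * N_ANIMALS)

# Toy voxels covered by 0, 2, 3 and 6 animals respectively.
mask = threshold_map(probability_map([0, 2, 3, 6]))
```

In this toy case only the voxels covered by three or more animals survive, which matches the statement that the threshold corresponds to an overlap of at least 3 animals.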