Neuroimaging data1,2,3 have shown the existence of audiovisual multisensory neurons in the posterior region of the superior temporal sulcus (pSTS) and in the middle temporal gyrus (MTG) that respond to the sounds and visual images of objects and animals; these regions also respond to letters, speech sounds and labial movements4. In addition, these regions are activated more strongly by audiovisual stimuli than by unisensory stimuli, thus suggesting multisensory integration of inputs from the two modalities5. This multisensory integration is particularly strong for linguistic stimuli, in that an incongruent visual stimulus can qualitatively change the auditory perception at the level of the auditory cortex6,7,8. In monkeys, audiovisual “mirror” neurons have been discovered in the ventral premotor cortex9,10. These neurons discharge both when the animal performs a specific action and when it either hears the sound associated with that action or sees the action.

With regard to the timing of this integration, in an electrophysiological study by Senkowski11, the processing of multisensory (audiovisual) and unisensory (auditory or visual) stimuli was explored using naturalistic water splash sounds and corresponding visual images. The authors found an early effect of multisensory integration (120–140 ms) over the posterior brain areas; this was followed by later (210–350 ms) activity involving (among other areas) the temporal cortex (MTG and STG).

With the exception of direct neurophysiological evidence of “audiovisual mirror neurons” (in monkeys), most, if not all, neuroimaging studies of multisensory interactions in humans have relied on estimating audiovisual interactions by comparing the response to the multisensory stimulus and a combination of the responses to the unisensory stimuli presented in isolation. In the present study, the subjects received no auditory stimulation, but rather received only visual stimuli consisting of scenes strongly linked (or not linked) to a sound association (as estimated by an independent group of viewers); these included an image of a man playing a trumpet or an image of a sleeping child. All of the images (see Fig. 6 for some examples) were carefully matched for their size, average luminance, luminance profile, affective value and presence of animals or humans and differed only in their degree of auditory content. High-density EEG was recorded from 15 right-handed volunteers and swLORETA was performed on the brain activity related to sound and non-sound processing, as well as on their differential activation.


Occipital P1 was not affected by stimulus category, either in latency (F1,13 = 0.009; p = 0.93; sound = 105 ms, non-sound = 105 ms) or in amplitude (F1,13 = 0.0003; p = 0.99; sound = 6.52 µV, non-sound = 6.52 µV), as can be seen in the ERP waveforms of Fig. 1 (Top) and the corresponding topographical maps (Bottom).

Figure 1

(Top) Grand-average ERP waveforms recorded at left and right mesial occipital sites in response to sound and non-sound stimuli.

(Bottom) Topographical maps obtained by plotting the colour-coded average voltage recorded in the 100–120 ms time window in response to sound and non-sound stimuli. While both the waveforms and the maps relative to the early sensory visual activity (P1) were unaffected by stimulus content, sound stimuli elicited a stronger negativity (N1) with a fronto-central distribution.

Frontal N1 was differentially affected by stimulus category (F1,13 = 4.44; p < 0.05; ε = 1), being larger in response to sound stimuli than to non-sound stimuli (sound = −3.41 µV, SE = 0.51; non-sound = −2.74 µV, SE = 0.38); this is illustrated in the waveforms shown in Fig. 2. N1 reached its maximum amplitude at central (C3, C4) sites (F2,20 = 18.5; p < 0.00005; ε = 0.31). The frontal N2 response was also differentially affected by stimulus content (F1,13 = 4.87; p < 0.045; ε = 1), having a greater amplitude in response to sound stimuli than to non-sound stimuli (sound = −4.94 µV, SE = 1.08; non-sound = −4.36 µV, SE = 1.02), as shown in the topographical maps of Fig. 1 (Bottom). N2 reached its maximum amplitude at central (C3, C4) sites (F2,25 = 19.7; p < 0.00001; ε = 0.39). To identify the intracranial sources of the increased bioelectrical activity elicited by sound stimuli, two swLORETAs (displayed in Fig. 3) were applied to the difference voltages obtained by subtracting ERPs to non-sound stimuli from ERPs to sound stimuli in the two time windows of 100–120 ms (corresponding to the N1 peak) and 205–225 ms (corresponding to the N2 peak). The results are reported in Table 1, which lists the electromagnetic dipoles explaining the difference voltages, along with their Talairach coordinates. In the first time window, activation of the left MTG (BA21) was found, along with the right MOG and medial frontal gyrus. About 100 ms later, in the second time window, the signal power was stronger and included the activation of the left middle frontal gyrus, the right STG (BA38), the left ITG (BA20) and the left STG (BA41), the latter corresponding to the primary auditory cortex.

Table 1 Talairach coordinates corresponding to the intracortical generators, which explain the surface voltage recorded during the 100–120 and 205–225 ms time windows, respectively, in response to sound and non-sound stimuli. Magnitude is expressed in nAm; H = hemisphere; BA = Brodmann area.
Figure 2

Grand-average ERP waveforms recorded at left and right fronto-central sites in response to sound and non-sound stimuli.

Figure 3

Sagittal view of intra-cranial active sources explaining the difference voltage (sound minus non-sound stimuli) computed for the two time windows of 100–120 ms (corresponding to the N1 peak) and 205–225 ms (corresponding to the N2 peak).

The different colours represent differences in the magnitude of the electromagnetic signal (in nAm). The electromagnetic dipoles are shown as arrows and indicate the position, orientation and magnitude of dipole modelling solution applied to the ERP waveform in the specific time window. The two sagittal sections are centred on the left MTG (BA21) and the right STG (BA38), respectively. L = left; R = right; numbers refer to the displayed brain slice in sagittal view. The first is a left hemispheric view, the second is a right hemispheric view.

The later P3 response (600–800 ms) was larger in response to sound stimuli than to non-sound stimuli (F1,13 = 5.97; p < 0.042). The significant interaction of stimulus category × hemisphere (F1,13 = 5.1; p < 0.03) and the corresponding post-hoc comparisons indicated larger sound vs. non-sound differences over the left hemisphere (LH) than over the right hemisphere (RH: sound = 1.66, non-sound = 1.39 µV; LH: sound = 1.86, non-sound = 1.25 µV), as shown in Fig. 4.

Figure 4

Grand-average ERP waveforms recorded at left and right temporo-parietal and posterior-temporal sites in response to sound and non-sound stimuli.

To locate the possible neural source of the auditory content effect, two different swLORETA source reconstructions were performed independently for the sound and non-sound stimuli during the 600–800-ms time window, which corresponds to the peak of the temporal P3. The inverse solution is displayed in Fig. 5 and shows that the processing of both stimuli classes was associated with a common set of left and right generators (listed in Table 2) located in the ventral stream and devoted to both object/face processing (e.g., BA20 and BA37) and scene encoding. However, only perceived sound stimuli activated the superior temporal gyrus (BA38). In order to ascertain which regions were more robustly activated specifically during sound processing in the P3 latency range, an additional swLORETA was computed for the difference signals obtained by subtracting the bio-electric non-sound activity from the sound activity recorded during the 600–800-ms time window. The electromagnetic dipoles (listed in Table 3) represent intra-cranial sources of activity that were significantly stronger in response to sound than non-sound stimuli; the ITG, MTG and STG cortices (BA20, 21 and 38, respectively) were among the strongest foci.

Table 2 Talairach coordinates corresponding to the intracortical generators, which explain the surface voltage recorded during the 600–800-ms time window in response to sound and non-sound stimuli. Magnitude is expressed in nAm; H = hemisphere; BA = Brodmann area.
Table 3 Intracranial generators relative to the difference signal obtained by subtracting the bio-electric non-sound response from the sound response recorded during the 600–800-ms time window. The listed electromagnetic dipoles represent sources of activity that responded significantly more strongly to sound than to non-sound stimuli. The strongest responding foci included the right ITG, MTG and STG (BA20, 21 and 38, respectively).
Figure 5

Sagittal view of intra-cranial active sources for the processing of sound (left) and non-sound stimuli (right) according to the swLORETA analysis during the 600–800-ms time window.

A stronger sound-related temporal activation is evident, which likely reflects the processing of sound objects.


This early effect of multisensory integration is consistent with previous reports comparing multisensory audiovisual stimuli with unimodal visual or auditory stimuli11,12.

The lack of any stimulus-dependent modulation of the sensory visual ERPs suggests that the differences found between sound and non-sound stimuli were not due to their perceptual characteristics but, very likely, to the auditory content of the visual information carried by sound stimuli.

As for the earliest effect at the N1 level (100–120 ms), the inverse solution applied to the difference voltage (sound minus non-sound) showed that the main sources of activity for this effect were not entirely visual (MTG, MOG, rMFG). It cannot be excluded that the early right medial frontal activation reflected an attentional modulation in addition to multisensory integration processes. However, the role of the medial frontal cortex in auditory processing has also been established. For example, Anderer et al.13 applied LORETA source reconstruction to auditory ERPs recorded in an oddball task, finding an activation of the superior temporal gyrus [auditory cortex, Brodmann areas (BA) 41, 42, 22] for both N1 and N2 responses and also a medial frontal source (BA 9, 10, 32) for the N2 response. An early activation of occipital, temporal and frontal cortices for multisensory audio-visual (AV) processing was reported by a recent fMRI study14 in which subjects passively perceived sounds and images of objects presented either alone or simultaneously. After AV stimulation, significant activity (after 6–7 s) was observed in the superior temporal gyrus, middle temporal gyrus, right occipital cortex and inferior frontal cortex, as well as in the right Heschl's gyrus, thus suggesting a crucial role of these areas in object-dependent audio-visual integration.

According to Näätänen and Winkler15, the fronto-central N1 (100 ms) response reflects the initial access to a mental auditory representation, whereas the fronto-central N250 (200–250 ms) response indexes the stage of multisensory integration, with visual inputs coming from the ventral stream. Other electrophysiological studies (e.g., Ref. 16) found an increase in anterior N2 amplitude while participants imagined an auditory stimulus, which likely suggests activation of an auditory mental representation.

Considering the visual and implicit nature of our experiment (the participants were actively looking for target scenes, i.e., cycle races, while ignoring other images), our ERP data indicate an automatic and early access to object sound properties. Studies of multimodal integration11 have suggested an early activation of audiovisual neurons at about 100 ms that is followed by more robust activity in a later time window (210–350 ms). This activity would involve regions of the associative temporal cortex (MTG and STG, among others), as shown by the swLORETA inverse solutions performed on our N1, N2 and P3 data. Interestingly, direct neurophysiological data9 suggest that the STS is an integration area for visual and auditory inputs (such as the sight of an action and its corresponding sound), thus demonstrating the existence of audiovisual mirror neurons.

In conclusion, we provide evidence that the mere sight of scenes and objects typically associated with sound automatically activates auditory representations in several regions within the associative temporal cortex and even the primary auditory cortex. Moreover, these regions are known to be engaged in the perception of complex sounds17, audiovisual processing of speech stimuli18, audiovisual integration19 and auditory verbal hallucinations20,21, which tend to be selectively associated with right STS activation.



Fifteen healthy right-handed university students (8 men and 7 women) participated in this study as unpaid volunteers. They earned academic credit for their participation. Their mean age was 22.8 years, ranging from 20 to 27 years. All had normal or corrected-to-normal vision and reported no history of neurological illness or drug abuse. Their right-handedness and right ocular dominance were confirmed using the Italian version of the Edinburgh Handedness Inventory, a laterality preference questionnaire. All experiments were conducted with the understanding and written consent of each participant. No participant was excluded for technical reasons. The experimental protocol was approved by the ethics committee of the University of Milano-Bicocca.

Stimuli and materials

The stimulus set consisted of 300 complex ecological scenes. The pictures were downloaded from Google Images (the examples reported in Fig. 6 are custom-made and copyright-free). The two classes of stimuli (sound and non-sound) were matched for their size (350 × 350 pixels), luminance (41.92 cd/cm2), affective value and presence of animals or persons. Half of the images (150) evoked a strong auditory image (sound stimuli), whereas the other half were not linked to any particular sound (non-sound stimuli). The stimulus set was selected from a larger set of images by presenting them to a group of 20 judges (10 men and 10 women) and asking them to score whether they evoked an auditory association using a 3-point scale (with 2, 1 and 0 being strong, weak and absent auditory content, respectively).

Figure 6

Example images of stimuli in the sound and non-sound categories.

To provide a clear distinction between the sound and non-sound stimulus groups, pictures scoring an average value of 0.5–2 were placed in the sound category, whereas pictures scoring a value of 0 were placed in the non-sound category. A t-test applied to the 2 groups confirmed that their auditory contents were significantly different (Sound = 1.41, SE = 0.37; Non-sound = 0; t-value = 46.58; p < 0.05). Three hundred (150 sound and 150 non-sound) images meeting the above criteria were then selected to create the final stimulus set; some example images are shown in Fig. 6.
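The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_picture` and the example ratings are hypothetical, and the treatment of pictures with a mean score between 0 and 0.5 (left out of both categories) is inferred from the stated thresholds rather than described explicitly in the text.

```python
from statistics import mean

def classify_picture(scores):
    """Assign a picture to a category from the judges' 0-2 ratings.

    Returns 'sound' for a mean rating of 0.5-2, 'non-sound' for a mean
    rating of exactly 0, and None for anything in between (assumed here
    to be excluded from the final stimulus set).
    """
    avg = mean(scores)
    if avg == 0:
        return "non-sound"
    if 0.5 <= avg <= 2:
        return "sound"
    return None  # weak or ambiguous auditory content

# Hypothetical ratings, not taken from the study
print(classify_picture([2, 2, 1, 2]))     # strongly sound-related picture
print(classify_picture([0, 0, 0, 0]))     # picture with no auditory content
print(classify_picture([0, 1, 0, 0, 0]))  # ambiguous picture
```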

The stimuli in the 2 classes were also matched for their affective value by presenting the pictures to a group of 10 judges (5 men and 5 women), different from those used above, and asking them to evaluate the stimuli in terms of their affective content using a 3-point scale (with 2, 1 and 0 being strong, weak and null affective value, respectively). A t-test applied to the 2 groups confirmed that their affective values were not significantly different (Sound = 0.76; Non-sound = 0.66; t-value = 1.68; p = 0.09).

Twenty-five additional photos depicting a cycle race were included in the stimulus set for the subjects to perform a secondary task (described below); these images were of similar average luminance, size and spatial distribution as the other images. The sound and non-sound images were presented in random order together with the 25 cycle race photos. The stimulus size was 14.2 × 14.2 cm subtending a visual angle of 6°43′01″. Each image was presented for 1000 ms against a dark grey background at the center of a computer screen with an ISI of 1500–1900 ms.
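As a rough check on the stimulus geometry, the visual angle can be computed from the stimulus size and the viewing distance. The sketch below assumes the standard formula 2·atan(size / (2·distance)) and the 120 cm viewing distance given under Task and procedure; the function name is hypothetical. With these inputs the formula gives a value close to the reported angle.

```python
import math

def visual_angle_deg(size_cm, distance_cm):
    """Full visual angle (in degrees) subtended by a stimulus of size_cm
    viewed from distance_cm, using 2 * atan(size / (2 * distance))."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

# 14.2 cm stimulus viewed from 120 cm
angle = visual_angle_deg(14.2, 120.0)
print(f"{angle:.2f} degrees")
```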

Task and procedure

The participants were comfortably seated in a darkened test area that was acoustically and electrically shielded. A high-resolution VGA computer screen was placed 120 cm in front of their eyes. The subjects were instructed to gaze at the center of the screen (where a small circle served as a fixation point) and to avoid any eye or body movement during the recording session. The stimuli were presented in random order at the center of the screen in 6 different randomly mixed short runs lasting approximately 2 minutes and 40 seconds. To keep the subject focused on the visual stimuli, the task consisted of responding as accurately and quickly as possible to photos displaying cycle races by pressing a response key with the index finger of the left or right hand; all other photos were to be ignored. The left and right hands were used alternately throughout the recording session and the order of the hand and task conditions was counterbalanced across the subjects. For each experimental run, the number of target stimuli varied between 3 and 7 and the presentation order differed among the subjects.

EEG recording and analysis

The EEG data were continuously recorded from 128 scalp sites at a sampling rate of 512 Hz. Horizontal and vertical eye movements were also recorded and linked ears served as the reference lead. The EEG and electro-oculogram (EOG) were filtered with a half-amplitude band pass of 0.016–100 Hz. Electrode impedance was maintained below 5 kΩ. EEG epochs were synchronized with the onset of stimulus presentation. Computerized artifact rejection was performed prior to averaging to discard epochs in which eye movements, blinks, excessive muscle potentials or amplifier blocking occurred. The artifact rejection criterion was a peak-to-peak amplitude exceeding 50 μV and resulted in a rejection rate of 5%. Event-related potentials (ERPs) from 100 ms before through 1000 ms after stimulus onset were averaged off-line. ERP components (including the site and latency of maximum amplitude) were identified and measured with respect to the baseline voltage, which was averaged over the interval from −100 ms to 0 ms.
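The rejection and averaging steps described above can be illustrated with a minimal single-channel sketch in plain Python. It is not the software used in the study: the function names are hypothetical, the data are synthetic, and the order of operations (rejection on the raw epoch, then baseline correction, then averaging) is an assumption, since the text does not specify it. Only the sampling rate, the 100 ms baseline and the 50 µV peak-to-peak criterion come from the text.

```python
SFREQ = 512        # sampling rate in Hz (as reported)
PRE_MS = 100       # pre-stimulus baseline length in ms (as reported)
REJECT_UV = 50.0   # peak-to-peak rejection criterion in microvolts (as reported)

def is_artifact(epoch, threshold=REJECT_UV):
    """Flag an epoch whose peak-to-peak amplitude exceeds the threshold."""
    return max(epoch) - min(epoch) > threshold

def baseline_correct(epoch, sfreq=SFREQ, pre_ms=PRE_MS):
    """Subtract the mean of the pre-stimulus samples from every sample."""
    n_base = int(round(sfreq * pre_ms / 1000))
    base = sum(epoch[:n_base]) / n_base
    return [v - base for v in epoch]

def average_erp(epochs):
    """Average the artifact-free, baseline-corrected epochs.

    Returns the averaged waveform and the number of epochs kept.
    """
    kept = [baseline_correct(e) for e in epochs if not is_artifact(e)]
    if not kept:
        raise ValueError("all epochs were rejected")
    n = len(kept)
    return [sum(samples) / n for samples in zip(*kept)], n

# Synthetic epochs: 51 baseline samples followed by 10 post-stimulus samples
clean_a = [1.0] * 51 + [3.0] * 10
clean_b = [2.0] * 51 + [6.0] * 10
blink = [0.0] * 51 + [100.0] * 10   # exceeds 50 uV peak-to-peak, rejected
erp, n_kept = average_erp([clean_a, clean_b, blink])
print(n_kept, erp[-1])
```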

The peak amplitude and latency of the sensory P1 response were measured at mesial occipital (O1, O2) and lateral occipital (POO9h, POO10h) electrode sites in the 80–120 ms time window. The mean amplitudes of frontal N1 and N2 were measured at the left and right central (C1, C2, C3, C4), frontal (F1, F2, F3 and F4) and fronto-central (FC1, FC2, FC3 and FC4) electrode sites in the 100–120-ms and 200–275-ms time windows, respectively. The mean amplitude of the temporal P3 component was measured at the posterior temporal and temporo-parietal (T7, T8, TTP7h and TTP8h) electrode sites in the 600–800-ms time window. Multifactorial repeated-measures ANOVAs were applied to the ERP data using the following within-subject factors: stimulus category (Sound, Non-Sound), electrode (according to the ERP component of interest) and hemisphere (Left, Right). Multiple comparisons of means were performed using the post-hoc Tukey test. The degrees of freedom were adjusted by means of the Greenhouse-Geisser correction to account for possible violations of sphericity. The modified degrees of freedom are reported, together with ε and the corrected probability level.
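The measurement and testing steps above can be sketched as follows, under clearly stated simplifications. `window_mean` converts a latency window into sample indices for an epoch starting at −100 ms and sampled at 512 Hz; `paired_f` computes the F statistic for a single two-level within-subject factor, which equals the squared paired t statistic with (1, n − 1) degrees of freedom. The function names and all numeric values are hypothetical, and the sketch does not reproduce the full multifactorial design (electrode and hemisphere factors) or the Tukey tests.

```python
from statistics import mean, stdev

SFREQ = 512            # sampling rate in Hz (as reported)
EPOCH_START_MS = -100  # epoch onset relative to the stimulus (as reported)

def window_mean(erp, start_ms, end_ms, sfreq=SFREQ, t0_ms=EPOCH_START_MS):
    """Mean amplitude of an ERP waveform within a latency window (inclusive)."""
    i0 = round((start_ms - t0_ms) * sfreq / 1000)
    i1 = round((end_ms - t0_ms) * sfreq / 1000)
    return mean(erp[i0:i1 + 1])

def paired_f(cond_a, cond_b):
    """F statistic for one two-level within-subject factor.

    Each list holds one value per participant. With two levels, the
    repeated-measures F equals the squared paired t, with (1, n - 1) df.
    """
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / n ** 0.5)
    return t * t, (1, n - 1)

# Hypothetical per-participant mean amplitudes (microvolts), not study data
sound = [1.0, 2.0, 3.0, 5.0]
non_sound = [0.0, 0.0, 1.0, 2.0]
f_value, df = paired_f(sound, non_sound)
print(f"F{df} = {f_value:.2f}")
```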

Low-Resolution Electromagnetic Tomography (LORETA) was performed on the ERP waveforms at the latency stages where the sound/non-sound difference was greatest, namely, at the N1, N2 and P3 levels. LORETA22 is a discrete linear solution to the inverse EEG problem and corresponds to the 3D distribution of neuronal electrical activity that has maximally similar (i.e., maximally synchronized) orientation and strength between neighboring neuronal populations (represented by adjacent voxels). In this study, an improved version of standardized weighted LORETA was used; this version, called swLORETA, incorporates a singular value decomposition-based lead field weighting method. The source space properties included a grid spacing (the distance between two calculation points) of 5 points and an estimated signal-to-noise ratio (which defines the regularization; a higher value indicates less regularization and therefore less blurred results) of 3. SwLORETA was performed on the group data and identified statistically significant electromagnetic dipoles (p < 0.05), with larger magnitudes correlating with more significant activation. A realistic boundary element model (BEM) was derived from a T1-weighted 3D MRI data set by segmentation of the brain tissue. This BEM consisted of one homogeneous compartment comprising 3,446 vertices and 6,888 triangles. The head model was used for intracranial localization of surface potentials. Both segmentation and generation of the head model were performed using the ASA software program.
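For orientation, the general family of linear inverse solutions to which LORETA-type methods belong can be written as a regularized, weighted minimum-norm estimate. The block below is a generic sketch, not the swLORETA implementation in ASA: the symbols are assumptions introduced here (Φ for the scalp potentials, K for the lead field, J for the current density, W for a symmetric positive-definite weighting matrix and α for the regularization parameter). In LORETA-type methods, W typically encodes spatial smoothness between neighboring voxels, and α is assumed to decrease as the estimated signal-to-noise ratio increases, in line with the description above.

```latex
% Generic regularized weighted minimum-norm inverse (illustrative sketch only).
% \Phi : scalp potentials, K : lead field matrix, J : current density,
% W : symmetric positive-definite weighting matrix, \alpha > 0 : regularization.
\hat{J}
  = \arg\min_{J} \; \lVert \Phi - K J \rVert^{2} + \alpha \, J^{\top} W J
  = \left( K^{\top} K + \alpha W \right)^{-1} K^{\top} \Phi
  = W^{-1} K^{\top} \left( K W^{-1} K^{\top} + \alpha I \right)^{-1} \Phi
```

The last form is usually preferred in practice because the matrix to invert has the dimension of the number of electrodes rather than the number of source points.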