Here we present an update of the studyforrest (http://studyforrest.org) dataset that complements the previously released functional magnetic resonance imaging (fMRI) data for natural language processing with a new two-hour 3 Tesla fMRI acquisition while 15 of the original participants were shown an audio-visual version of the stimulus motion picture. We demonstrate with two validation analyses that these new data support modeling specific properties of the complex natural stimulus, as well as a substantial within-subject BOLD response congruency in brain areas related to the processing of auditory inputs, speech, and narrative when compared to the existing fMRI data for audio-only stimulation. In addition, we provide participants' eye gaze location as recorded simultaneously with fMRI, and an additional sample of 15 control participants whose eye gaze trajectories for the entire movie were recorded in a lab setting—to enable studies on attentional processes and comparative investigations on the potential impact of the stimulation setting on these processes.
Background & Summary
The investigation of the interplay of cognitive processes in a complex natural environment is an emerging research focus1. There is a growing number of publicly available datasets to facilitate this kind of research (Data Citation 1: CRCNS.org http://dx.doi.org/10.6080/K0QN64NGx, Data Citation 2: CRCNS.org http://dx.doi.org/10.6080/K0JS9NC2, Data Citation 3: CRCNS.org http://dx.doi.org/10.6080/K00Z715X, Data Citation 4: OpenfMRI ds000149) that primarily focus on aspects of processing sensory information in the visual domain. Complementing these resources, we previously released a high-resolution BOLD fMRI dataset on the processing of language and auditory information in a two-hour audio movie version of the Hollywood motion picture ‘Forrest Gump’, the studyforrest dataset (http://studyforrest.org)2.
One of the main challenges of working with natural stimulation paradigms is the presence of potential confounds that could impair the interpretability of results. Consider the case of an audio movie stimulus where the portrayal of emotional arousal happens to be correlated with loudness of the stimulus. In this case, it would be hard to dissociate the neural signatures of low-level stimulus properties from the neural representation of the perceived emotional state of a story character.
This problem can be addressed in two ways. First, by aiming to compose a complete description of the covariance structure of stimulus properties to allow for the discovery and quantification of potential confounds. Such a description should include low-level properties, often suitable for extraction with computer algorithms, and high-level features, such as a characterization of portrayed emotions3 that (still) requires human observers. Another approach is the combination of several stimuli that feature a different covariance structure of their particular properties in order to disentangle signatures of individual neural processes (see, for example, Hanke et al.4 for an extension of our original dataset with BOLD fMRI data on the perception of music for the same participants).
Here we present a further extension of the studyforrest dataset with data from 15 of the original participants. About a year after their initial participation, they watched and listened to the audio-visual ‘Forrest Gump’ movie while fMRI data were being acquired, using the exact same movie cut and a stimulation that was synchronized with the previous acquisition. This additional stimulus delivers the same complex, two-hour story as the previous audio movie, but adds an enormous amount of visual information that details and uniformly manifests new aspects of story content that were previously un(der)defined and left to a participant's imagination, such as the specific composition of the visual scenery or the details of facial expressions.
This extension with an audio-visual movie enables the investigation of the representation of visual and multi-sensory input in a complex natural setting. Moreover, it facilitates comparative analyses5 with other existing datasets on movie perception, as it is more similar in terms of its acquisition parameters and the nature of the stimulus content. At the same time, the combined dataset offers the unique opportunity to study the within-subject similarity of information representation across stimulus domains in the context of a complex and prolonged story narrative.
Attentional processes are likely to play a comparatively larger role in the selective processing of audio-visual vs. audio-only stimulation. With the goal of investigating the representation of information extracted from complex multi-sensory input, it is therefore of paramount importance to assess what particular aspects of the stimulus a participant is paying attention to at any point in time. In order to capture the location of the focus of attention, we recorded participants' eye gaze coordinates for the entire duration of the movie simultaneously with the fMRI acquisition. This additional data modality enables further areas of investigation, such as which stimulus features drive attentional selection processes, and it also makes it possible to attempt to directly model aspects of the BOLD signal based on a behavioral measurement of visual spatial attention6. In order to be able to assess a potential influence of the stimulation setting in the scanner (small screen, participants lying on their backs looking up), we include an additional sample of 15 participants who watched the movie in a lab setting on a larger screen while sitting in a comfortable chair.
In summary, this extension of the studyforrest dataset described here, together with additional new fMRI data on retinotopic mapping and localization of visual areas described in a companion publication7, substantially increases the bandwidth of questions that can be investigated. Furthermore, the released data can be used to further improve the description of the stimulus itself: What object categories are being attended? When are eye movements made to track moving objects rather than switching the focus to a different object? Do participants at any given time explore the scenery, tune in on the dialog, or focus on a particular action being performed? Such future additions will help to further expand the scope and improve the interpretability of results.
The information presented here is limited to a detailed description of aspects relevant to the simultaneous BOLD fMRI and eye gaze recording. For details on the general methods and a sample description, we refer the reader to the companion article7.
All 15 participants described in Sengupta et al.7 also volunteered for the simultaneous fMRI and eye gaze recording in this study. Consequently, additional BOLD fMRI data for all participants have been made available previously2,4 and participant IDs are matched across all studies. One participant still had not seen the audio-visual ‘Forrest Gump’ movie despite having participated in the audio-only movie study. The native language of all participants was German.
Although some participants posed substantial challenges for the eye tracking setup, such as partially obscured pupils due to long eye lashes or low contrast between iris and pupil due to insufficient lighting as a result of the head position in the head coil, no participant was excluded from the study. For those participants, the eye gaze recording was performed on a best-effort basis in order to maintain homogeneous procedures across participants while recording BOLD fMRI. Table 1 lists the affected participants. Participants in the fMRI experiment underwent a dedicated ophthalmologic exam to confirm normal vision7.
15 additional participants (age 19–30, mean 22.4, 10 females) volunteered for a separate eye tracking-only experiment. Two of those participants had never seen the movie ‘Forrest Gump’ before. All participants had normal or corrected-to-normal vision. These participants have a numerical subject ID that is greater than 20.
Participants were fully instructed about the nature of the study, gave their informed consent for participation in the study as well as for public sharing of all obtained data in anonymized form, and received monetary compensation. This study was approved by the Ethics Committee of Otto-von-Guericke University (approval reference 37/13).
The procedures were highly similar to those described in Hanke et al.2. Before the scan, participants filled out a questionnaire on their basic demographic information and familiarity with the ‘Forrest Gump’ movie.
fMRI data acquisition was split into two sessions that immediately followed each other on the same day. All eight movie segments were presented individually in chronological order with four segments in each session. Between sessions, participants left the scanner for a break with a flexible duration. On average, participants chose to continue with the second session after approximately 10 min.
At the start of the movie recording session, while auxiliary scans were performed, participants listened to music from the movie's closing credits. During this time, the optimal stimulus volume was determined for each participant individually. Participants were instructed to ‘maximize the volume of the stimulus without it becoming unpleasantly loud or causing acoustic distortions by overdriving the loudspeaker hardware’. After the start of the movie, participants still had the ability to change the volume. Any adjustments were made within the first few seconds of the first movie segment; the volume remained constant across all movie segments otherwise.
Immediately prior to each movie segment, the eye tracker was calibrated (see eye tracking setup), and the calibration was immediately followed by an accuracy validation. Directly after each movie segment ended, another gaze accuracy validation was performed.
After every movie segment, when the scanner was stopped and the gaze accuracy validation was completed, participants were asked to rate their experience during the preceding segment (‘How deeply did you get into the story?’) on a scale from 1 (not at all) to 4 (a lot). Participants responded by pressing a button on a four-button response board with their right hand. This rating was followed by a brief break, and the recording continued as soon as the participant requested to continue by pressing a button.
Participants were instructed to inhibit any physical movements apart from eye movements, as best they could, throughout the recording sessions. Other than that, participants were instructed to simply ‘enjoy the movie’.
Procedures for the behavioral eye tracking session were practically identical, with the exceptions that there was no break between the first and second half of the movie and there was no dedicated audio volume calibration session. Instead, participants could manually adjust the volume throughout the experiment if they desired.
Participants watched and listened to the movie ‘Forrest Gump’ (R. Zemeckis, Paramount Pictures, 1994, dubbed German soundtrack). The stimulus source was the commercially available high-resolution Blu-ray disk release of the movie from 2011 (EAN: 4010884250916). Temporal alignment of the movie with the previously used2 DVD release was manually verified by audio waveform matching in order to guarantee synchronous stimulus time series between studies. The video track of the Blu-ray was extracted, re-encoded as H.264 video (1280×720 at 25 fps), and muxed with the DVD's dubbed German soundtrack using the MLT Multimedia Framework (http://www.mltframework.org; version 0.8.0–4, retrieved from the Debian archive).
Analogous to the previous study, the audio track was processed by a series of filters in order to improve the audibility of the stimulus in the noisy environment of the scanner during echo planar imaging (EPI). Major noise components were located at 690 Hz and 1285 Hz as well as their harmonics at 2070 Hz and 4140 Hz. More detailed information on the noise signature of the employed EPI sequence can be found in Angenstein et al. (Fig. 1)8.
For filtering, a multi-band equalizer was first used to implement a high-pass filter (−70 dB attenuation in the 50 Hz band and −10 dB at 100 Hz) to remove low frequencies that would have caused acoustic distortions in the headphones at high volume. This filter was followed by a dynamic range compressor (attack 128 ms, release 502 ms, 2:1 compression ratio above −18 dB), a hard limiter (attack 128 ms, release 502 ms, cut off −12 dB), and lastly a 3 dB volume gain. This chain implemented a filter conceptually similar to the one used in the audio-only movie study2; however, the settings were optimized for the particular stimulation setup and environment. As a result, the filtered audio track differs slightly from the one used previously for the audio-only movie stimulus2, even in segments where the original soundtrack is completely identical between the audio-only and audio-visual movie.
Subsequently, the movie stimulus was shortened and cut into the same eight segments, approximately 15 min long each, as in the audio-only movie study. Additionally, the same fade-in/fade-out ramps were applied to the audio and video track (see Table 1 and Fig. 3a in ref. 2).
Movie segment stimulus creation
As before, all video/audio editing was performed using the ‘melt’ command-line video editor on a PC running the (Neuro)Debian9 operating system. As the actual commands are highly similar to those previously reported2, only key differences are described. The full source code for the stimulus generation is included in the data release in the code/stimulus/movie directory. This directory also contains an annotation of all cuts in the movie segments (time code and zero-based frame number of the first frame in each shot). This information can be used to verify the timing of a reproduced stimulus.
All video files were created using a high-definition rendering profile (-profile hdv_720_25p). The watermark plug-in was used to replace the original black horizontal bars at the top and bottom of the movie frames with medium-gray bars of the same size in order to increase background illumination for a more pleasant experience.
Audio processing, as described above, was implemented using the following filter settings and utilizing two LADSPA-plugins (1197, Steve Harris, Ushodaya Enterprises Limited; 2152, Tom Szilagyi, Meltytech, LLC).
# Multiband EQ
-attach-track ladspa.1197 0=-70 1=-10
# Compressor
-attach-track ladspa.2152 0=128 1=502 2=0 3=20 6=3
# Limiter
-attach-track ladspa.2152 0=128 1=502 2=0 3=-20 6=10
# Volume gain
-attach-track volume gain=3
Video output specification options were:
f=matroska acodec=libmp3lame ab=256k vcodec=libx264 b=5000k
Stimulation and eye tracking setup for fMRI acquisition
The visual stimulation setup was as described in the companion article7. Importantly, the movie was shown at a viewing distance of 63 cm in 720p resolution at full width on a 1280×1024 pixel screen that was 26.5 cm wide—corresponding to 23.75°×13.5° of visual angle or 23.75°×10.25° when considering only the movie content and excluding the horizontal gray bars.
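The reported visual angles can be checked from the screen geometry given above. The following is a minimal sketch, assuming square pixels and a stimulus centered on the line of sight; the function and variable names are illustrative and not part of the data release, and the 546 px content height is taken from the description of the normalized gaze data.

```python
import math


def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Full visual angle (in degrees) of an extent centered on the line of sight."""
    return 2 * math.degrees(math.atan(size_cm / (2 * distance_cm)))


# In-scanner setup as described in the text
SCREEN_WIDTH_CM = 26.5
SCREEN_WIDTH_PX = 1280
VIEWING_DISTANCE_CM = 63.0

# Assumption: square pixels, so one conversion factor applies to both axes
cm_per_px = SCREEN_WIDTH_CM / SCREEN_WIDTH_PX

# Full 720p frame, including the gray bars
width_deg = visual_angle_deg(SCREEN_WIDTH_CM, VIEWING_DISTANCE_CM)
frame_height_deg = visual_angle_deg(720 * cm_per_px, VIEWING_DISTANCE_CM)

# Movie content only (546 px high, excluding the gray bars)
content_height_deg = visual_angle_deg(546 * cm_per_px, VIEWING_DISTANCE_CM)

print(f"{width_deg:.2f} x {frame_height_deg:.2f} deg (full frame)")
print(f"{width_deg:.2f} x {content_height_deg:.2f} deg (movie content)")
```

Under these assumptions the results agree with the reported 23.75°×13.5° and 23.75°×10.25° to within rounding.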
Eye tracking was performed using monocular corneal reflection and pupil tracking with an Eyelink 1000 (software version 4.594) equipped with an MR-compatible telephoto lens and illumination kit (SR Research Ltd., Mississauga, Ontario, Canada). The temporal resolution of the eye gaze recording was 1000 Hz. The eye tracking camera was mounted just outside the scanner bore, approximately centered, viewing the left eye of a participant at a distance of about 100 cm through a small gap between the top of the back projection screen and the scanner bore ceiling. An infrared light source, mounted slightly lower, illuminated the observed eye through the gap between the left side of the back projection screen and the scanner bore. The eye tracker was calibrated using a 13-point sequence (black dot on a medium-gray background, RGB(150,150,150)) that covered the entire display. For accuracy validation, participants had to fixate on the same 13 points, and offsets to the target coordinates were determined. Calibration was repeated as often as necessary until sufficient accuracy was achieved (see Technical Validation for accuracy estimates).
For participants 2, 10, and 20, the movie was presented vertically centered on the screen (the top movie pixel line below the gray horizontal bars was at y=239 px). For all other participants, the movie display was shifted upwards by 171 px to yield a slightly more open eyelid and a better illumination of the pupil for improved eye tracking reliability.
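The vertical placement of the movie on the 1280×1024 screen follows from simple arithmetic. The sketch below illustrates how a screen y coordinate could be mapped to a movie frame y coordinate. It assumes screen coordinates with y=0 at the top and a movie content height of 546 px (the value given in the description of the normalized gaze data). The function names are hypothetical; the normalization code shipped with the data release remains the authoritative implementation.

```python
SCREEN_HEIGHT_PX = 1024
MOVIE_CONTENT_HEIGHT_PX = 546        # movie frame without the gray bars
UPWARD_SHIFT_PX = 171                # applied for most participants
CENTERED_PARTICIPANTS = {2, 10, 20}  # movie shown vertically centered


def content_top_px(participant_id: int) -> int:
    """Screen y coordinate of the top movie pixel line (y grows downward)."""
    centered_top = (SCREEN_HEIGHT_PX - MOVIE_CONTENT_HEIGHT_PX) // 2
    if participant_id in CENTERED_PARTICIPANTS:
        return centered_top
    return centered_top - UPWARD_SHIFT_PX


def screen_to_movie_y(y_screen_px: float, participant_id: int) -> float:
    """Map a screen y coordinate to a movie frame y coordinate (0 = top line)."""
    return y_screen_px - content_top_px(participant_id)


print(content_top_px(2))   # centered display
print(content_top_px(1))   # shifted display
```

For the centered display this reproduces the stated y=239 px; for the shifted display the top line would lie at 239−171=68 px.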
Auditory stimulation was delivered through an MR confon mkII+ driving electrostatic headphones (HP-M01, MR confon GmbH, Magdeburg, Germany)11 fed from an Aureon 7.1 USB (Terratec) sound card through an optical connection. A participant's head was fixed using a cushion with attached earmuffs containing the headphones. In addition, the participants wore earplugs. Headphones and earplugs each reduce the scanner noise by at least 20–30 dB, depending on the frequency.
fMRI response/synchronization setup
The TTL trigger signal emitted by the MRI scanner was fed to a Teensy3 microcontroller (PJRC.COM, LLC, Sherwood, OR, USA). The open collector signal from the response board (ResponseBox 1.2, Covilex, Magdeburg, Germany) was also fed into the Teensy3. A simple ‘Teensyduino sketch’ was used to convert the signals to USB keyboard events (‘t’, ‘1’, ‘2’, ‘3’, ‘4’) at the stimulus computer.
Stimulus presentation and eye gaze recording were synchronized with the fMRI acquisition by automatically starting the eye gaze recording as soon as the stimulus computer received the first fMRI trigger signal. Moreover, the timing of subsequent trigger pulses was logged on the stimulus computer, and the onset of every movie frame (target frequency 25 fps) was logged as part of the eye gaze recording via a custom log message to the eye tracker at the moment of the respective video buffer flip.
Due to the specific properties of the optical DVI extension system installed at the scanner, there was no direct synchronization of the video output of the stimulus computer with the refresh cycle of the projector. Moreover, it introduced a temporal offset and uncertainty between the recorded video update on the stimulus computer and in the eye tracker with respect to its actual appearance on the back-projection screen in the scanner. Separate measurements with a flickering test stimulus and a photo diode setup in the scanner-bore yielded a delay between target and actual onset time of 6–8 video refresh cycles (100–133 ms at 60 Hz). No participant noted any audio-visual stimulus asynchrony.
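The conversion of the measured delay from refresh cycles to milliseconds, and its size relative to one movie frame, can be sketched as follows. The helper name is illustrative; all numbers are taken from the text.

```python
def refresh_cycles_to_ms(cycles: float, refresh_hz: float) -> float:
    """Duration of a number of display refresh cycles in milliseconds."""
    return cycles * 1000.0 / refresh_hz


PROJECTOR_REFRESH_HZ = 60.0
MOVIE_FPS = 25.0

delay_min_ms = refresh_cycles_to_ms(6, PROJECTOR_REFRESH_HZ)
delay_max_ms = refresh_cycles_to_ms(8, PROJECTOR_REFRESH_HZ)
frame_duration_ms = 1000.0 / MOVIE_FPS

# Delay expressed in units of movie frames
delay_min_frames = delay_min_ms / frame_duration_ms
delay_max_frames = delay_max_ms / frame_duration_ms

print(f"{delay_min_ms:.0f}-{delay_max_ms:.0f} ms, "
      f"i.e. {delay_min_frames:.1f}-{delay_max_frames:.1f} movie frames")
```

The delay therefore spans roughly 2.5 to 3.3 movie frames at 25 fps, which is worth keeping in mind when relating gaze samples to individual frames.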
fMRI data acquisition
T2*-weighted echo-planar images (gradient-echo, 2 s repetition time (TR), 30 ms echo time, 90° flip angle, 1943 Hz/Px bandwidth, parallel acquisition with sensitivity encoding (SENSE) reduction factor 2) were acquired during stimulation using a whole-body 3 Tesla Philips Achieva dStream MRI scanner equipped with a 32 channel head coil. 35 axial slices (thickness 3.0 mm, 80×80 voxels (3.0×3.0 mm) of in-plane resolution, 240 mm field-of-view (FoV), anterior-to-posterior phase encoding direction) with a 10% inter-slice gap were recorded in ascending order, practically covering the whole brain. Philips' ‘SmartExam’ was used to automatically position slices in AC-PC orientation such that the topmost slice was located at the superior edge of the brain. This automatic slice positioning procedure was identical to the one used for scans reported in the companion article7 and yielded a congruent geometry across all paradigms.
The number of volumes acquired per movie segment was 451, 441, 438, 488, 462, 439, 542, and 338 volumes (for movie segments 1–8 respectively), and was therefore identical to the audio-only movie study2.
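As a quick consistency check, the per-segment volume counts and the 2 s TR translate into scan durations as sketched below (variable names are illustrative).

```python
TR_SECONDS = 2.0
VOLUMES_PER_SEGMENT = [451, 441, 438, 488, 462, 439, 542, 338]

# Scan duration of each movie segment in seconds
segment_durations_s = [n * TR_SECONDS for n in VOLUMES_PER_SEGMENT]

total_volumes = sum(VOLUMES_PER_SEGMENT)
total_duration_min = sum(segment_durations_s) / 60.0

for idx, dur in enumerate(segment_durations_s, start=1):
    print(f"segment {idx}: {dur / 60.0:.1f} min")
print(f"total: {total_volumes} volumes, {total_duration_min:.1f} min")
```

The total of 3599 volumes corresponds to just under two hours of scan time, in line with the two-hour acquisition described above.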
Pulse oximetry and respiratory trace were recorded simultaneously with BOLD fMRI acquisition for the entire duration of the movie. The acquisition setup and the properties of the released data are described in the companion article7.
Stimulation and eye tracking setup for in-lab acquisition
The stimulation setup for the in-lab acquisition was identical to the one at the MRI scanner except for the following differences: The screen for stimulus presentation was a BenQ XL2410T LCD monitor of size 522×294 mm, displaying a resolution of 1920×1080 px with a vertical refresh rate of 120 Hz, and it was directly connected to the stimulus computer. The screen was positioned at a viewing distance of 85 cm, from which participants watched the movie while sitting in a slightly reclined chair—their head movements constrained by a U-shaped headrest that covered the back of the head. A different Eyelink 1000 with a standard desktop mount (software version 4.51; SR Research Ltd., Mississauga, Ontario, Canada) was used to record eye gaze coordinates at 1000 Hz using corneal reflection and pupil tracking of the left eye. In this configuration, the movie stimulus (excluding the gray horizontal bars at the top and bottom) subtended 34×15° of visual angle.
All custom source code for data conversion from raw, vendor-specific formats into the de-identified released form is included in the data release (code/rawdata_conversion). fMRI data conversion from DICOM to NIfTI format was performed with heudiconv (https://github.com/nipy/heudiconv), and the de-identification of these images was implemented with mridefacer (https://github.com/hanke/mridefacer). The data release also contains the implementations of the stimulation paradigm.
This dataset is compliant with the Brain Imaging Data Structure (BIDS) specification12, which is a new standard to organize and describe neuroimaging and behavioral data in an intuitive and common manner. Extensive documentation of this standard is available at http://bids.neuroimaging.io. This section provides information about the released data, but limits its description to aspects that extend the BIDS specification. For a general description of the dataset layout and file naming conventions, the reader is referred to the BIDS documentation. In summary, all files related to the movie data acquisition of a particular participant can be located in a sub-<ID>/ses-movie/ directory, where ID is the numeric subject code.
In order to de-identify data, information on center-specific study and subject codes has been removed using an automated procedure. All human participants were given sequential integer IDs. Furthermore, all BOLD images were ‘de-faced’ by applying a mask image that zeroed out all voxels in the vicinity of the facial surface, teeth, and auricles. For each image modality, this mask was aligned and re-sliced separately. The resulting tailored mask images are provided as part of the data release to indicate which parts of the image were modified by the de-facing procedure (de-face masks carry a _defacemask suffix to the base file name).
If available, complete device-specific parameter protocols are provided for each acquisition modality.
Data from the participants' self-reports on basic demographic information and familiarity with the ‘Forrest Gump’ movie are available in participants.tsv, a tab-separated values (TSV) file. This is an update of the demographics presented in the original data descriptor2. A description of all data columns is given in Table 2. This file contains information on both participant groups (MRI and in-lab eye tracking). MRI participants have a numerical ID of 20 or less.
fMRI data files for the movie stimulation contain a *ses-movie_task-movie*_bold pattern in their file name. Each image time series in NIfTI format is accompanied by a JSON sidecar file that contains a dump of the original DICOM metadata for the respective file. Additional standardized metadata is available in the task-specific JSON files defined by the BIDS standard.
Time series of pleth pulse and respiratory trace are provided for all BOLD fMRI scans in a compressed three-column text file: volume acquisition trigger, pleth pulse, and respiratory trace (file name scheme: _recording-cardresp_physio.tsv.gz). The scanner’s built-in recording equipment does not log the volume acquisition trigger nor does it record a reliable marker of the acquisition start. Consequently, the trigger log has been reconstructed based on the temporal position of a scan's end-marker, the number of volumes acquired, and with the assumption of an exactly identical acquisition time for all volumes. The time series have been truncated to start with the first trigger and end after the last volume has been acquired.
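The logic of such a trigger reconstruction can be illustrated with a short sketch. This is not the released conversion code. It assumes that the end-marker coincides with the end of the last volume and that every volume takes exactly one TR; the function name and the example end-marker time are hypothetical.

```python
from typing import List


def reconstruct_trigger_times(end_marker_s: float,
                              n_volumes: int,
                              tr_s: float = 2.0) -> List[float]:
    """Back-compute volume trigger times from a scan's end-marker.

    Assumes the end-marker marks the end of the last volume and that
    all volumes have an identical acquisition time (one TR each).
    """
    first_trigger_s = end_marker_s - n_volumes * tr_s
    return [first_trigger_s + i * tr_s for i in range(n_volumes)]


# Hypothetical example: end-marker found 910 s into the physio log of a
# 451-volume run (the volume count of movie segment 1)
triggers = reconstruct_trigger_times(end_marker_s=910.0, n_volumes=451)
print(triggers[0], triggers[-1])
```

Truncating the physiological time series to the interval from the first trigger to one TR after the last trigger then corresponds to the procedure described above.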
Eye gaze recordings
Eye movement recordings are provided in two flavors: 1) raw, vendor-specific log files and 2) normalized gaze time series in a BIDS-compliant text file.
Raw eye gaze data in Eyelink ASCII format
For each movie segment, a dedicated log file with all raw data from the eye tracker is provided. It has been created by converting the original binary data file into ASCII text format using the vendor-supplied edf2asc tool and removing two header lines containing participant identifying information. Additionally, all files have been compressed with gzip. These files include annotated calibration parameters, acquisition time, gaze coordinates (x,y) in screen pixels, and pupil dilation (area); they also contain basic information about automatically detected eye movements such as saccades, fixations, and blinks.
Files for eye gaze recordings performed simultaneously with fMRI data acquisition can be found at sub-<ID>/ses-movie/func/sub-<ID>*_eyelinkraw.asc.gz, whereas data from the separate lab experiment is available at
Note that all values in the data files from the in-scanner eye gaze recordings that specify information in units involving visual degrees cannot be used as such. Due to a misconfiguration, these measurements are based on a screen width of 37.6 cm instead of the actual 26.5 cm. Consequently, the actual values are about 30% smaller than the values recorded in the files (26.5/37.6≈0.70). This issue affects neither the actual eye gaze coordinates (recorded in screen pixels) nor the raw data files of the in-lab recordings. Moreover, the pupil area measurement was not calibrated; hence, the actual physical pupil area cannot be computed from these measurements.
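A first-order correction for degree-based values follows from the ratio of the actual to the misconfigured screen width. The sketch below relies on a small-angle approximation (angular size taken as proportional to physical size at a fixed viewing distance), which is an assumption made here for illustration; the names are not part of the data release.

```python
CONFIGURED_WIDTH_CM = 37.6  # screen width assumed by the eye tracker
ACTUAL_WIDTH_CM = 26.5      # true screen width in the scanner

# Small-angle approximation: angles scale linearly with physical size
DEGREE_SCALE = ACTUAL_WIDTH_CM / CONFIGURED_WIDTH_CM


def correct_degrees(recorded_deg: float) -> float:
    """Approximate the true value of a degree-based in-scanner measurement."""
    return recorded_deg * DEGREE_SCALE


print(f"scale factor: {DEGREE_SCALE:.3f}")
print(f"a recorded 10.0 deg saccade is about {correct_degrees(10.0):.1f} deg")
```

The factor of about 0.70 means that corrected values are roughly 30% smaller than the recorded ones. For large eccentricities, an exact conversion via the recorded pixel coordinates and the true screen geometry would be preferable.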
Normalized eye gaze data
In addition, eye gaze coordinate time series are provided for all movie segments in a gzip-compressed tab-separated values text file. Each file contains four columns. The first two contain the X and Y coordinates of the eye gaze, followed by a pupil area measurement, and the numerical ID of the movie frame presented at the time of the measurement (the very first frame in each segment has an ID of 1), as recorded by the eye tracker. The sampling rate is uniformly 1000 Hz, resulting in 1000 lines per second, with the first line corresponding to the onset of the movie stimulus.
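Reading such a file requires only the Python standard library. The following is a minimal sketch, assuming that the files have no header row and that samples without valid gaze data are stored as a non-numeric token such as nan or n/a; both points should be verified against the released files. The type and function names are illustrative.

```python
import csv
import gzip
import math
from typing import Iterable, Iterator, NamedTuple

SAMPLING_RATE_HZ = 1000.0


class GazeSample(NamedTuple):
    x: float      # movie frame pixels
    y: float      # movie frame pixels
    pupil: float  # uncalibrated pupil area
    frame: float  # ID of the movie frame on display (first frame is 1)


def _to_float(value: str) -> float:
    """Parse a number; map unparsable tokens (assumed missing data) to NaN."""
    try:
        return float(value)
    except ValueError:
        return math.nan


def read_gaze(lines: Iterable[str]) -> Iterator[GazeSample]:
    """Yield one GazeSample per line of tab-separated text."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) >= 4:
            yield GazeSample(*(_to_float(v) for v in row[:4]))


def load_gaze(path: str) -> list:
    """Load all samples from a gzip-compressed gaze file."""
    with gzip.open(path, "rt", newline="") as fobj:
        return list(read_gaze(fobj))


def sample_time_s(index: int) -> float:
    """Time of the sample at a zero-based line index relative to movie onset."""
    return index / SAMPLING_RATE_HZ
```

Because the first line corresponds to movie onset and the sampling rate is uniform, the line index alone determines the time of each sample.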
These data have been normalized such that all gaze coordinates are in native movie frame pixels, with (0,0) being located at the top-left corner of the movie frame (excluding the gray bars) and the lower-right corner (again without the bars) located at (1280,546). This corrects for the different display resolutions between in-scanner and in-lab recordings and for the varying display location in the in-scanner recordings. Moreover, the in-scanner recordings have been temporally normalized by shifting the time series by the minimal video onset asynchrony of 100 ms. The implementation of the normalization procedure can be found at
In-scanner eye gaze time series are located at sub-<ID>/ses-movie/func/sub-<ID>*_recording-eyegaze_physio.tsv.gz and in-lab recordings are at
Stimulus timing information for each recording segment is provided in BIDS' *_events.tsv files. These six-column text files describe the duration of each movie frame (frameidx) with respect to the MRI volume acquisition trigger signal (lasttrigger) as recorded on the stimulus computer (complementing the timing information of the movie frame index column in the eye tracking data files). Moreover, they contain the timing log of the video (videotime) and audio (audiotime) streams of the stimulus movie, as reported by PsychoPy's video component. The reported videotime corresponds to the end of a frame display in the video stream, so that the very first frame corresponds to a time of 40 ms for this 25 Hz video material.
Subjective story ratings
All participant ratings of subjective story depth are provided in a single JSON file for all movie segments combined: ses-movie/sub-??_ses-movie_task-movie_bold.json and for non-fMRI participants in
All analyses presented in this section were performed on the released data in order to test for negative effects of de-identification or applied normalization procedures on subsequent analysis steps.
During data acquisition, technical problems were noted in a log. All known anomalies and their impact on the dataset are detailed in Table 1.
fMRI data quality
Participant movement is one of the key factors that negatively impact fMRI data quality. We estimated movement by aligning all time series images to a common participant-specific brain template image (rigid-body transformation implemented with FSL's FLIRT). Figure 1 depicts separate summary statistics for estimates of translatory and rotary motion. With the exception of a few outliers (see Table 1), motion-related deviation from the reference image was found to be less than 1.5 mm and 1.5°.
For a description of the overall signal quality in terms of temporal signal-to-noise ratio, the reader is referred to the companion publication7.
Assessing fMRI data quality for natural stimulation paradigms beyond low-level metrics is a challenge due to the complexity of the employed stimulus. Here we employ two strategies to illustrate the presence of a rich signal in the presented dataset: BOLD responses correlated with a known sub-structure of the movie stimulus (portrayed emotions) and within-subject similarity of BOLD responses between this audio-visual movie stimulus and the previous audio-only movie study.
BOLD response correlation with stimulus structure: portrayed emotions
One concern when using complex natural stimuli for research focusing on a particular cognitive function is the possibility that relevant neural signals are shadowed by other simultaneous cognitive processes. We tested whether it is possible to extract the signature of a particular aspect of brain activity under these conditions for the case of the response to emotional aspects of the stimulus.
In affective neuroscience, it is commonly assumed that observing emotional cues (faces, words, characters, and social interactions) elicits physiological brain responses similar to subjectively experienced emotions, especially while viewing film clips13,14. Consequently, modeling BOLD responses in terms of the stimulus' temporal dynamics with respect to portrayed emotions as well as their carrier modalities should reveal brain areas commonly associated with emotion processing. The pattern of activated brain areas need not be identical to those reported in studies using highly controlled experiments, as a more ecologically valid stimulus may be processed differently, but a substantial overlap is nevertheless to be expected.
Here, we used a previously published annotation of emotions portrayed in the movie stimulus3 to test for brain responses that occur over a variety of scenes, independent of character and context. To validate the selectivity of our procedure, we analyzed modality-specific emotion cues (visual or auditory). Additionally, we selected perception of self- versus other-referenced emotions in the movie as a higher-level aspect of affective processing. While the former analysis should yield modality-specific activations in sensory cortices, the latter analysis is expected to reveal areas implicated in social cognition—especially the representation of knowledge about others15.
In the present study, we used probabilistic indicators for the perception of emotion aspects based on the fraction of human observers reporting their presence for any time point in the movie (1 Hz sampling rate)3. Each indicator encodes the inter-observer agreement in the interval [0, 1]: zero indicating no evidence for the presence of a property and 1 representing total agreement for the portrayal of a property across observers. The design matrix for a GLM time series model comprised the following regressors: high arousal, low arousal (evidence for a deviation from an average or normal state of arousal), positive valence, negative valence (bipolar coding, analogous to arousal), self-directed emotion, and other-directed emotion. In addition, the design matrix included regressors indicating evidence for a portrayed emotion from verbal or non-speech audio cues (separately), or via facial expressions, gestures, or other context cues. An additional boxcar regressor-of-no-interest indicated the presence of any speech in the stimulus, regardless of its emotional content. Lastly, a 6-parameter motion estimate was included.
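As an illustration of how a 1 Hz agreement indicator could be turned into a GLM regressor, the following minimal sketch convolves the indicator with a double-gamma response function and samples it once per fMRI volume. The function names, the HRF shape, and the repetition time of 2 s are assumptions made for this example; in the analysis described here, convolution and model fitting were handled by FEAT.

```python
import numpy as np
from scipy.stats import gamma


def double_gamma_hrf(dt=1.0, duration=32.0):
    """Double-gamma response function sampled every `dt` seconds (assumed shape)."""
    t = np.arange(0.0, duration, dt)
    peak = gamma.pdf(t, 6)          # positive lobe, peaking around 5 s
    undershoot = gamma.pdf(t, 16)   # late undershoot
    hrf = peak - undershoot / 6.0
    return hrf / hrf.sum()


def agreement_to_regressor(agreement_1hz, tr=2.0):
    """Convolve a 1 Hz agreement time course with the HRF, then sample once per TR."""
    agreement_1hz = np.asarray(agreement_1hz, dtype=float)
    hrf = double_gamma_hrf(dt=1.0)
    convolved = np.convolve(agreement_1hz, hrf)[: agreement_1hz.size]
    step = int(round(tr))  # input is sampled at 1 Hz, so one TR spans `tr` samples
    return convolved[::step]


# Toy example: 60 s of annotation with full observer agreement from 10 s to 20 s
agreement = np.zeros(60)
agreement[10:20] = 1.0
regressor = agreement_to_regressor(agreement, tr=2.0)
```

Because the indicator is continuous in [0, 1], the resulting regressor scales with inter-observer agreement rather than acting as a simple on/off boxcar.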
A standard mass-univariate analysis was conducted with FEAT 6.0 from FSL 5.0.9 (ref. 16) that performed slice time correction, spatial smoothing (Gaussian FWHM 5 mm), and highpass temporal filtering (Gaussian-weighted least-squares straight line fitting, with sigma 50 s) as part of the pre-processing. Statistical analysis was carried out using FILM with local autocorrelation correction17. Time series analyses were performed for each movie segment separately, and were subsequently aggregated for each subject by averaging in a second-level analysis. Lastly, a group analysis was conducted testing for a mean effect in the group of all 15 subjects (random effects model; FLAME1) for the following contrasts: high>low arousal, positive>negative valence, and self>other-directed emotion, as well as the respective reversed contrasts. Except for the negative>positive valence contrast, there were significant clusters (threshold Z>2.3; P<0.05, corrected) of correlated signal for all conditions. The modulation of valence is rather inhomogeneous across movie segments, in particular for episodes with negative valence, and the approach of averaging results across segments might have impaired the analysis. Figure 1 shows the spatial distribution of these clusters for a subset of the results. Unthresholded maps for all contrasts are available on NeuroVault (http://neurovault.org/collections/1065).
These results are notable for several reasons. First, group results for correlation with visual and auditory emotion cues show pronounced network activity in early visual areas as well as medial prefrontal areas (see Fig. 2). The former is possibly an indication of a stimulus confound or a mediation of perceptual processes for emotional stimuli, while the latter activation, comprising parts of anterior cingulate cortex as well as orbitofrontal (BA 24,32,10) regions, has previously been described by various studies as belonging to the affective divisions of the medial frontal cortex18.
Second, when contrasting the perception of other-directed emotions with that of self-directed emotions, activation patterns appeared in the more dorsal parts of the cingulate cortex as well as in posterior parietal cortices. The latter finding supports meta-analytic findings regarding theory-of-mind (ToM) studies, which suggest that posterior parietal cortices are connected to the medial prefrontal cortex, thereby constituting a basic network for ToM22,23.
The selected results are evidence that these data contain reflections of emotional properties of the movie stimulus and that brain responses to individual aspects of portrayed emotions can be located by means of simple linear models. Furthermore, these results provoke a series of questions that could be studied using this dataset: Are dimensional or discrete emotion effects only visible in ‘emotion areas’ or do they mediate perceptual processing? How does contextual information, such as social interactions or empathetic identification processes, influence the processing of perceived emotions? Additional relevant stimulus annotations are already available3 and further descriptions of this stimulus can enhance its utility as a reference for building upon previous movie stimulus studies on, for example: social perception24, facial expression25, fear26, humor27, sadness/amusement28, disgust/amusement/sexual arousal29, sadness30, happiness/sadness/disgust31, and emotional valence32.
Within-subject response congruency to audio-only movie data
Except for the audio description content and low-level stimulus features due to equipment and filtering differences, the auditory aspect of the stimulation performed here was largely identical to the previous 7 Tesla fMRI acquisition2. Consequently, a substantial similarity of BOLD response time series between the two acquisitions is to be expected in brain areas associated with auditory and speech processing. We tested this with a voxelwise within-subject correlation analysis using the 14 participants for whom both 3 Tesla and 7 Tesla data were available. For each movie segment, the two respective time series images were motion-corrected (MCFLIRT), spatially aligned to each other by means of a rigid-body transformation (FLIRT), and resliced to 2.5 mm isotropic resolution. Images were subsequently smoothed with a Gaussian low-pass filter (4 mm FWHM; NiLearn). Voxel time series were filtered by regressing out residual motion (implemented in PyMVPA using regressors estimated by the motion-correction algorithm) and band-pass filtered using a Butterworth filter (8th order, −3 dB points at 16 s and 250 s; implemented in SciPy). Following this movie-segment-wise pre-processing, voxelwise Spearman rank-correlation coefficients were computed for corresponding movie segments from the audio-only and audio-visual experiments.
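The temporal filtering and correlation steps can be sketched as follows. This is a minimal illustration on synthetic data, not the original analysis script: the function names, the repetition time of 2 s, the use of zero-phase filtering, and the interpretation of "8th order" as SciPy's `butter(4, ..., btype="bandpass")` (which yields an 8th-order band-pass) are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.stats import rankdata


def bandpass(ts, tr=2.0, long_period=250.0, short_period=16.0, order=4):
    """Butterworth band-pass along axis 0 (time); cut-offs are given as periods in s."""
    nyquist = 0.5 / tr
    band = [(1.0 / long_period) / nyquist, (1.0 / short_period) / nyquist]
    # order=4 per band edge gives an 8th-order band-pass filter overall
    sos = butter(order, band, btype="bandpass", output="sos")
    return sosfiltfilt(sos, ts, axis=0)  # zero-phase filtering (assumption)


def voxelwise_spearman(a, b):
    """Spearman correlation per column for two (n_volumes, n_voxels) arrays."""
    ra = rankdata(a, axis=0)
    rb = rankdata(b, axis=0)
    ra -= ra.mean(axis=0)
    rb -= rb.mean(axis=0)
    return (ra * rb).sum(axis=0) / np.sqrt((ra**2).sum(axis=0) * (rb**2).sum(axis=0))


# Synthetic example: voxel 0 shares a slow signal across sessions, voxel 1 does not
rng = np.random.default_rng(0)
n_vol, tr = 450, 2.0
t = np.arange(n_vol) * tr
shared = np.column_stack([np.sin(2 * np.pi * t / 40.0), np.zeros(n_vol)])
audio_only = shared + 0.5 * rng.standard_normal((n_vol, 2))
audio_visual = shared + 0.5 * rng.standard_normal((n_vol, 2))
r = voxelwise_spearman(bandpass(audio_only, tr), bandpass(audio_visual, tr))
```

Computing the correlation on rank-transformed columns keeps the operation vectorized across voxels, which matters for whole-volume data.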
Correlation maps for all participants and movie segments were converted to standard Z-scores using Fisher transformation while normalizing to unit variance using $\sqrt{(n-3)/1.06}$, where n is the number of fMRI volumes in the respective movie segment. All Z-maps were then projected into MNI152 image space using participant-specific non-linear warps (FNIRT) and averaged across all participants and segments. Cluster-thresholding was performed using FSL's easythresh tool, using the brain intersection mask of the 7 Tesla (non-full brain) acquisition as the constraint.
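The conversion of correlation coefficients to Z-scores can be sketched as below. The scaling factor sqrt((n − 3)/1.06) is the common large-sample approximation for Fisher-transformed Spearman coefficients and is assumed here; the function name and the clipping of extreme values are likewise choices made for this example.

```python
import numpy as np


def spearman_to_z(r, n_volumes):
    """Fisher-transform Spearman coefficients and scale them to unit variance."""
    r = np.clip(np.asarray(r, dtype=float), -0.999999, 0.999999)  # avoid infinities
    return np.arctanh(r) * np.sqrt((n_volumes - 3) / 1.06)


# Example: the same coefficient yields a larger Z-score for a longer segment
z_short = spearman_to_z(0.1, n_volumes=200)
z_long = spearman_to_z(0.1, n_volumes=450)
```

Scaling by segment length in this way puts maps from segments with different numbers of volumes on a common scale before averaging.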
Clusters of significant correlation between the two stimulation types were observed in two large bilateral regions covering the auditory cortices and the anterior and posterior STS, in clusters located at Broca's area (BA44/45, bilaterally; speech processing), and in the precuneus (all Z>3.1, P<0.05, cluster-corrected). At a more liberal threshold (Z>2.3, P<0.05, corrected), two large bilateral clusters also contain patches in the paracingulate gyrus/BA9 (implicated in story processing and mentalizing33). Notably, this also includes areas in the temporal occipital fusiform cortex, which is associated with face and scene perception, despite the lack of relevant visual stimulation in the audio-only movie experiment. Figure 3 shows the Z-map thresholded at Z>2.3. The unthresholded map is available at http://neurovault.org/images/14268 for interactive inspection.
Eye tracking data quality
In order to assess the timing accuracy of frame presentation throughout the movie, for each movie segment, and for each participant, we extracted each movie frame's onset time (as recorded by the eye tracker from messages sent by the stimulus computer) and computed histograms of frame durations (Fig. 4a). During the in-scanner experiment, the refresh rate of the presentation device was 60 Hz (1 refresh every ≈16 ms) and the movie frame rate was 25 Hz (1 frame every 40 ms). The overall distribution of durations peaked around two values (32 and 48 ms). More than 95% of the frames lasted for a duration of 32±7 or 48±7 ms. During the laboratory experiment, the refresh rate of the presentation device was 120 Hz (1 refresh every ≈8 ms). The overall distribution of durations peaked around two values (24 and 49 ms). More than 99% of the frames lasted for a duration of 24±1 or 49±1 ms. The overall timing stability of the in-lab sample was higher than that of the in-scanner acquisition. This is likely due to the lack of a direct synchronization signal between the stimulus computer and projector in the MRI-stimulation setup.
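The two-peaked duration distribution of the in-scanner sample follows from displaying 40 ms frames on a 60 Hz device: each frame can only be shown for a whole number of refreshes, here two or three. The sketch below simulates this idealized case and computes the share of frames near the two peaks. The function name, the simulation, and the perfectly regular refresh are assumptions; real onset times would be read from the eye tracker messages.

```python
import numpy as np


def frame_duration_summary(onsets_ms, peaks_ms=(32, 48), tol_ms=7):
    """Return frame durations and the fraction lying within tol_ms of any peak."""
    durations = np.diff(np.asarray(onsets_ms, dtype=float))
    near_peak = np.zeros(durations.shape, dtype=bool)
    for peak in peaks_ms:
        near_peak |= np.abs(durations - peak) <= tol_ms
    return durations, near_peak.mean()


# Idealized simulation: 25 fps movie shown on a 60 Hz display
n_frames = 1000
refresh_ms = 1000.0 / 60.0
# each frame is requested every 40 ms and appears at the next display refresh
refresh_index = np.ceil(np.arange(n_frames) * 40 * 60 / 1000)
onsets = refresh_index * refresh_ms

durations, fraction_near_peaks = frame_duration_summary(onsets)
counts, edges = np.histogram(durations, bins=np.arange(0, 101, 1))
```

In this idealized case every frame lasts two or three refreshes and the mean duration stays at 40 ms; jitter in real recordings broadens the two peaks.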
Due to the nature of the eye tracking technology, some signal loss is inevitable, primarily due to eye blinks; these lost samples are marked as nan (not-a-number) in the data files. For each participant, for each movie segment, the overall quantity of no-signal samples can be considered as an index of information quantity. For the majority of all participants, the signal loss is less than 10% (Fig. 4b) across all movie segments. For 2 in-scanner participants out of the 15, more than 35% of the samples contain no signal (specifically: 85 and 36%). The amount of unusable samples for the remaining 13 participants ranges from less than 1% to about 15%. In-lab acquisitions yielded more contiguous data. Even the sessions with the highest amounts of loss included about 85% of usable data. Importantly, the movie content does not modulate the amount of no-signal samples: there is not a specific segment which shows, across the participants, significantly fewer usable samples with respect to the other segments. This holds for both the in-scanner and in-lab acquisitions.
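Since lost samples are stored as nan, the signal-loss index for a recording reduces to counting such samples. A minimal sketch, assuming the gaze coordinates have already been loaded into an array with one row per sample and x/y columns (the function name and array layout are hypothetical):

```python
import numpy as np


def signal_loss_fraction(gaze_xy):
    """Fraction of samples in which at least one gaze coordinate is nan."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    lost = np.isnan(gaze_xy).any(axis=1)
    return lost.mean()


# Toy recording: 10 samples, two of them lost (e.g., during a blink)
gaze = np.full((10, 2), 640.0)
gaze[3:5] = np.nan
loss = signal_loss_fraction(gaze)
```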
Spatial accuracy and reliability of eye gaze coordinates were estimated via dedicated 13-point test fixation sequences that were performed by each participant immediately prior to and after each movie segment. Results are reported in movie pixels (reference movie frame size: 1280×546 px) to facilitate comparison between the in-scanner and in-lab recordings, which employed different screen resolutions and physical dimensions.
Averaged over the 13 calibration points on the screen, the spatial error prior to the start of the movie segments was 31 px across all participants and movie segments (24/152/21 px median/max/std). The average error for the measurement immediately following the movie segments increased to 41 px (36/170/26 px median/max/std). The average within-subject increase of spatial error from the pre to the post movie segment test was 10 px (7/113/21 px median/max/std). This suggests an average spatial uncertainty of gaze coordinates of 40 px, corresponding to 0.75° of visual angle, and a relatively stable coordinate uncertainty across the duration of the movie segments. Cases of extreme average error (more than two standard deviations from the mean) in the pre or post test are reported in Table 1.
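The error measure for a single fixation test can be sketched as the mean Euclidean distance between the fixation targets and the recorded gaze positions. The function names and the target layout are hypothetical, and the pixel-to-degree conversion simply reuses the stated equivalence of 40 px and 0.75 degrees of visual angle as a linear approximation.

```python
import numpy as np

PX_PER_DEG = 40.0 / 0.75  # derived from the equivalence stated in the text


def mean_fixation_error(targets_px, gaze_px):
    """Mean Euclidean distance (in movie pixels) between targets and recorded gaze."""
    targets_px = np.asarray(targets_px, dtype=float)
    gaze_px = np.asarray(gaze_px, dtype=float)
    return np.linalg.norm(gaze_px - targets_px, axis=1).mean()


def px_to_deg(error_px):
    """Convert an error in movie pixels to degrees of visual angle (linear approx.)."""
    return error_px / PX_PER_DEG


# Toy example: 13 targets, gaze shifted by (3, 4) px before and (6, 8) px after a segment
rng = np.random.default_rng(1)
targets = rng.uniform([0, 0], [1280, 546], size=(13, 2))
pre_error = mean_fixation_error(targets, targets + [3.0, 4.0])
post_error = mean_fixation_error(targets, targets + [6.0, 8.0])
error_increase = post_error - pre_error
```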
The analogous estimates of spatial gaze coordinate error for the laboratory acquisition are generally lower and more homogeneous. The average pre-segment error is 17 px (16/37/5 px median/max/std), the post-segment error estimate is 37 px (32/155/22 px median/max/std), and the average within-subject pre/post-segment error increase is 20 px (14/143/24 px median/max/std). This suggests that better optical conditions initially lead to more accurate gaze coordinates. However, accuracy degradation through the time course of the movie segments is more severe than in the in-scanner acquisition, presumably because the head coil constrains head movement more effectively in the scanner. Again, cases of extreme average error (more than two standard deviations from the mean) in the pre or post test are reported in Table 1.
Gaze distribution congruency across participants
In order to compare the in-lab and the in-scanner eye gaze recordings, we analyzed the spatial gaze distributions for every movie frame. Gaze distribution across all participants can be considered as an indicator of the magnitude of bottom-up attention modulation. Movie frames with clearly defined components of high saliency should lead to automatic attention capture and result in relatively synchronous eye movement across individuals. In contrast, movie frames without high-salience features, or temporally static movie frames after a period of initial exploration, should yield less synchronized and less coherent gaze locations across participants. We tested whether a quantification of between-subject gaze-diversity can be used to assess the synchronicity of our eye gaze recordings between the in-lab and in-scanner samples.
We computed a diversity score for every movie frame (duration 40 ms) as the summed absolute difference of each cell of a uniform (flat) 2D gaze histogram and the empirical gaze histogram from the experiment (extent of a single histogram bin was 26×26 px). The latter was generated by composing a heat map from all gaze locations (one averaged location per participant per frame), smoothed by a Gaussian kernel with a standard deviation of 40 px (the estimated spatial accuracy of the eye gaze recordings). High values indicate a high diversity of focus points across participants; low values indicate fewer and/or spatially adjacent focus points. Figure 4c illustrates this score with exemplary movie frames and their empirical gaze distribution for one average and two extreme cases from an arbitrarily selected one-minute segment of the stimulus movie.
Figure 4d depicts the temporal dynamics of this diversity score for 3 s prior to and 4 s after each movie cut, averaged across the more than 850 cuts in the movie. A cut in a movie results in a sudden change of the visual stimulation and is assumed to yield maximum stimulus-driven attention modulation, leading to a decrease in the diversity score. We indeed observe such a global decrease with a local minimum at about 350 ms after a cut. Moreover, Figure 4d documents highly similar temporal dynamics for the in-lab and the in-scanner sample, with the location of the respective minimum being only 50–80 ms apart. This difference could result from timing inaccuracies in the in-scanner sample, or from a slowed behavioral response in the scanner environment (participants resting on their backs). Prior to the diversity minimum, we observe a diversity maximum at around 100 ms after a cut. One possible explanation is that some portion of the movie frames following a cut have multiple highly salient content locations to which saccades are performed immediately, but the saccade target differs across participants due to relatively equal salience. This question can be explored with a more detailed analysis of the provided eye gaze recordings.
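The cut-locked averaging behind such a plot can be sketched as follows, given any per-frame score and a list of cut positions. The function name and the synthetic score (a dip nine frames, i.e., 360 ms, after each cut) are assumptions for illustration only; they do not reproduce the actual diversity score.

```python
import numpy as np


def cut_locked_average(score, cut_frames, fps=25, pre_s=3.0, post_s=4.0):
    """Average a per-frame score in a window around each cut.

    Returns the lags (in seconds, relative to the cut) and the mean score per lag.
    Cuts whose window would extend beyond the time series are skipped.
    """
    score = np.asarray(score, dtype=float)
    pre, post = int(round(pre_s * fps)), int(round(post_s * fps))
    epochs = [
        score[c - pre : c + post + 1]
        for c in cut_frames
        if c - pre >= 0 and c + post < score.size
    ]
    lags = np.arange(-pre, post + 1) / fps
    return lags, np.mean(epochs, axis=0)


# Synthetic example: constant score with a dip 9 frames (360 ms) after every cut
score = np.ones(3000)
cuts = np.arange(300, 2700, 300)
score[cuts + 9] = 0.0
lags, average = cut_locked_average(score, cuts)
lag_of_minimum = lags[np.argmin(average)]
```

Applying the same function to the in-lab and in-scanner scores separately would allow the positions of their minima to be compared directly.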
Overall, we conclude that our diversity score captures relevant information about the gaze location congruency across participants, and that we observe highly similar temporal dynamics in both participant samples—indicating that these data are suitable for cross-sample comparisons.
The procedures we employed in this study resulted in a dataset that is highly suitable for automated processing. Data files are organized according to the BIDS standard12. Data are shared in documented standard formats, such as NIfTI or plain text files, to enable further processing in arbitrary analysis environments with no imposed dependencies on proprietary tools. Conversion from the original raw data formats is implemented in publicly accessible scripts; the type and version of employed file format conversion tools are documented. Moreover, all results presented in this section were produced by open source software on a computational cluster running the (Neuro)Debian operating system9. This computational environment is freely available to anyone, and it—in conjunction with our analysis scripts—offers a high level of transparency regarding all aspects of the analyses presented herein.
All data are made available under the terms of the Public Domain Dedication and License (PDDL; http://opendatacommons.org/licenses/pddl/1.0/). All source code is released under the terms of the MIT license (http://www.opensource.org/licenses/MIT). In short, this means that anybody is free to download and use this dataset for any purpose as well as to produce and re-share derived data artifacts. While not legally required, we hope that all users of the data will acknowledge the original authors by citing this publication and follow good scientific practice as laid out in the ODC Attribution/Share-Alike Community Norms (http://opendatacommons.org/norms/odc-by-sa/).
How to cite this article: Hanke, M. et al. A studyforrest extension, simultaneous fMRI and eye gaze recordings during prolonged natural stimulation. Sci. Data 3:160092 doi: 10.1038/sdata.2016.92 (2016).
Hanke, M. OpenfMRI ds000113d (2016).
We acknowledge the support of the Combinatorial NeuroImaging Core Facility at the Leibniz Institute for Neurobiology in Magdeburg. Michael Hanke was supported by funds from the German federal state of Saxony-Anhalt and the European Regional Development Fund (ERDF), Project: Center for Behavioral Brain Sciences. Vittorio Iacovella was partially supported by the Autonomous Province of Trento, Call ‘Grandi Progetti 2012’, project ‘Characterizing and improving brain mechanisms of attention—ATTEND’. This research was, in part, co-funded by the German Federal Ministry of Education and Research (BMBF 01GQ1112, 01GQ1411) and the US National Science Foundation (NSF 1129855, 1429999) as part of two US-German collaborations in computational neuroscience (CRCNS).
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0 Metadata associated with this Data Descriptor is available at http://www.nature.com/sdata/ and is released under the CC0 waiver to maximize reuse.