Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers

The study of articulatory gestures has a wide spectrum of applications, notably in speech production and recognition. Sets of phonemes, as well as their articulation, are language-specific; however, existing MRI databases mostly include English speakers. In our present work, we introduce a dataset acquired with MRI from 10 healthy native French speakers. A corpus consisting of synthetic sentences was used to ensure a good coverage of the French phonetic context. A real-time MRI technology with temporal resolution of 20 ms was used to acquire vocal tract images of the participants speaking. The sound was recorded simultaneously with MRI, denoised and temporally aligned with the images. The speech was transcribed to obtain phoneme-wise segmentation of sound. We also acquired static 3D MR images for a wide list of French phonemes. In addition, we include annotations of spontaneous swallowing.

MRI with temporal resolution of 7 frames per second of one female Portuguese speaker was published in 26 . Static 3D MR images of five Japanese vowels pronounced by one male speaker are presented in 27 . A dataset of 3D vocal tract shapes was published recently 28 for two German native speakers. A 2D dynamic with 3D static MRI database including 2 male French speakers was also acquired earlier 29 . Nevertheless, the available data does not allow exhaustive investigation of these languages. Moreover, all the existing publicly available databases offering high spatio-temporal resolution dynamic MRI, exploit similar acquisition technologies due to the fact that they are acquired by the same research team. Availability of datasets of different qualities could serve to get better precision in some aspects.
In this work we report on a multi-modal MRI database consisting of 2D real-time and 3D static MR images of the vocal tract of 10 French speakers. The protocol used for the real-time MRI acquisitions for our dataset was successfully used by multiple groups in the context of the study of the articulators' motion [30][31][32][33] . While performing investigations on the vocal tract organs, it is crucial to consider the diversity of their movements during speech production. Standard French language includes 35 phonemes (18 consonants, 14 vowels, 3 semi-vowels) which form 1290 diphones 34 and many complex consonant clusters. To cover this variability as much as possible, a corpus was previously developed 29 . The corpus allows to explore numerous phenomena specific to the French language such as nasal vowels, uvular /ʁ/ 35 , French /y/, short /ɥ/, and strong anticipation of labial features 36 . The dataset includes annotations of the speech and of spontaneous swallowing and will thus provide researchers with data having a good coverage of the French phonetics to further explore French speech production and physiological processes taking place in the vocal tract vicinity. Methods participants and speech task. The participants were 5 male and 5 female native French speakers (aged 29 ± 8 years) without any speech or hearing problems. Presence of any metal in the vocal tract vicinity, which may generate susceptibility artifacts, was also an exclusion criterion. Relevant patient characteristics are listed in Table 1. A set of mid-sagittal images demonstrating speakers' anatomy is presented in Fig. 1.
All participants provided written informed consent, including written permission to publish the materials of this experiment. The data was recorded under the approved ethical protocol "METHODO" (ClinicalTrials. gov Identifier: NCT02887053). The study was approved by the institutional ethics review board (CPP EST-III, 08.10.01).
The previously designed corpus 29 was presented in form of a pdf-file (see Supplemental Materials) which was projected on a screen in the MRI room so that a speaker could read it during the experiment without difficulty. The corpus included two parts. The first one served for the acquisition of dynamic 2D data and included 77 sentences which were constructed to provide an almost-exhaustive coverage of the French phonetic contexts of vowels /i,a,u,y/ and some nasal vowels selected from / , o, α ε/. Several levels of criteria were used to guide the manual construction of those sentences. After the insertion of a new sentence the first level of criteria evaluated was the number of VV for all the vowels, the number of CV for C in /p, t, k, f, s, ∫, l, ʁ, m, n/ and V in /i, a, u/ plus /y/, the number of VC with C as a coda and C in /l, ʁ, n, m/ and V in /i, a, u, y, e, ɛ, o, ɔ/, the consonant clusters C1C2V with C in /p, t, k, b, d, g, f/, C2 in /ʁ, l/ and V in /a, i, u, y/ (the other CCV following the same pattern with /s, ∫, v/ are rare in French), and VC, of C in a coda, and 15 complex consonant clusters (at least a sequence of 3 consonants, between two vowels). Except for those clusters and with very few exceptions all the contexts appear within words to avoid the effect of prosodic boundaries. This first level of criteria covers the very heart of the corpus in terms of mandatory phonetic contexts. We wanted well-constructed French sentences and therefore words not corresponding to the target contexts were added. They provide new contexts, and in particular contexts with vowels outside the set of cardinal vowels plus /y/. VCV are counted by considering groupings of close vowels. There are 6 groups of vowels (/i, e/, /ɛ, a/, /u, o, ɔ/, /y, ø/, /oe, ə/ and nasal vowels / , o, α ε /. This provides a second level of evaluation which helps required words to build well-constructed sentences. Other words required to form well-constructed sentences provide additional contexts with the remaining vowels. All the words (except "cartoons" and "squaw") are of French origin to avoid any ambiguity of pronunciation. The sentences were divided on groups of 3-5 to ensure comfortable duration of a session. The speech task for 2D real-time acquisition is presented in Online-only Table 1.
The second part comprised 3D static data acquisition and included 5 silent acquisitions with different positions of the tongue (against the upper teeth, against the lower teeth between the incisors, retroflex and deep www.nature.com/scientificdata www.nature.com/scientificdata/ retroflex), 14 vowels and 12 consonants (each in context of several vowels). The full set of the phonemes and positions included into the speech task for 3D static acquisitions, is listed in Online-only Tables 2 and 3. The first three positions were designed to help the registration of teeth within MRI data. The participants were asked to keep the same position before and up to the end of the MRI noise in case of silent positions or vowels. For a consonant C in context of a vowel V they were instructed to phonate V until the end of the countdown of an MRI operator who then started a sequence, then to keep the articulation of C until the end of the MRI noise and then phonate V again (the consonant production in this case is called blocked articulation). This helped speakers reach the expected articulatory position for each of the consonants articulated within a given vowel context. Duration of each sequence for the static 3D data acquisition was chosen to be 7 seconds, as a compromise between the volunteers' comfort and the image quality. While sometimes it was not very natural to keep the same position for such time, especially in the case of plosive consonants, the task appeared to be absolutely feasible.
The presentation was sent to the volunteers before the experiment so that they had at least one day to get familiar with the sentences which sometimes sound strange even if they are all well-formed French sentences. Additionally, the speakers were carefully instructed directly before the acquisition to guarantee correct understanding of the task.
Data acquisition and alignment. The MRI data was recorded at Nancy Central Regional University Hospital on a Siemens Prisma 3 T scanner (Siemens, Erlangen, Germany). The speakers were in supine position and the Siemens Head/Neck 64 coil was used.
For the 2D real-time we used radial RF-spoiled FLASH sequence 13 with TR = 2.22 ms, TE = 1.47 ms, FOV = 22.0 × 22.0 cm, flip angle = 5°, and slice thickness was 8 mm. Pixel bandwidth was 1670 Hz/pixel. Image size was 136 × 136, and in-plane resolution was 1.6 mm. Images were recorded at a frame rate of 50 frames per second and reconstructed with a nonlinear inverse technique presented in 13 . This method represents a formulation of a nonlinear optimisation problem with respect to both image and coil sensitivity maps which is solved iteratively with regularized Gauss-Newton method. The protocol used for dynamic data acquisition differs from the protocols of the published publicly available databases, and thus the quality is also somewhat different. Radial encoding trajectories are shorter than spiral ones, so that the repetition time is lower in our case (2.22 ms comparing to 6.004 ms in the latest dataset 25 ) which probably decreases the manifestation of the off-resonance effects. Also, some residual aliasing artifacts can be remarked in case of the datasets 16 and 15 . However, due to the larger slice thickness (8 mm comparing to 6 mm), the partial volume effects are more pronounced in our case. Protocol 17 proposes higher temporal resolution (83 frames per second 24,25 ), while the protocol used in our case offers higher in-plane spatial resolution.
Audio was recorded at a sampling frequency of 16 kHz inside the MRI scanner by using a FOMRI III optoacoustics fibre-optic microphone (FOMRI III, Optoacoustics Ltd., Mazor, Israel) placed in the scanner. The volunteers wore earplugs to be protected from the scanner noise, but were still able to communicate orally with the experimenters via an in-scanner intercom system. Since the sound was recorded at the same time with the MRI acquisition, some additional noise is present in the audio signal. In order to suppress the noise, we used the algorithm proposed in 37 . The algorithm relies a on the hypothesis that a noisy sound represents a gaussian mixture of two components (voice and MRI noise in this case), and the decomposition is based on an expectation-maximisation algorithm. The components are characterized by some source and resonator spectral features. A clear speech recording, which was required by the algorithm, was done separately for each speaker just before the dynamic MRI acquisition starts, with the same patient and microphone positions. www.nature.com/scientificdata www.nature.com/scientificdata/ In order to align the sound with MRI data, we used Signal Analyzer and Event Control system (SAEC) which had previously been designed 38 . We applied it to record timestamps of the reconstruction start and the sequence stop events and send them to a channel of the opto-acoustic system which allows MRI transistor-transistor logic (TTL) commands recording. This enabled automatic synchronisation of the images and the sound. However, during manual examination it was found that the images are somewhat shifted with respect to the sound after the automatic alignment. A shift of 2 images (40 ms) was explained by the application of the temporal median filter of width 5 (100 ms) during reconstruction. Nevertheless, due to some temporal variations of the TTL signals and/ or sound reception, probably caused by USB jitter, the temporal shift between the images and the sound slightly varied from series to series. Thereby, the shifts were required to be manually defined for each series by comparison of sound amplitude and/or spectrograms with MR images. From a practical point of view, for this purpose we selected the events which are clearly visible on both the acoustic signal and the images: onsets of the sounds /p/ and /b/. This event corresponds to the time moment where the lips contact occurs and results in an abrupt and considerable acoustic signal weakening. In most of the recorded sequences there were /p/ and /b/ near the recording extremities. If this was not the case, the fallback solution was to use the contact between the tongue tip and teeth alveoli for /t/ and /d/ which also corresponds to a strong weakening of the signal. Thus, the full pipeline consisted of four steps. 1. Data acquisition. 2. Automatic alignment using the TTL timestamps. 3. Sound cropping, so that it fits the MRI acquisition time interval, which was necessary for the denoising, and the denoising itself. 4. Manual shift determination and re-alignment. The illustration of the different steps of the alignment routine is given in Fig. 3.

Transcription of the continuous speech corpus.
Speech transcription is the temporal sentence-wise, word-wise or phoneme-wise segmentation of an audio recording. Together with the synchronisation, it provides correspondence between images and pronounced phonemes which is helpful for many practical applications. Even though the speech task was pre-defined, some speakers made mistakes or repeated some syllables or words. To take these deviations with respect to the expected pronunciation into account, each recording was inspected by an investigator and all the hesitations or repetitions were included to the text used for the alignment. The transcription was done in two steps. First, the sentences start and end were manually annotated from the sound, which was denoised and cut to fit the MRI acquisition time interval (see Fig. 3). This was done using Transcriber 1.5.2 (http://trans.sourceforge.net/en/presentation.php). This software generates .trs files which include the timestamps and the corresponding text. Text annotations and audio signal were synchronized by a forced alignment automatic speech recognition system Astali (http://ortolang108.inist.fr/astali/) trained on French. It used the .trs text annotations together with the denoised sound to perform the temporal segmentation (both word-wise and phoneme-wise). The phonemes are stored inside a file using SAMPA phonetic annotation system 39 .
Swallowing detection. Swallowing is defined as a series of mechanisms allowing transportation of food, drinks or saliva to the stomach. This mechanism occurs in four steps: (1) A pre-emptive phase, with lip closure; (2) an oral phase, which corresponds to the bolus transportation from the front to the posterior area of the oral cavity, in order to reach the pharynx; (3) a pharyngeal phase, where the bolus continues its way through the pharynx, and (4) an oesophageal phase allowing the bolus to penetrate into the stomach.
The oral phase is particularly interesting, owing to the complexity of the muscle contractions and anatomical movements achieved during this step. During this phase, which lasts about one second, the mandible is stabilized when the teeth are in contact and in maximal intercuspation occlusion (MIO), after lip closure. At the same time, the tongue initiates a propulsion movement which begins in the anterior hard palate, and then performs a second contraction to propel the bolus at the rear up to the pharyngeal areas. Following this contraction, the oropharynx is closed to prevent bolus penetration into the upper airways (UA). The bolus can then continue its way to the pharynx.
Swallowing pathologies are numerous, and imaging devices remain limited to study in real-time physiological movements of the anatomical structures of the upper airways. In order to facilitate this observation and identify the oral phase of swallowing, we propose a protocol to determine the start and end positions of swallowing on our MRI images in real time. (1) The image counting starts when the apex of the tongue touches the hard palate. However, for some images recorded during speech, the tongue might already be in this position while swallowing 3D SAG AX COR 3D SAG AX COR P6 P10 Fig. 2 Examples of 3D static images: P6 pronouncing /l/ in context of /a/, and P10 pronouncing /ã/. The 3D volumes were cropped for better visibility.
www.nature.com/scientificdata www.nature.com/scientificdata/ begins. In these cases, we took as a landmark the most anterior contact of the dorsal part of the tongue with the hard palate. (2) At the beginning of the oral phase of swallowing, the elevation of the hyoid bone is observed. The end of the oral phase has been described as the moment when the space between the tongue and the velum and/ or the soft palate reappears. At the same time, the hyoid bone returns to its initial position, and the oropharynx relaxes to allow the ventilatory flow to recover.

Data Records
The data is available on figshare 40 . Each of the 10 folders contains the data of a speaker. To summarize, a folder (except the second one) contains 16 dynamic and 76 static series. Each of the dynamic series counts 1800 to 2200 frames and has duration about 1 minute, which results to overall 34800 images (approximately 15 minutes) per subject. Each static series consists of 36 slices.
The dataset is organized as follows: the root directory contains ten speakers' folders with names "PXX" (XX here is a patient code). The speaker data is divided into three folders: DCM_2D, DCM_3D and OTHER. Inside DCM_2D and OTHER folders there are 16 subfolders with names SYY (YY is a series number) which correspond to the different dynamic acquisitions. Files stored inside the subfolders of DCM_2D are DICOM files with the MRI 2D dynamic data, and files contained inside subfolders of the OTHER folder are the corresponding denoised cropped sound, TEXT_ALIGNMENT_PXX_SYY.trs file (sentence alignment), TEXT_ALIGNMENT_ PXX_SYY.textgrid file (alignment of words and phonemes), SWALLOWING_PXX_SYY.trs which has the same format as "sentence alignment" and contains swallowing timestamps, and an example video VIDEO_PXX_SYY. avi generated with the provided code (after compression).
The 3D static data can be found inside the subfolders of the DCM_3D folder. The name of those subfolders corresponds to the target phoneme and possibly its vowel context, for instance "tu" for /t/ in the vocalic context of /u/. There are also three static positions to help determine the position of teeth which are not visible on MRI scans. They are denoted as UP (for the tongue touching the upper incisors), DOWN (for the tongue touching lower incisors) and CONTACT (for incisors in contact). The latter was used to check. consistency of the teeth positions (as determined by UP and DOWN).
The correspondence between the folder names and the speech task performed in frames of an MRI series, is given in the Online-only Tables 1-3. The dataset structure is also illustrated in Fig. 4.
The text alignment data are the Transcriber sentence annotation files (.trs files) which include pronounced text and the timestamps of start and end of each sentence on the one hand, and word and phoneme segmentation files (.textgrid Pratt files) 41 on the other hand. The code which allows reading these files is provided.
For P2, 3D static images of consonants phonation were not acquired because of technical reasons, thus only dynamic 2D data and 3D static images of vowels and silent positions were included. All other speakers performed the full list of acquisitions. www.nature.com/scientificdata www.nature.com/scientificdata/

Technical Validation
The dynamic 2D images were visually inspected by the researchers. In general, the images quality was good. Some minor artifacts can be observed as it is pointed out in Fig. 5. Some partial volume effects occurred because of relatively large slice thickness (8 mm). Despite the temporal filter, some residual radial aliasing artifacts were observed. Very fast articulatory gesture can lead to blurring, for example when the tongue tip approaches the alveolar region for /t/, /d/ or /l/.
The video-sound alignment was verified by a researcher having more than 20 years of experience in the field (author Y.L.). This was done by cross checking of the sound track and images using the videos included to the database as described in Methods section. In the case a misalignment found, a proper shift was applied to the sound.
The text alignment was checked by authors K.I., P.-A.V. and Y.L by comparing the sound, the images and the text. Some errors caused by the fact that certain speakers made many pronunciation errors and/or hesitations were generally corrected, however some residual errors can still have place. The list of the corrected mistakes is presented in Online-only Table 4. An experienced reader can see that there are some minor phoneme-wise sound segmentation errors (order of 1 frame which corresponds to 20 ms) in the case of plosive consonants (explained by the fact that almost no sound is produced). We chose to keep the automatic annotations.
The vocal tract shape during 3D static data acquisition, which should correspond to a required phoneme, was visually inspected directly during the course of the experiment by authors K.I. and P.-A.V. In case of obviously wrong positions, the data was reacquired up to three times. The resulting 3D images were checked by author Y.L. The cases of phoneme/image inconsistency for the static 3D data are given in the Table 2. In addition to individual comments given in the Table 2, here are some general trends. Blocked articulations (freezing a position just before producing a consonant) is not a natural gesture in speech production. It is especially difficult to control the velum position since there is no acoustic feedback. This explains why the velum is in a lower position in some cases where it is expected to be in a higher position, for instance stop consonants. Some speakers who were not familiar … WA UP … S2 S16 S1 S16 Fig. 4 The dataset structure illustrated with the example of the second dynamic series and the 3D static image of the consonant /w/ in context of /a/ of the P3 data.

Fig. 5
Artifacts and imperfections that can occur illustrated on the images of P7. Blue arrows point to an aliasing artifact, which, howeher, does not affect the quality of the vocal tract imaging. Yellow arrows point to motion artifacts that take place in case of rapid change of the articulators' position (like the transition from /l/ to /a/, and from /a/ to /k/). The white ellipse points to a tongue region with slightly lower intensity which is caused by relatively large slice thickness and tongue shape variations in the left-right direction (partial volume effects).

P1
Good images. Very extreme retroflex shape.

P2
Only some images were recorded because of a technical problem.

P3
The subject did not understand instructions correctly (no contact for stops, lips not closed for /p/) but sustained sounds, i.e. vowels and fricatives, are correct.

P4
The overall quality of images is not very good. Oral vowel shapes are correct, and in to a lesser extent fricatives. Many blocked articulations do not exhibit expected features (contact for stops, velum position…).

P5
Very good images. /li/ was not articulated correctly.

P7
The contact between the fixed and the mobile articulators is not reached for several stop consonants. The velum is in the upper position for /m/, under-anticipation of the tongue position for /p/, exaggeration of the tongue tip position between upper and lower teeth for /t/, no contact at the place of articulation of some /k/. The articulators' positions are, in general, not very natural.

P8
Strong forward position of the mandible. Instead of being in the upper position the velum is in the lower position for many oral articulations. Several stop articulations without contact between the fixed and mobile articulators (for /p/ and /t/).

P9
The subject did not understand instructions despite several trials. Some vowels are articulated with the mouth closed and some tongue shapes are very unusual for /k/ which has not been articulated correctly.

P10
Good images. The velum is sometimes in the lower position (/a/ and /la/ for instance) for oral sounds. Table 2. Evaluation of articulatory shapes produced.