Printable 3D vocal tract shapes from MRI data and their acoustic and aerodynamic properties

A detailed understanding of how the acoustic patterns of speech sounds are generated by the complex 3D shapes of the vocal tract is a major goal in speech research. The Dresden Vocal Tract Dataset (DVTD) presented here contains geometric and (aero)acoustic data of the vocal tract of 22 German speech sounds (16 vowels, 5 fricatives, 1 lateral), each from one male and one female speaker. The data include the 3D Magnetic Resonance Imaging data of the vocal tracts, the corresponding 3D-printable and finite-element models, and their simulated and measured acoustic and aerodynamic properties. The dataset was evaluated in terms of the plausibility and the similarity of the resonance frequencies determined by the acoustic simulations and measurements, and in terms of the human identification rate of the vowels and fricatives synthesized by the artificially excited 3D-printed vocal tract models. According to both the acoustic and perceptual metrics, most models are accurate representations of the intended speech sounds and can be readily used for research and education.

to make them accessible to non-experts in volumetric MRI processing. The processing steps are summarized in Fig. 1. Because teeth are not visible in MRI data, but highly relevant for speech acoustics17, we created 3D scans of plaster models of the teeth of the subjects. The teeth scans were merged with the MRI data, and the vocal tract was segmented to yield triangle meshes of the vocal tract walls. From the segmented vocal tract surfaces, we created finite element (FE) models for acoustic simulations, as well as 3D-printable solid volume models. The FE models were used to calculate the volume velocity transfer functions of the vocal tract from the glottis to the lips (up to 10,000 Hz). The solid volume models were 3D-printed and used to measure the acoustic transfer functions. The calculated and measured transfer functions were compared in terms of their resonance frequencies. To characterize the models aeroacoustically, stationary airflows at different fluid power levels were injected through the glottis of the 3D-printed models, and the noise generated by the turbulence was measured in front of the lips of the models. Finally, to check whether the processed vocal tract shapes in the dataset are still sufficient representations of the respective speech sounds, the 3D-printed vocal tract models were acoustically excited (vowels with a reed source at the glottis and fricatives with a stationary airflow through the glottis) to generate artificial speech sounds, and human listeners were asked to identify them in a perception experiment.

Fig. 1 Overview of the data acquisition and processing. The shown images and spectra are for the tense vowel /eː/ of speaker 1. XX in the file names is a placeholder for the speech sound labels.
There is a range of potential applications for the provided data. For example, they could be used

• to create articulatory models of the vocal tract or individual articulators based on the provided vocal tract shapes18-24,
• to validate computational acoustic models of the vocal tract, especially simplified (and fast) 1D and 2D models11,25-27, using the provided vocal tract geometries along with the transfer functions as reference,
• to validate computational aeroacoustic models that simulate noise generation in the vocal tract28,29, using the provided vocal tract geometries along with the turbulence noise measurements as reference,
• to assess the acoustic effects of certain geometric features or side cavities of the vocal tract, like the piriform fossae6,30, the interdental spaces, or the vallecula, based on the 3D-printable models,
• as teaching tools to demonstrate the relationship between vocal tract shape and acoustics, based on the 3D-printable models. With suitable physical mechanisms for the voiced excitation31, e.g., a reed pipe as described by Arai32,33, the 3D-printed models can be used to synthesize different vowels,
• in combination with other MRI or CT datasets, to study questions of morphology and anatomic development, gender differences, or inter-speaker anatomic or articulatory variability of the vocal tract34-38.

Methods
This study was conducted according to the ethical principles based on the WMA Declaration of Helsinki and to the current legal provisions. It was approved by the ethics committee of the TU Dresden, and informed consent was obtained from the subjects.
Subjects and speech sounds. Vocal tract shapes of sustained speech sounds were acquired from two native German speakers, one male and one female. The male subject (s1) was 39 years old, 1.85 m tall, and grew up in the state of Mecklenburg-Vorpommern (Mecklenburg-Western Pomerania) in Germany. He was a professional phonetician and speech scientist and a lecturer at the university. The female subject (s2) was 32 years old, 1.64 m tall, and grew up in the state of Sachsen (Saxony) in Germany. She studied speech science, which included professional speech training. Furthermore, she is a trained singer and has been singing in several semi-professional choirs since her childhood.
Each subject produced 22 sustained speech sounds while a volumetric MRI scan of their vocal tract was performed. The data were processed and analyzed as discussed below. The list of speech sounds is given in Table 1 and contains 8 tense vowels, 8 lax vowels, and 6 consonants. The subjects were asked to pronounce each sound like in the word given in the table. The two rightmost columns contain the unique labels for the vocal tract shapes used in the dataset and in the remainder of this paper. In the following, they will be referred to by the placeholder XX.
Acquisition of MRI and reference audio data. The MR images of the vocal tract were acquired on a Siemens 3 T TIM Trio with a 12-channel head coil combined with additional neck elements. The sequence was a sagittal 3D volume interpolated gradient echo sequence (VIBE, fl3d-vibe) with a resolution of 1.2 mm × 1.2 mm × 1.8 mm, 44 sequential slices, matrix size 192 × 192, field of view 230 mm × 230 mm, repetition time TR = 5.53 ms, echo time TE = 2.01 ms, flip angle 9°, Q-fatsat, 22 lines per shot, 7/8 phase partial Fourier, 6/8 slice partial Fourier, iPAT factor 2 (PE only), 24 reference lines, and a bandwidth of 220 Hz/pixel. The acquisition time for one volume was 14 s, during which the speaker produced and sustained the corresponding speech sound. All 22 sounds per speaker were acquired in one session. After each scan, the image quality was carefully checked for blurry parts or motion artifacts due to involuntary movements of the articulators during the 14 s scan time. Each scan was repeated as often as necessary to obtain a clean image.
Before the MRI sessions, the two subjects practiced sustaining the speech sounds for the required duration and with a High German quality. They especially practiced producing the lax vowels with the correct vowel quality, as these are normally produced as short vowels in German.
In addition to the MRI data, audio recordings of the speech sounds were obtained from both subjects. These recordings were not directly made during the MRI scans of the vocal tract because of the high noise level in the scanner. Instead, they were done in a separate session in a soundproofed audio studio using a studio condenser microphone (M930 by Microtech Gefell) connected to a mixing desk (Behringer Eurorack MX 1602) for power supply and preamplification. The signals were digitized with 44,100 Hz and 16 bit using the audio interface 896HD by MOTU and recorded with the Software Audacity 2.2.0 (https://audacityteam.org/) on a standard desktop computer.
The subjects were asked to produce the sounds as similarly as possible to the situation in the MRI scanner and to sustain the sounds for at least 10 s. The recordings were then symmetrically cropped around their center to a length of 1,000 ms for the tense vowels (which normally occur as long vowels in German) and the fricatives, and to a length of 200 ms for the lax vowels. They were then peak-normalized and windowed with a Tukey (tapered cosine) window for a fade-in and fade-out of 20 ms each. Finally, the audio signals were padded with 200 ms of silence at the beginning and end. The resulting audio signals are contained in the dataset as the files XX-reference-sound.wav.

Measurement of maxilla and mandible shapes. For each subject, plaster models of the maxilla and mandible were created by means of alginate impressions according to the standard procedure used in dentistry39. The plaster models were 3D-scanned to obtain 3D boundary models of the objects (see top row of Fig. 1). Scanning was performed with a NextEngine 3D laser scanner and the corresponding NextEngine ScanStudio software. Each plaster model was scanned both from a horizontal view (as in Fig. 1) and a vertical view (standing on the posterior side). In each position, the model was scanned from 7 angles in steps of 51.4° on the turntable with the following settings: Points per Square Inch: 4400 (SD); Target: Normal; Range: Macro. The individual scans per object were then aligned using corresponding points on the surface, fused into a single boundary model, and exported as a binary STL file. These are contained as the files s1-mandible.stl, s1-maxilla.stl, s2-mandible.stl, and s2-maxilla.stl in the dataset.
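The audio post-processing applied to the reference recordings (symmetric cropping, peak normalization, Tukey fade-in/out, silence padding) can be sketched in Python. This is a minimal sketch; the function name and the array-based interface are our own and not part of the dataset's scripts:

```python
import numpy as np
from scipy.signal import windows

def prepare_stimulus(x, fs, target_ms, fade_ms=20, pad_ms=200):
    """Symmetrically crop around the center, peak-normalize, apply a
    Tukey (tapered cosine) fade-in/out, and pad with silence."""
    x = np.asarray(x, dtype=np.float64)
    n = int(round(target_ms / 1000 * fs))
    start = (len(x) - n) // 2                  # symmetric crop around the center
    x = x[start:start + n]
    x = x / np.max(np.abs(x))                  # peak normalization
    alpha = 2 * (fade_ms / 1000 * fs) / n      # tapered fraction (both ends together)
    x = x * windows.tukey(n, alpha=alpha)
    pad = np.zeros(int(round(pad_ms / 1000 * fs)))
    return np.concatenate([pad, x, pad])       # leading/trailing silence
```

For a 1,000 ms tense vowel, `alpha` evaluates to 0.04, i.e., a 20 ms cosine taper at each end.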

Segmentation of the vocal tract.
To obtain the inner surface representations of the vocal tracts from the MRI data, each vocal tract was processed according to the steps below. All required software tools were free and open source.
1. The boundary models of the maxilla and mandible were merged with the MRI data of the vocal tract shape using the software 3D Slicer40 (www.slicer.org). The MRI voxel data were first upsampled to obtain smaller voxels with a uniform edge length of 0.25 mm. Then the triangle meshes of the maxilla and mandible were carefully positioned with respect to the MRI data using affine transforms. Finally, all voxels contained within the closed surfaces of the maxilla and mandible were set to a constant mid-level gray value.
2. The high-resolution voxel data from step 1 were used to segment the vocal tract with the software ITK-SNAP41 (http://www.itksnap.org). The 3D segmentation was performed semi-automatically based on the implemented active contour method42. The nasal cavity was excluded from the segmentation, even when there was a slight velo-pharyngeal opening for some vowels. The segmentation result was a boundary model of the air-filled pharyngeal and oral cavities that extended slightly into the free space in front of the open mouth.
3. The closed boundary model obtained in step 2 was opened at the glottal end and the mouth using the software Blender (www.blender.org). The glottal end was opened with a cutting plane through the vocal folds, while the mouth was opened with curved cutting planes that were fitted to the shape of the lips.
4. The surface model opened at the glottis and the mouth was manually smoothed with a sculpting tool in Blender and a Laplacian filter in the software Meshlab43 (http://www.meshlab.net). Care was taken that important details like the teeth, the uvula, and the epiglottis were not accidentally removed.
The triangle meshes of the inner vocal tract surfaces are provided as the files XX-inner-surface.stl in the dataset.

Creation of 3D-printable models and 3D-printing. To obtain 3D-printable models of the vocal tract, the inner surface meshes were converted into closed solids by giving the vocal tract walls a finite thickness. For each model, we first created an offset mesh as the exterior shell of the solid using the software Meshlab. The offset mesh was created at a distance of 4 mm outwards from the inner surface mesh for a wall thickness of 4 mm, and then trimmed using Blender. The outer shell was then smoothed and fused with the inner shell using Blender. The gaps between the meshes were closed, and a uniform adapter (socket) was added to the glottal end of the model. The adapter was designed as a disk-shaped ring with a thickness of 4 mm and inner and outer diameters of 10 mm and 30 mm, respectively. The upper side of the ring was positioned flush with the glottal plane (the inlet of the vocal tract). Hence, the glottal opening of all models consisted of a hole with a diameter of 10 mm. The complete set of volume models including the adapter is supplied as the files XX-printable-model.stl in the dataset. For easier 3D-printing, the models have also been halved through the midsagittal plane, and the two halves are represented by the files XX-printable-left-half.stl and XX-printable-right-half.stl.
Each vocal tract half was 3D-printed on an Ultimaker 3 printer, which uses fused deposition modeling and has two extruders. The vocal tract walls were printed with the material PLA (polylactic acid, brand "innofil") from one extruder, and support structures were printed with the water-soluble material PVA (polyvinyl alcohol, using the material sold by Ultimaker) from the other extruder. The layer thickness was 0.1 mm, and the infill ratio was 100% for PLA (i.e., the walls were "solid" inside) and 20% for PVA. Both extruders had a nozzle diameter of 0.4 mm. The vocal tract halves were oriented with the cutting plane, i.e., the midsagittal plane, on the build plate. The build plate was heated to 60 °C for better adhesion. The print time was about 20 h per half. The material consumption was about 50 g PLA and 10 g PVA per half, i.e., the mass of a complete vocal tract model was about 100 g. After printing all objects and dissolving their support structures, the two halves of each vocal tract model were carefully sanded on the side that had adhered to the build plate and glued together with cyanoacrylate adhesive ("superglue").
Due to the PLA material used for 3D printing, the walls of the vocal tract models were essentially hard compared to the soft walls of a human vocal tract. For the sake of reproducibility, we made no attempt here to create more realistic soft walls, because suitable methods to achieve this for detailed vocal tract geometries have not been explored yet. However, future studies could readily use the models in the dataset to create and examine soft-walled models.

Measurement of the volume velocity transfer functions.
For each of the 44 physical vocal tract models, the volume velocity transfer function (VVTF) was measured. The VVTF H(ω) is often used to characterize vocal tract acoustics12,44,45 and is usually defined as the complex ratio of the volume velocity U2(ω) through the lips to the volume velocity U1(ω) through the glottis, i.e.,

H(ω) = U2(ω) / U1(ω). (1)

Here, the transfer functions were determined for the case of an infinite glottal impedance, i.e., a closed glottal end of the tubes. The determination of the VVTF based on Eq. (1) is technically very challenging46, because it would require a broadband volume velocity source U1(ω) at the glottis and a broadband volume or particle velocity sensor at the mouth.
A simpler yet precise approach to determine H(ω) was presented by Fleischer et al.47 and was also adopted in the present study. Fleischer's method does not require a volume velocity source or sensor, but determines H(ω) solely from two sound pressure measurements P1(ω) and P2(ω) at the glottis and the lips, respectively, as described below. The method is based on the principle of reciprocity and is theoretically well-founded47.
The experimental setup for the measurements is shown in Fig. 2a. The vocal tract model was placed at a fixed distance of about 30 cm in front of a loudspeaker. A 1/4-inch measurement microphone (MK301E capsule with MV301 preamplifier by Microtech Gefell) was inserted into the hole at the glottal end of the model so that its membrane was flush with the glottal plane. A measurement consisted of two steps. In the first step, the loudspeaker emitted a broadband excitation signal (a sine sweep) into the open mouth of the model, while the sweep response P1(ω) was measured with the glottis microphone. In the second step, the mouth of the model was tightly closed with a plug made of modeling clay (about 1 cm thick), and another 1/4-inch measurement microphone (G.R.A.S. 46BL) was centrally positioned about 2 mm in front of the closed mouth. This microphone recorded the response P2(ω) for the same excitation signal as in step 1. The VVTF was finally calculated as H(ω) = P1(ω)/P2(ω). Both microphones were connected to an audio interface (Terratec Aureon XFire 8.0 HD), which in turn was connected to a laptop computer (MSI GT72-2QE) with the operating system Windows 8.1, 64 bit.
The measurements were performed with the open-source software MeasureTransferFunction 48 , which implements the method by Farina 49 . The excitation signal used in this software was a logarithmic sine sweep with a power band from 100 Hz to 10,000 Hz (fade-in and fade-out from 50-100 Hz and 10,000-11,000 Hz, respectively) and a duration of 10.4 s. The source signal amplitude was set to 0.5, i.e. to 50% of the value range. The output level and input level of the audio interface were set to 100% and 50%, respectively. The audio signals were sampled with 96,000 Hz and quantized with 24 bit. A major benefit of using logarithmic sweeps to characterize acoustic systems is that the linear impulse response can be separated from signal components generated by harmonic distortions 49 . Accordingly, the linear response was manually extracted in all recorded signals before further processing. The different sensitivities of the microphones used at the glottis (3.2 mV/Pa) and the mouth (18 mV/Pa) were compensated by adding 15 dB to the calculated VVTF.
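The final computation of the VVTF from the two recorded responses can be sketched in Python. This is a sketch only: it assumes the reciprocity-based ratio H(ω) = P1(ω)/P2(ω), that the linear responses have already been extracted and time-aligned, and the +15 dB sensitivity correction described above; the function name is ours:

```python
import numpy as np

def vvtf_from_responses(p1, p2, fs, sens_corr_db=15.0):
    """Estimate the VVTF from the glottis response p1 and the
    closed-mouth reference response p2 (both time-domain arrays),
    including the microphone sensitivity correction in dB."""
    nfft = max(len(p1), len(p2))
    P1 = np.fft.rfft(p1, nfft)
    P2 = np.fft.rfft(p2, nfft)
    H = (P1 / P2) * 10 ** (sens_corr_db / 20.0)   # +15 dB = factor 5.62
    f = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return f, H
```

Note that 20·log10(18 mV/Pa / 3.2 mV/Pa) ≈ 15 dB, which is why the fixed correction in the text compensates the different microphone sensitivities.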
Due to small variations in the latency of the audio system, there was usually a small time lag τ between the sweep responses from which P1(ω) and P2(ω) were calculated. According to the time-shift property of the Fourier transform, shifting a time signal by τ changes its spectrum by the factor e^(jωτ), where j = √(−1).
This means that a measured phase response φ(ω) = arg H(ω) is the sum of the "true" phase response and a linear function ωτ, where the slope τ may vary across models. Therefore, to explore the phase responses of the models, it may be more convenient to do so on the basis of the group delay −dφ(ω)/dω, in which the linear function translates into a constant offset.
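The group delay can be computed from a sampled transfer function by numerical differentiation of the unwrapped phase (a minimal sketch; the function name is ours):

```python
import numpy as np

def group_delay(f_hz, H):
    """Group delay -dphi/domega of a sampled transfer function H(f).
    An inter-channel time lag tau appears here only as a constant offset."""
    phi = np.unwrap(np.angle(np.asarray(H)))       # continuous phase in rad
    omega = 2.0 * np.pi * np.asarray(f_hz, dtype=float)
    return -np.gradient(phi, omega)                # seconds
```

For a pure delay H(ω) = e^(−jωτ), this returns a constant value τ at all frequencies, which illustrates why the unknown lag between the two sweep responses does not distort the shape of the group delay curve.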
All measurements were performed in the large climate-controlled anechoic chamber at the TU Dresden at a temperature of 22 °C, an atmospheric pressure of 1007 hPa, and an air humidity of 46%. The anechoic chamber is a free-field room (all six sides covered with 1 m absorbing foam spikes) with a free volume of 1000 m³ and a degree of sound absorption of at least 99% for frequencies between 60 Hz and 16 kHz. Before the measurements, the vocal tract models were tightly wrapped in multiple layers of sound-absorbing fabric. This minimized the external excitation of the (plastic) vocal tract walls by the source signal during the measurement of P1. Wall vibrations due to the external excitation would otherwise interfere with the sound field in the models and create spectral artifacts. The two sweep responses P1 and P2 are contained in the files XX-sweep-primary.wav and XX-sweep-reference.wav in the dataset. The transfer functions H(ω) are given in the files XX-vvtf-measured.txt.
Calculation of the volume velocity transfer functions. For comparison with the measurements of the physical models, the VVTFs were also determined numerically using the finite element method (FEM). The calculation was similar to that described by Fleischer et al.47 and was performed on the basis of the freely available software FEniCS50 (http://fenicsproject.org). To create the FE models, the inner surface meshes of the vocal tract (XX-inner-surface.stl) were first "closed" at the glottal end and the mouth, as in the files XX-inner-surface-closed.stl. These closed-surface meshes were then converted into volume meshes (XX-fem.msh) for the FE simulations with the free software Gmsh51 (http://gmsh.info/). In the volume meshes, the regions of the glottis, the mouth opening, and the vocal tract walls were manually marked to define the boundary conditions for the acoustic simulation. The FE models were discretized with linear shape functions and had between 99,688 (model s1-22-ehe-schwa) and 147,806 (model s2-08-guete-tense-y) degrees of freedom. The maximum mean element size was 2.99 mm (model s1-22-ehe-schwa). For a maximum analysis frequency of 10,000 Hz and a sound velocity of 345 m/s at 22 °C, there were on average 11 elements per wavelength.
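The elements-per-wavelength figure can be verified directly from the quoted numbers (a back-of-the-envelope sketch; the function name is ours):

```python
def elements_per_wavelength(c=345.0, f_max=10000.0, h_mm=2.99):
    """Acoustic wavelength at the highest analysis frequency divided by
    the maximum mean element size of the FE mesh."""
    wavelength_mm = c / f_max * 1000.0   # 34.5 mm at 10 kHz and 22 degC
    return wavelength_mm / h_mm
```

With the worst-case element size of 2.99 mm, this gives roughly 11.5 elements per wavelength, consistent with the average of about 11 stated above and comfortably above the commonly used minimum of 6-10 elements per wavelength for linear elements.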
The acoustic simulation was based on the numerical analysis of the Helmholtz equation

ΔP(x, ω) + κ²P(x, ω) = 0,

where P is the complex-valued scalar acoustic pressure, x is the position in R³, ω is the angular frequency, κ = ω/c is the wave number, and c = 345 m/s is the speed of sound at 22 °C. The application of a frequency-independent particle velocity V0 at the glottis leads to the boundary condition ∂P/∂n = −jωρV0 on the glottal surface, where ρ is the density of air and n is the surface normal. At the mouth opening, a radiation impedance was implemented. The radiation impedance was assumed to correspond to a radiating surface with the effective radius r = √(A_lips/(2π)), where A_lips represents the lip opening area52,53. Based on the computed pressure P_lips at the central point of the lip opening, the default value V0, and the geometrical measures A_lips and A_glottis, the transfer function H(ω) was obtained.

Aeroacoustic noise measurements. To characterize the vocal tract models in aeroacoustic terms, the setup in Fig. 2b was used to create different levels of stationary airflow through the models. For each level, we recorded the volume velocity U_sub, the subglottal pressure P_sub, and the turbulence sound P_rad radiated from the mouths of the models. The airflow was generated by a fan (type U71HL-024KM-43 by Micronel) and led into an air tank, which was connected to a "lung" via a 200 cm long rubber tube with an inner diameter of 19 mm. The air tank and the lung were boxes with inner volumes of 30 × 30 × 50 cm³ and 23 × 23 × 23 cm³, respectively. Both boxes were lined with sound-absorbing foam (NOMA ACOUSTIC, 25 mm, by NMC) and were meant to attenuate the noise from the fan. A short horn connected to a straight tube (18 mm inner diameter) was used to represent the bronchi and the trachea and led the airflow from the lung to the glottal end of the vocal tract models. The dimensions of the horn and the tube were chosen to approximate the cross-sectional area function of the human subglottal system54. Both the horn and the tube were 3D-printed with the material PLA and with a wall thickness of 3 mm (100% infill ratio).
The upper 3 cm of the tracheal tube (corresponding to the conus elasticus) tapered from 18 mm to 10 mm diameter to match the diameter of the glottal hole of the attached vocal tract model. The 3D-printable volume models for these parts are contained in the files trachea.stl and bronchial_horn.stl in the dataset. A data acquisition device (DT9837C by Data Translation) connected to a laptop computer (MSI GT72-2QE running MS Windows 8.1) was used to simultaneously measure

• the radiated sound pressure P_rad using a measurement microphone (1/2-inch capsule MK 250 with preamplifier MV 210 by Microtech Gefell GmbH) positioned 30 cm in front of and 30 cm sideways from the mouth of the vocal tract model and directed towards it (to prevent the airstream from directly hitting the microphone membrane),
• the subglottal pressure P_sub using a pressure measuring device (DMU4 by Kalinsky Sensor Elektronik, Erfurt, Germany) attached to a pressure tap 12 cm below the glottis, and
• the volume velocity U_sub at the entrance of the lung using a flowmeter (type AWM720-P1 by Honeywell).
All three signals were digitized with a sampling rate of 48,000 Hz and quantized with 24 bits. Custom-made software was used to record and display the signals and to control the fan power. The fan power could only be adjusted in (small) steps. For a more precise adjustment of the subglottal pressure and the flow, we used a servo valve attached to the air tank. A single-board computer (type Raspberry Pi 3 Model B+) with a custom-made Python script translated the high-level commands of the software on the laptop computer into electrical control voltages for the fan and the valve. The air tank with the fan and the valve was located in a separate soundproofed chamber to prevent their noise from disturbing the measurements.
For a consistent aeroacoustic characterization of the vocal tract models, we applied the same six levels of fluid power (the product of the subglottal pressure and the volume velocity) to each model, namely 500 mW, 1000 mW, 1500 mW, 2000 mW, 2500 mW, and 3000 mW. Using fixed power levels instead of fixed levels of subglottal pressure or flow made it possible to cope with the wide range of flow resistances across the models. According to the analysis by Stevens45, a fluid power level of 500 mW is roughly typical for "normal" speech production, while about 3000 mW is the maximum that humans can achieve.
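Because the fluid power is the product of subglottal pressure and volume velocity, the pressure implied by a given power level and flow follows directly. An illustrative helper (the function name is ours, not part of the dataset scripts):

```python
def required_subglottal_pressure(power_mw, flow_l_per_s):
    """Subglottal pressure (Pa) that yields a given fluid power,
    P_fluid = p_sub * U_sub, at a given volume velocity."""
    power_w = power_mw / 1000.0          # mW -> W
    flow_m3_per_s = flow_l_per_s / 1000.0  # l/s -> m^3/s
    return power_w / flow_m3_per_s
```

For example, 500 mW at a flow of 0.5 l/s corresponds to a subglottal pressure of 1000 Pa, a roughly typical value for normal phonation.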
For each power level and model, the three signals described above (radiated sound, flow, subglottal pressure) were captured for a duration of 10 s. The audio files of the radiated sounds are included in the dataset as the files XX-noise-500mW.wav, …, XX-noise-3000mW.wav. The samples in these files are floating point values proportional to the sound pressure measured at the microphone, where the value 1.0 corresponds to a sound pressure of 12.62 Pa. For each of these audio files, the power spectral density (PSD) was estimated using Welch's method as implemented in the function pwelch() in the Signal Processing Toolbox of Matlab R2017b. We used overlapping Hamming windows of 1024 samples (which was also the FFT length) and a window overlap of 512 samples, so that the spectral resolution was 46.9 Hz. The resulting PSDs (with the unit Pa²/Hz) for the six power levels are summarized in the files XX-noise-psd.txt. Finally, the average volume velocity, the average subglottal pressure, and the overall sound pressure level (SPL) of the radiated sound for each power level are tabulated in the files XX-noise-metadata.txt. The SPLs were calculated from the audio signal x(k) as SPL = 20 log10(12.62 Pa · x_rms / 20 µPa), where x_rms is the root mean square of x(k).

Synthesis of speech sounds with the 3D-printed models. For each 3D-printed vocal tract model, the corresponding speech sound was (physically) synthesized. The tense and the lax vowels were synthesized with the setup in Fig. 2b, but with a vibrating reed source inserted between the upper end of the trachea and the glottal hole of the vocal tract models, as in Birkholz et al.31. The vibrating reed source was developed by Arai33 and is an improved version of the design published previously32. The subglottal pressure was individually adjusted for each vowel as roughly the midpoint between the onset and the offset pressures of the source in combination with the respective supraglottal model.
The generated sound was captured with a measurement microphone (1/2-inch capsule MK 250 with preamplifier MV 210 by Microtech Gefell GmbH) positioned 30 cm in front of and 30 cm sideways from the mouth of the vocal tract model and directed towards it. The sound generated by each model was recorded for 10 s. The recordings were symmetrically cropped around their center to a length of 1000 ms for the tense vowels and to a length of 200 ms for the lax vowels. They were then peak-normalized and windowed with a Tukey (tapered cosine) window for a fade-in and fade-out of 20 ms each. Finally, the stimuli were padded with 200 ms of silence at the beginning and end. The voiceless fricatives were synthesized with the setup in Fig. 2b with a constant subglottal pressure of 800 Pa. Each of these sounds was also recorded for 10 s, cropped to 1000 ms around the center of the recording, and otherwise processed like the vowel recordings. The resulting audio signals are contained in the dataset as the files XX-model-sound.wav.
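The PSD estimation and SPL computation described above for the noise recordings can be sketched in Python with scipy in place of Matlab's pwelch(). This is a sketch under the stated scaling (sample value 1.0 → 12.62 Pa); the function name is ours:

```python
import numpy as np
from scipy.signal import welch

def noise_psd_and_spl(x, fs=48000, full_scale_pa=12.62):
    """Welch PSD (Pa^2/Hz) and overall SPL (dB re 20 uPa) of a noise
    recording whose float samples are proportional to sound pressure."""
    p = np.asarray(x, dtype=float) * full_scale_pa     # convert to pascals
    f, psd = welch(p, fs=fs, window='hamming',
                   nperseg=1024, noverlap=512, nfft=1024)
    # spectral resolution: fs / nfft = 48000 / 1024 = 46.875 Hz
    p_rms = np.sqrt(np.mean(p ** 2))
    spl_db = 20.0 * np.log10(p_rms / 20e-6)            # re 20 uPa
    return f, psd, spl_db
```

With these settings, a full-scale recording (x_rms = 1) would correspond to 20·log10(12.62/20e-6) ≈ 116 dB SPL, matching the full-scale value stated for the dataset files.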

Data Records
The dataset is available at the platform figshare (https://doi.org/10.6084/m9.figshare.c.4869732)55 and mirrored at www.vocaltractlab.de/dvtd.zip. The included Matlab and Python scripts are additionally available on GitHub (https://github.com/TUD-STKS/DVTD). The repository has the directory structure shown in Fig. 3, with one directory for the data of subject 1, one for the data of subject 2, and one for miscellaneous files. The miscellaneous directory contains the following files:

display_data.m, display_data.py are two functionally identical Matlab and Python scripts to display the measured data.
bronchial_horn.stl, trachea.stl are 3D-printable models of the subglottal structure that was used for the turbulence noise measurements.
combined_plots.pdf contains plots of the VVTFs and the power spectral densities for all models.
definitions.py, fem_run.py, model_names.py, modules.py, to_h5_model.py are Python scripts to run the FE simulations (see section "Code Usage").
The directories for each subject contain two STL files with volume models of the mandible and maxilla, and one sub-directory for each speech sound. Each sub-directory contains the following files (where XX is the speech sound label according to Table 1):

XX-mri/ZZ.ima are the (raw) MRI files of the vocal tract.
XX-inner-surface.stl represents the inner surface geometry of the vocal tract in terms of a triangle mesh. The planes of the glottis and the lips are open. The unit of the coordinate values is mm (as for all other STL files).
XX-inner-surface-closed.stl is similar to XX-inner-surface.stl, but closed at the glottal end and at the lips. It was the basis for the creation of the finite-element model.
XX-fem.msh contains the 3D finite-element mesh, including the definitions of nodes, elements, domain entities, and flags for boundary conditions. It is the native file format of the software Gmsh (http://gmsh.info/) for storing numerical models. This file type is supported by many free partial-differential-equation solvers.
XX-printable-model.stl is the complete 3D-printable solid volume model of the vocal tract.
XX-printable-left-half.stl is the left half of the printable volume model.
XX-printable-right-half.stl is the right half of the printable volume model.
XX-vvtf-measured.txt contains the (linear) magnitude and phase (in rad) samples of the measured VVTF (based on the 3D-printed models) from 100 Hz to 10,000 Hz as tabular data. The frequency resolution is 0.961304 Hz. Below 100 Hz and above 10,000 Hz, the magnitude and phase samples are set to 1 and 0, respectively.
XX-sweep-reference.wav, XX-sweep-primary.wav contain the reference and primary sweep responses from which the VVTF was obtained. These files can be used with the software MeasureTransferFunction48 to retrace the determination of the VVTF.
XX-vvtf-calculated.txt contains the (linear) magnitude and phase (in rad) samples of the calculated VVTF (based on the FE models) from 0 Hz to 10,000 Hz as tabular data. The frequency resolution is 0.961304 Hz (the same as for the measured VVTF).
XX-noise-500mW.wav, ..., XX-noise-3000mW.wav each contain a 10 s recording of the noise generated by a stationary flow through the glottis of the model, for the six values of the fluid power between 500 mW and 3000 mW. The sound was measured 30 cm in front of and 30 cm sideways from the mouth. The sampling rate is 48,000 Hz. The samples are encoded as floating point numbers between −1 and +1 and are proportional to the sound pressure. The value +1 corresponds to a sound pressure of 12.62 Pa (= 116 dB).
XX-noise-psd.txt contains the power spectral densities of the noise measured for the six different fluid power levels as tabular data. The frequency resolution is 46.9 Hz.
XX-noise-metadata.txt contains, for each of the six fluid power levels, the average volume velocity, the average subglottal pressure, and the sound pressure level of the radiated sound (recorded 30 cm in front of and 30 cm sideways from the mouth opening).
XX-model-sound.wav is an audio file that was generated by exciting the 3D-printed vocal tract model with either a vibrating reed source (for the vowels and the lateral) or a constant airflow through the glottis at a subglottal pressure of 800 Pa (for the fricatives). The file was used in the perception experiment for the evaluation. The sampling rate is 48,000 Hz.
XX-reference-sound.wav is an audio recording of the corresponding speech sound by the (real) subject s1 or s2. Due to the high noise level in the MRI scanner, the recordings were made separately from the MRI scan sessions in a sound-proofed room. The subjects were recorded in a sitting position and were instructed to sustain the speech sounds as similarly as possible to the sounds in the scanner. The sampling rate is 44,100 Hz.
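As an example of working with the VVTF files, resonance frequencies can be determined from the magnitude samples by peak picking. This sketch assumes (hypothetically) that the magnitude values are available as an array on a uniform frequency grid starting at 0 Hz with the stated 0.961304 Hz spacing; the function name and the 3 dB prominence threshold are ours:

```python
import numpy as np
from scipy.signal import find_peaks

def resonance_frequencies(mag, df=0.961304, n=3, prominence_db=3.0):
    """Pick the lowest n resonance frequencies (Hz) from linear VVTF
    magnitude samples on a uniform frequency grid with spacing df."""
    level_db = 20.0 * np.log10(np.maximum(np.asarray(mag, float), 1e-12))
    peaks, _ = find_peaks(level_db, prominence=prominence_db)
    return peaks[:n] * df
```

The prominence criterion suppresses small ripples so that only actual resonance peaks are returned.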
The file types used to store the data can be opened with a range of programs on the different operating systems. The following is a list of free and platform-independent software tools (running on Windows, Linux, and MacOS) to open the file types: [...]

[...] a cylindrical tube with a length of L = 165.4 mm and a radius of r = 8.5 mm to roughly represent the vowel /ə/. The tube was both 3D-printed and converted into an FEM mesh. As for all models in the dataset, the VVTF of the printed model was measured and the VVTF of the FE model was calculated. From both transfer functions, the first three resonance frequencies were determined by peak picking. The analytical solution for the resonance frequencies of the tube (closed at the glottal end and open at the other end) is

f_R,n = (2n − 1) · c / (4 · (L + ΔL)), n = 1, 2, 3, ...,

where c = 345 m/s is the sound velocity at 22 °C, L = 165.4 mm is the tube length, and ΔL = 0.6 · r = 5.1 mm is the open-end correction of the tube length. The first three resonance frequencies obtained from the analytical calculation, the measurement, and the FEM simulation are summarized in Table 2 and agree very well, with a maximal difference of 1.4%. Figure 4a shows the measured and calculated VVTFs. They confirm the close agreement of the resonance frequencies and furthermore demonstrate the low noise of the measurement (black curve). However, towards high frequencies, magnitude differences of a few dB occur between the curves. Similar differences also exist for the models of the dataset and are presumably caused by small differences in the losses or boundary conditions.

Validation of noise measurements. In contrast to the transfer functions, the noise measurements obtained during the aeroacoustic characterization of the vocal tract models (Fig. 2b) could not be compared to corresponding simulation results (such simulations are the subject of ongoing research). Instead, we quantified the signal-to-noise ratio (SNR) of the measurements by measuring the noise floor, and validated the absolute scale of the measured audio data to ensure that our custom-made software did not introduce any scaling errors. Two examples of the obtained power spectral densities, for a vowel and a fricative (both at the lowest fluid power level of 500 mW), are shown in Fig. 4b. For frequencies above 200 Hz, the signals are at least 10 dB above the noise floor (dashed line). Furthermore, for the vowel /eː/ of speaker s1 (gray curve), the PSD clearly shows the resonances of the vowel (gray arrows). These peaks correspond well to the resonance frequencies obtained from the VVTF measurements (Table 3).

(2020) 7:255 | https://doi.org/10.1038/s41597-020-00597-w
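The analytical resonance frequencies of the tube can be reproduced with a short Python sketch that evaluates the quarter-wavelength formula f_R,n = (2n − 1) · c / (4 · (L + ΔL)) with the constants given in the text (the function name is ours):

```python
import numpy as np

def tube_resonances(n_max, c=345.0, length=0.1654, end_correction=0.0051):
    """Resonance frequencies of a tube closed at one end and open at the
    other: f_n = (2n - 1) * c / (4 * (L + dL)), n = 1, 2, 3, ...
    Lengths are in meters, the result is in Hz."""
    n = np.arange(1, n_max + 1)
    return (2 * n - 1) * c / (4.0 * (length + end_correction))

print([round(float(f), 1) for f in tube_resonances(3)])  # -> [505.9, 1517.6, 2529.3]
```

These values can be compared directly against the analytical row of Table 2.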
To validate the absolute scale of the measured audio data, the SPLs calculated with Eq. (3) were compared with the readings of a hand-held sound level meter at the position of the measurement microphone. The SPL values agreed within ±2 dB. Accordingly, the audio samples in the noise audio files XX-noise-500mW.wav, ..., XX-noise-3000mW.wav closely approximate absolute sound pressure values (the audio sample value 1.0 corresponds to 12.62 Pa).

Table 2. The first three resonance frequencies of the cylindrical tube obtained by the analytical calculation, the FEM simulation, and the measurement using a 3D-printed physical model.

Analysis of the resonance frequencies of the vocal tract models. To check the similarity of the calculated and the measured transfer functions for the resonators in the dataset, the first three resonance frequencies fR1, fR2, and fR3 were extracted from the magnitude spectra. The obtained resonances are given in Tables 3 and 4 for the male and the female speaker, respectively. The frequency deviations are mostly below 1%, with the highest deviation being 6.0% for fR1 of the /ʃ/ of speaker s1. The average absolute deviations across all three resonances and all vowels are 0.9% for s1 and 0.8% for s2. The overall similarity of the simulated and measured transfer functions (also between the resonance frequencies) is high, as can be visually checked with the provided scripts. As expected, the deviations increase with frequency, because local details of the vocal tract geometry become more important.
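The absolute-scale check is easy to verify numerically, assuming the usual SPL definition SPL = 20 · log10(p_rms / p0) with the reference pressure p0 = 20 µPa (Eq. (3) itself is not reproduced here; the function name is ours):

```python
import numpy as np

P_REF = 20e-6  # reference sound pressure in Pa

def spl_db(pressure_pa):
    """Sound pressure level of an RMS pressure: SPL = 20*log10(p_rms / p0)."""
    return 20.0 * np.log10(np.asarray(pressure_pa, dtype=float) / P_REF)

# The full-scale calibration value of 12.62 Pa corresponds to 116 dB SPL,
# matching the statement in the data description:
print(round(float(spl_db(12.62)), 1))  # -> 116.0
```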
The plots in Fig. 5 show the first two measured resonance frequencies of the vowels. The vowels are arranged in the typical way known from formant plots 45, especially along the periphery of the vowel space (although resonances and formants are not identical concepts and their frequencies may differ 56). The relative order of the resonance frequencies of the lax vowels partly differs between the two speakers; for example, fR2 of /ə/ is comparatively high for s1, but closer to the center of the vowel space for s2. However, the plots confirm that all vowels differ substantially from each other, i.e., no two vowels were produced in exactly the same way. Hence, the recorded samples provide a good coverage of the vowel space.

Table 3. Measured (fRi,m) and calculated (fRi,c) resonance frequencies of the vocal tract models of the male speaker (s1) and their differences in percent.

Table 4. Measured (fRi,m) and calculated (fRi,c) resonance frequencies of the vocal tract models of the female speaker (s2) and their differences in percent.
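The peak-picking step used to extract the resonance frequencies from the magnitude spectra can be sketched as a simple local-maximum search (a minimal illustration with hypothetical spectrum data; the published scripts may use a more robust method, e.g. a prominence-based picker such as scipy.signal.find_peaks for noisy measured spectra):

```python
import numpy as np

def pick_resonances(freq, mag_db, n=3):
    """Return the frequencies of the first n local maxima of a smooth
    magnitude spectrum. Noisy spectra should be smoothed first."""
    interior = np.arange(1, len(mag_db) - 1)
    is_peak = (mag_db[interior] > mag_db[interior - 1]) & \
              (mag_db[interior] > mag_db[interior + 1])
    return freq[interior[is_peak]][:n]

# Hypothetical example: a spectrum with resonance bumps at 500, 1500,
# and 2500 Hz (roughly the tube resonances of the validation above).
f = np.arange(0.0, 5001.0)
spectrum = sum(10.0 * np.exp(-((f - fc) / 60.0) ** 2)
               for fc in (500.0, 1500.0, 2500.0))
print(pick_resonances(f, spectrum))  # peaks at 500, 1500, and 2500 Hz
```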

Results
The results of the three forced-choice tasks are presented in terms of confusion matrices in Fig. 6. As each stimulus was rated five times by 20 listeners, the numbers in the matrices are both absolute values and percentage values. For subject 1, the recognition rate was 68.3% (78.6%) for the model (reference) stimuli of the tense vowels, 53.0% (60.7%) for the model (reference) stimuli of the lax vowels, and 64.4% (82.6%) for the model (reference) stimuli of the fricatives. For subject 2, the recognition rate was 57.0% (77.4%) for the model (reference) stimuli of the tense vowels, 47.3% (71.7%) for the model (reference) stimuli of the lax vowels, and 35.6% (65.8%) for the model (reference) stimuli of the fricatives. Hence, for both subjects and all three groups of stimuli, the synthesized stimuli were recognized worse than the natural stimuli. However, Fig. 6 shows that the patterns of confusion of the speech sounds were quite similar between the natural and synthetic stimuli.
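The recognition rates quoted above are simply the diagonal sums of the confusion matrices relative to the total number of responses. A minimal sketch (the matrix values below are hypothetical, not taken from Fig. 6):

```python
import numpy as np

def recognition_rate(confusion):
    """Percentage of correct responses: diagonal sum over total count."""
    confusion = np.asarray(confusion, dtype=float)
    return 100.0 * np.trace(confusion) / confusion.sum()

# Hypothetical 3x3 confusion matrix (rows: intended sound, columns:
# listener response). With 100 ratings per stimulus, as in this study,
# counts and percentages coincide.
example = [[80, 15,  5],
           [10, 70, 20],
           [ 5, 25, 70]]
print(round(recognition_rate(example), 1))  # -> 73.3
```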
The lower recognition rate of the synthesized phonemes may have multiple reasons:
• The vocal tract models were 3D-printed with hard plastic walls. In reality, the vocal tract walls are softer and lead to a stronger damping of the resonances.
• The nasal port was closed in all vocal tract models. In reality, vowels are often produced with a slightly open nasal port (especially the lower vowels), which affects the overall spectral shape and hence the perception.
• All synthetic vowels were excited with the same vibrating reed source, which was tuned to a typically male fundamental frequency. The low-frequency excitation of the (smaller) female vocal tract models (speaker 2) may have led to perceptual confusions, as this combination occurs rarely in natural speech.
• For the fricatives, the cross-sectional area of the critical constriction is very important for a realistic noise spectrum. The low recognition rates of some model sounds (especially /f/) may be due to overly smoothed vocal tract geometries around the constrictions, which distorted the noise spectra.

Fig. 6 Confusion matrices obtained from the listening experiment with the naturally produced stimuli (reference) and the artificially generated stimuli (model) using the 3D-printed vocal tract models. The left two columns of subpanels show the confusion matrices for subject 1, and the right two columns for subject 2. The top, middle, and bottom rows of subpanels show the results for the tense vowels, the lax vowels, and the fricatives, respectively. As each stimulus was rated five times by 20 listeners, the numbers in the matrices are both absolute values and percentage values.
Despite these limitations, many vowels and fricatives were successfully synthesized, indicating valid and representative vocal tract shapes of German speech sounds in the dataset.

Usage Notes
Running the finite-element code. This section describes how to run the FE simulations on the basis of the files XX-fem.msh to generate the VVTF files XX-vvtf-calculated.txt. Possible reasons to re-run the simulations are the need to evaluate the sound pressures or volume velocities at different positions in the FEM mesh, to generate the spectra with a different frequency resolution, or to extend the equations of the simulation. Running the simulations requires the software FEniCS (https://fenicsproject.org/), which you must install prior to running the scripts, following the guide for your operating system on the website. The subsequent steps were tested under Ubuntu versions 16.04 and 18.04. The first step is the conversion of the MSH file of a model into an XML file by typing dolfin-convert XX-fem.msh XX.xml on the command line (where XX is the sound label, as before). Note that "-fem" is omitted from the name of the XML file; apart from this, the file names must be identical. Using the provided Python script to_h5_model.py (in the misc subdirectory), this XML file must then be converted into an H5 file, which allows the parallel execution of the simulation: python3 to_h5_model.py XX (file name without extension). Both the XML and the H5 version of the model contain all necessary information about the surface IDs and volume IDs that are needed for the FE simulation. To calculate the VVTF for a model, the provided Python script fem_run.py must be called: python3 fem_run.py. The specific subject and model to simulate must be specified in the file definitions.py. The simulated VVTF is written to the file XX-vvtf-calculated.txt in the respective subfolder.
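The three command-line steps above can be collected in a small helper script. The sketch below only assembles and runs the commands exactly as described in the text; actually executing them still requires FEniCS, dolfin-convert, and the dataset files to be present, and the model selection must have been made in definitions.py beforehand.

```python
import subprocess

def fem_pipeline_commands(label):
    """Commands to convert XX-fem.msh to XX.xml and XX.h5 and to run the
    FE simulation, where `label` is the speech sound label XX."""
    return [
        ["dolfin-convert", f"{label}-fem.msh", f"{label}.xml"],
        ["python3", "to_h5_model.py", label],  # file name without extension
        ["python3", "fem_run.py"],             # model is selected in definitions.py
    ]

def run_fem_pipeline(label):
    """Run the pipeline, aborting on the first failing step."""
    for cmd in fem_pipeline_commands(label):
        subprocess.run(cmd, check=True)
```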

Code Availability
All the software used in this study is open source (see the previous sections for references or URLs). In addition, the dataset contains several custom-made Matlab and Python scripts (see section "Data Records").