Automatic vocal tract landmark localization from midsagittal MRI data

The various speech sounds of a language are obtained by varying the shape and position of the articulators surrounding the vocal tract. Analyzing their variations is crucial for understanding speech production, diagnosing speech disorders and planning therapy. Identifying key anatomical landmarks of these structures on medical images is a pre-requisite for any quantitative analysis and the rising amount of data generated in the field calls for an automatic solution. The challenge lies in the high inter- and intra-speaker variability, the mutual interaction between the articulators and the moderate quality of the images. This study addresses this issue for the first time and tackles it by means of Deep Learning. It proposes a dedicated network architecture named Flat-net and its performance are evaluated and compared with eleven state-of-the-art methods from the literature. The dataset contains midsagittal anatomical Magnetic Resonance Images for 9 speakers sustaining 62 articulations with 21 annotated anatomical landmarks per image. Results show that the Flat-net approach outperforms the former methods, leading to an overall Root Mean Square Error of 3.6 pixels/0.36 cm obtained in a leave-one-out procedure over the speakers. The implementation codes are also shared publicly on GitHub.


Methods
Data. The study considers static midsagittal MRI recorded between 2002 and 2011 from 9 French speakers (5 males, 4 females), referred to as subjects in this study, sustaining 62 different articulatory positions, also referred to as classes in the context of machine learning, designed to be representative of the French phonemic repertoire 3,36 . This study does not include any human experiments more than the use of non-invasive MRI data collection mentioned above. All the methods and data acquisition were carried out in accordance with the guidelines and regulations of the local ethic committee called CPP, 'Comité de Protection des Personnes' 37 (English translation: Committee for the Protection of People) and the recording protocols were approved by this ethic committee. All subjects were older than 18 and an informed consent was obtained from all of them. The images have been recorded either on a 1.5 or on a 3 Tesla MRI scanner and have a field of view of 256 × 256 mm 2 and a resolution of 1 mm per pixel. Note that two speakers have been discarded in comparison to Serrurier et al. 3 and Valdés Vargas 36 due to the significantly lower quality of the images, the different fields of view and the different sizes of the images. 21 anatomical landmarks relevant to the study of the speech articulations have been identified. They represent either characteristic landmarks of the speech articulators, such as the tip of the tongue, or the junction between two articulators. They are listed in Table 1 and illustrated for one articulation of one subject in Fig. 1. They have been manually identified on all images of the dataset by an expert. Note that the upper and lower teeth landmarks (UT and LT) denote dental structures and, as such, are not distinguishable from the air on MRI data. They have been determined for each subject by contrast with soft tissues on an articulation acquired on purpose and reported on the other images using their relative position with the hard palate. Please refer to Serrurier et al. 3 for further information regarding this procedure as well as the data collection and processing.
challenges. Localizing landmarks of the vocal tract area on midsagittal MRI images presents particular characteristics. Figure 2 shows a few articulations from different subjects illustrating the diversity of the dataset. The main challenges are summarized in the following list.
• The shapes and positions of the vocal tract articulators are characterized by a high variability, due to the variety of the speech task and to the different morphologies and articulatory strategies of the speakers to perform a same task.
• Some articulators such as the tongue, velum, lips and epiglottis present a high variability in the vocal tract area, leading to very different locations in the dataset associated to the landmarks VT, TT, TS, LLSV, ULPV, ET and TE. • Some tissues may touch each others for certain articulations, leading to hardly distinguishable landmarks at these locations, such as TE, TS and ET. • The larynx area appears very difficult to capture on midsagittal MRI data, leading to a confusing area on the images and hardly identifiable landmarks, such as PL and EG. • Some articulators may occasionally show very different shapes than in the large majority of the articulations, such as the velum rolled up against the tongue for a few articulations 36 , leading to unusual location of the landmark VT. • Two important landmarks for speech production analyses are the teeth landmarks UT and LT, which are not directly visible on the images, as mentioned earlier.
• The images are recorded at different times with different scanners, leading to variable quality and noise levels.
Similarly, the quality and noise level may not be homogeneous within a single image. • Despite the high variability, the size of the available dataset, i.e. 9 subjects with 62 articulations, hence 558 articulations, is rather limited in comparison to those usually used for landmark localization in the literature, e.g. over 10 K images for mentioned refs. 23,34,38 . Methods. In order to compare the method proposed in this paper with the current state-of-the-art in the field, eleven methods taken from the literature have been adapted to our problem and implemented: (1) five methods from the literature related to facial landmark detection and body pose estimation and (2) six general methods for landmark localization and based on the concept of heat-maps. Our method, referred to as Flat-net, is inspired by these methods and presented at the end.
Methods from facial landmark detection and body pose estimation. The five methods considered in this section are the dlib, HyperFace, Deep Alignment Network, Shape Fitting by Deep-Regression and Multi-Context Attention Model methods and are presented as follows.
dlib. The algorithm available as part of the dlib library is an implementation of the ensemble of regression trees presented in 2014 by Kazemi and Sullivan 39 . This technique takes advantage of simple features with fast computing capacities, e.g. the pixels' intensity differences, to directly estimate the landmark locations. These locations are subsequently refined with an iterative process made of a cascade of regressors and using gradient boosting. Note that the dlib method is the only method considered in this study not based on DL.
HyperFace. The HyperFace method makes use of an end-to-end DL network for simultaneous face detection, landmark localization, pose estimation and gender recognition 23  In their approach, a heat-map is defined as an image with highest intensity values at the exact locations of all the considered landmarks and decreasing intensities around as a function of the distance to the nearest landmark. The last two layers of each stage are fully connected layers. Shape fitting by deep-regression. The Shape Fitting Deep-Regression (SFD) method has been proposed recently. It combines CNNs with model-based fitting algorithms of the landmark positions such as obtained by Principal Component Analysis (PCA) 41 . The PCA is included in the network using a dedicated layer type. This network is characterized by a fast computing performance and can process until several hundreds of frames per second.
Multi-context attention model. The Multi-Context Attention Model (MCAM) method is an extended version of the DL stacked hourglass networks 42 designed for human pose estimation and body joint localization 43 . It generates heat-maps describing the body joint locations by using multiple resolutions, conditional random fields and an original layer type combining various convolutional layers together. For training, the ground truth heat-maps are generated by 2-D Gaussians centered on the joint locations. The generated heat-maps contain all joint locations together and are further split into partial heat-maps for each body joint by means of an extra spatial classifier. Since the network is designed to localize 16 body joints, two of these networks are necessary in practice in the current study to localize the 21 landmarks.
Heat-map-based methods. Previous methods aim at detecting landmark coordinates from images. One limitation of these approaches come from the different nature of the input and output data. An alternative to overcome this is to consider the landmark coordinates as images via heat-maps in output. One of the pioneering attempt to implement such approach can be attributed to Pfister et al. 32 for landmark coordinate detection for body pose estimation.
In this approach, a single landmark is described as a full image, the heat-map, with a maximal intensity on the landmark location. Several landmarks can be considered on the same heat-map or can be described as several channels of an image (or tensor), leading to the heat-maps in channels (one heat-map per landmark).
Localizing the landmarks on an image consists therefore in generating the associated heat-maps, instead of the vectors of the landmarks' coordinates as for the previous studies. The landmark locations are then straightforwardly derived from the heat-maps as the points with maximal intensity in each channel. This kind of approach  www.nature.com/scientificreports www.nature.com/scientificreports/ appears particularly appropriate for medical image processing 16,17 . Note that this concept of heat-map is related to the concept of heat-maps mentioned in the DAN and MCAM methods. However, in these methods, the heat-maps contain all landmark/joint locations together and are not directly provided as output of the networks.
Practically in using heat-maps in channels, for an input image of size N × N, the network generates L output heat-map images (i.e. tensor H ∈ R N×N×L ) where the l th heat-map (H l ∈ R N×N×1 ) has the maximum intensity value at the location of the estimated l th landmark. The normalized target heat-maps are obtained via a Gaussian hat with a σ (=10 in our experiment) pixels and a maximum of 1 around each landmark location. Figure 3 shows an example of the combination of 3 heat-maps created for 3 landmarks and displayed as the 3 channels of a single RGB image. The predicted location of the l th landmark corresponds to the location of the maximum value in the l th predicted heat-map: Six methods based on the generation of heat-maps are considered in this section. The first five methods make use of the implementation proposed by Payer et al. 16 and for which the code is publicly available. Payer et al. investigate several DL approaches for landmark localization on medical images based on the generation of heat-maps. They implement existing networks such as Downsampling-net, ConvOnly-net, and U-net. They also propose a new architecture called SpatialConfiguration-net and its extension with embedded U-net 16,17 . These five methods show promising results for localizing diverse landmarks on various biomedical image modalities as mentioned in the introduction 16,17 (finger landmarks on X-Ray and MRI, cranium landmarks on lateral cephalogram X-Ray and spine landmarks on Computer Tomography scans). These methods are briefly described below, please refer to the dedicated references 16,17 for further details. The last method presented in this section, the pix2pix's generator network covers our own implementation of the concept of heat-maps into the existing pix2pix network.
Downsampling-net. The Downsampling-net (DS-net) methods alternates convolution and pooling layers in the network. The downsampling approach ensures covering large image areas with small kernel sizes but poor accuracy in localization has to be expected due to the associated low resolution of the resulting heat-maps 16 . This network with its default hyper-parameters did not converge in our experiment.
ConvOnly-net. The ConvOnly-net (conv-net) method aims at overcoming the low resolution of the generated heat-maps mentioned above by discarding all pooling and strided convolution layers 16 . This network results in six convolution layers.
U-net. The U-net 44 is a well-known architecture for medical image processing, more specifically for segmentation. The U-net includes convolutional layers with skip connections and pooling layers in an auto-encoder architecture, making efficient the processing of large images even with small kernel size. The implementation proposed by Payer et al. 16 and used in this study is a slightly modified version, where for instance the maximum pooling has been replaced by an average pooling.
SpatialConfiguration-net. The SpatialConfiguration-net (SCN) method combines interestingly the local appearance of the landmarks with their spatial locations in reference to all other landmarks. It consists of three blocks. The first block is made of three convolutional layers with small kernel sizes, resulting in local appearance heat-maps. The second and third blocks refine the outputs of the first block by considering the spatial configurations and by combining them with the initial outputs to discard the unrealistic results 16 .
SpatialConfiguration-net with embedded U-net. The SpatialConfiguration-net with embedded U-net (SCN(U-net)) method is an extension of the SCN method which embeds a U-net network into the local appearance extraction block 16,17 .
pix2pix's generator network. This architecture exploits the generator component of the pix2pix network 45 , referred to in the following as p2p-GN, standing for pix2pix's generator network. It is based on a hourglass-shaped CNN with skip connections. This network is specifically designed to analyze and generate images, hence particularly adapted in our case for the generation of heat-maps from MRI. It has already proved to be very efficient for implementation and evaluation. The 9 × 62 input grayscale images are converted into grayscale RGB by simple channel repetition to comply with the input format of the networks. Considering the relatively limited number of data, the dataset is augmented 47 via 10 different arbitrary methods as follows: (1) Addition of a Gaussian intensity noise of mean equal to 0 and variance to 12.75, (2) Blurring with a Gaussian of variance σ = 5 pixels, (3-4) Rotation of +10 and −5 degrees, (5-6) Translation of (+30, +10) pixels and (+40, −10) pixels, (7) Rotation of −5 degrees followed translation of (+30, +10) pixels, (8) Zooming out of scale 0.8, (9) Translation of (+30, +10) pixels followed by zooming in of scale 1.2, and (10) Translation of (+40, +20) pixels followed by zooming out of scale 0.9 plus blurring with a Gaussian of variance σ = 3 pixels. By this method, the dataset is artificially augmented from 9 × 62 = 558 to 11 × 9 × 62 = 6138 images.
The errors of prediction of the landmark coordinates are evaluated by means of Euclidean distance and Root Mean Squared Error (RMSE). The Euclidean distances are expressed in pixels to comply with the existing results in the literature of the domain while the RMSE is expressed in centimeters to provide comprehensive and interpretable results for speech analyses purposes. In addition, the percentage of samples (i.e. test images) with landmarks presenting distance errors higher than 5 pixels (i.e. outliers), are reported. Moreover, the statistical significance of the difference between the means of the distance errors obtained for our methods on the one hand and all the other methods on the other hand is evaluated by means of paired t-test.
For the l th landmark, if x y ( , ) g g l and x y ( , ) p p l denote respectively the ground-truth and the predicted coordinates, the Euclidean distance d l is calculated as follows: Using similar notations, the RMSE is calculated as follows: where Q is the number of considered elements, e.g. all the landmarks of all images, and q is the corresponding index.
Most of the landmark localization methods are evaluated in the literature by a hold-out scheme, i.e. by splitting the data into train/test sets, for instance 21,997/1,000 for the HyperFace method 23 , 2,000/330 for the dlib method 39 or 3148/600 for the DAN method 24 . In the current application, the ultimate objective is however to localize landmarks on new speakers, i.e. on speakers where no data were available before. To comply both with the literature benchmarks and the specificity of the study, the performances are evaluated via two schemes: (1) the randomized 10-fold cross-validation (CV) and (2) the leave-one-subject-out cross-validation (LoSo). Note in addition that 5% of the training data in each training session are randomly set aside in advance for validation purposes and tracking the learning curves. The learning would stop when the loss curve of the validation set reaches plateau.
In the CV scheme, the augmented dataset is randomly split into 10 groups, 9 being used for training purposes, i.e. 5,525 samples (after rounding), and 1 for test purposes, i.e. 613 samples (after rounding). The training and evaluation is then repeated 10 times until each single group has been used as the test set. Each sample is therefore being used as a test sample at some point during the process. Note that in this evaluation scheme, the train and test sets may share data of same subjects and/or of same articulations, making the two sets not completely independent. However, in accordance with the literature in the domain, a very large dataset with many more subjects and articulations could be perfectly evaluated through this scheme and is therefore considered in this study. Above all, this scheme is considered in our evaluation to assess the validity of the hyper-parameters reported by the methods for our environment.
In the LoSo scheme, the 62 images of one arbitrary subject are set aside to serve as the test set. The remaining images are then augmented, leading to 8 × 62 × 11 = 5, 456 images, and used for the training. In other words, the network is not trained with data from the test subject. The training and evaluation is then repeated 9 times until each subject has been used as the test set. Each sample is therefore being used as a test sample at some point during the process on a model trained on the other subjects. This scheme is much stricter and challenging that the CV scheme as the trained network does not contain any information regarding the tested subject. Note however that the train and test sets may still share data of same articulations (but not speakers). This point will be revisited in the discussion.
The results for the two evaluation schemes are presented in the section Results. All of the hyper-parameters of the methods taken from the literature are set to their default values mentioned in their corresponding studies. The training machine was made of an Intel Xeon w-2145 (3.70 GHz) CPU and a NVIDIA Tesla P100-SXM2-16GB GPU. Except for dlib, all the methods are trained on GPU. All the implementation are available online on GitHub (https://github.com/mohaEs and https://github.com/christianpayer/MedicalDataAugmentationTool).

Results
As noted in the section Methods, one of the 12 methods considered in this study, the DS-net, did not converge. For this reason, unless explicitly mentioned, the results cover only the remaining 11 methods. An overall comparison of the performances of the eleven methods are provided in Fig. 5 for both the CV and the LoSo schemes. It displays in box plots the Euclidean distances between the predicted and true landmark locations. For each box, the central mark indicates the median while the bottom and top edges indicate respectively the 25 th and 75 th percentiles. The whiskers extend to the most extreme data points not considered as outliers, the outliers being plotted individually using the '+' symbol.
Regarding the CV scheme, all the methods show good accuracy, with boxes below 2.5 pixels, except the MCAM method. This might be ascribed to the design of the method, optimized for joint localization and not anatomical landmarks as in the current case. Nevertheless, the method still shows decent results, with 75% of the distance errors being smaller than 5 pixels. On the other side, the dlib method presents the best results, possibly due to its boosting approach.
By attempting to predict landmarks on a subject not used to train the models, the LoSo evaluation scheme is more constraining and presents logically deteriorated -but more pertinent -results in comparison to the CV evaluation scheme. The results for the HyperFace, DAN, SFD and MCAM methods appear in particular significantly deteriorated. On the contrary, the deterioration appears more limited for the other methods and still lead to fairly good accuracy, with boxes remaining below 3.5 pixels.
The results in terms of RMSE and per landmark are provided in Fig. 6. It confirms the lower accuracy already noted for the LoSo scheme in comparison to the CV scheme. It also shows that the four methods HyperFace, DAN, SFD and MCAM estimate in the LoSo scheme many landmarks for more than 50% of the images with an error larger than 0.5 cm (5 pixels), rather problematic for speech production studies. The conv-net, U-net, SCN and SCN(U-net) methods display decent results in general but present serious errors on the PL and ET landmarks. On the contrary, the three methods dlib, p2p-GN and Flat-net still show acceptable results, with almost all landmarks for more than 70% of the images having an error lower than 0.5 cm. For these methods, the landmark EG is the most challenging one. It means that our method, the Flat-net, performs at least as good as the two best methods taken from the literature, namely the dlib and p2p-GN methods. They provide fairly good results in the Scientific RepoRtS | (2020) 10:1468 | https://doi.org/10.1038/s41598-020-58103-6 www.nature.com/scientificreports www.nature.com/scientificreports/ LoSo scheme despite the limited dataset. The other methods do not appear suitable to handle satisfactorily the data and problem presented in this study. A larger dataset with significantly more subjects and articulations might solve this issue.
A summary of the results and characteristics of the twelve methods of the study is reported in Table 2. Note that the storage space is reported for information but does not play in our eyes a critical role. Moreover, it depends on the data format and of the extent of meta-data stored. Also, the training time depends on the learning rate, the type of optimizer and the deep learning framework (Pytorh vs. Tensorflow, etc.). In overall, the best results in LoSo scheme are reported for our method, the Flat-net method, with a RMSE of 0.36 cm, slightly better than for the dlib (0.39) and p2p-GN (0.41) methods. Among all methods, the DL-based methods require significantly more training time and storage space. The p2p-GN method in particular has the largest number of weights to train and requires the largest space for storage. The dlib method is on the contrary rapidly trained, even on CPU, and requires the smallest space for storage.
The results of the t-tests conducted between the distance errors obtained by our method, the Flat-net method, also presenting interestingly the best performance, and the errors obtained by the other methods are presented in Table 3. It confirms a statistical significance between the means of the distance errors, all of them obtained with p = 0.000 except p < 0.04 for the SCN method.
The overall RMSEs per subject in LoSo scheme for the eleven methods considered in the results are displayed in Fig. 7. The p2p-GN and Flat-net methods tend to show more homogeneous and lower errors across subjects, suggesting that they are the two more accurate and robust methods to predict the landmarks for new and unseen subjects. Finally, some examples of practical results for the three best and the two worst methods are illustrated in Fig. 8.

Discussion
The present paper described an original method to localize anatomical landmarks of the vocal tract area on MRI images and compared its performance to 11 state-of-the-art methods of the literature. A dataset of midsagittal MRI from 9 speakers sustaining 62 articulations and annotated with the location of 21 landmarks has been considered. The methods have been evaluated through two schemes, a randomized 10-fold scheme and a leave-one-speaker-out scheme, considered as more challenging. Experimental results show the ability of all methods to cope with the problem in the 10-fold scheme but divergence of performance appear in the more challenging leave-one-speaker-out scheme. In general, our method, the Flat-net method, outperforms the other methods to solve the specific problem of the study and is only approached by two other methods, namely dlib and p2p-GN. Interestingly, these two methods include the only method not based on DL (dlib) as well as the method that has been adapted the most to fit at best our problem (p2p-GN). Note also that the heat-map-based methods present in general significantly better results than the other methods, supporting this approach to tackle the problem of landmark localization on medical images. This approach leads to networks without fully connected layers, usually www.nature.com/scientificreports www.nature.com/scientificreports/ used to transform the feature maps into vectors of landmark locations in output. This may result in an architecture possibly less prone to error propagation, especially for such a limited dataset. Indeed, although heat-maps are also somehow part of the DAN and MCAM methods, the design in channels together with the use of adapted networks able to output directly these heat-maps proved successful for all the other methods.
The good performance of the dlib method might be ascribed to the use of the boosting approach combined with the analysis of the input image in small regions by applying windows, possibly reducing the sensitivity to the variability of other regions of the image. In overall, our Flat-net method, the only method entirely developed in this study, tends to be more robust and to present better performances. One could object that the better performance of the Flat-net network may simply come from its hyper-parameters optimized for the current problem. The results obtained in the CV evaluation scheme suggest however that all eleven methods considered in the results -except the MCAM method to a certain extent -can perform the task with success, discarding this objection. These results support the approach using heat-maps in channels and networks without fully connects layers to localize landmarks of the vocal tract area on MRI data.    Table 3. p-values of paired t-tests for distance errors between the Flat-net and the remaining 10 methods considered in the results for the CV and LoSo schemes.
Regarding the localization of the landmarks, the overall RMSEs in LoSo scheme for the three best methods identified above and for a subset of 13 out of 21 landmarks are displayed in Fig. 9: the 11 landmarks mentioned in the section Methods plus the landmarks N on the one hand and NM and EG on the other hand, presenting respectively the best and worst results. Note that these results are a zoom of the results presented in Fig. 6a. All these landmarks are more accurately located by the Flat-net method, occasionally at the same precision than other methods. Except for NM, PL and EG, the RMSEs stay below 0.35 cm, a fairly good result considering the challenge associated with these landmarks and comparable to the overall RMSE achieved by the method. The landmarks UT and LT in particular, not visible on MRI and giving many problems in articulatory speech studies 48 , are estimated with a respective accuracy of 0.25 cm and 0.3 cm for new speakers without additional a priori information. Similarly,  It should also be noted that some landmarks do not exhibit very salient characteristics, such as the junctions between the wet and dry vermillion of the lips (ULV and LLV) or the sublingual cavity posterior point (TS) when the tip of the tongue is in a low position. Similarly, some regions tend to be hardly interpretable in terms of anatomy, such the anterior part of the larynx associated with the landmark EG. Annotating manually their exact location on the images was a challenging task at the first point, questioning the quality of the ground-truth data and possibly explaining the lower accuracy achieved for EG. Furthermore, a detailed analysis of the results such as presented in Fig. 8 reveals that the location of some landmarks may appear occasionally more accurate in output of the presented methods than in the original so-called ground-truth. This is a well known effect of machine learning methods, and DL methods in particular, which tend to avoid encoding noise and outliers by means of regularization techniques 50,51 . In general, a larger dataset labelled by several experts may limit the impact of the uncertainties in the ground-truth data and reinforce the robustness the DL methods and their resistance to noise.
The methods have been evaluated by means of two schemes, the CV and LoSo schemes. Strictly speaking, the most rigorous scheme would have been to leave both subject and articulation out, i.e. leave-one-subject-one-class-out (LoSoCo), to ensure that the network does not contain any information regarding the new tested image. In this scheme, all the articulations of one subject and one specific articulation for all subjects would be discarded in the training and the same specific articulation of the left subject would be tested. This would lead to 62 × 9 = 558 training sessions for each of the twelve methods. According to the times reported in Table 2, it would take more than a year, making this evaluation unrealistic in practice. However, the challenge of the problem lies rather in the estimation of the landmark locations for a new speaker than for a new articulation. Indeed, the corpus of 62 articulations can be considered large enough and representative of the French phonemic repertoire so that one articulation could fairly well be estimated from the 61 others 3 . For this reason, the LoSo scheme appears as a valid approximation to evaluate the methods on our problem.
In summary, the method proposed in this study (Flat-net) outperforms the state-of-the-art methods. It supports the description of landmarks locations in terms of heat-maps in channels and the generation of these heat-maps by means of DL networks without fully connected layers for such a variable and limited dataset. Future works may include the combination of successful features from the dlib method, showing very promising results, with DL approaches to create more robust methods, such as for instance the methods using deep forest networks 52,53 . In addition, since the accuracy of the heat-map-based methods for all landmarks are not same, it seems reasonable to combine them to form so-called machine learning ensemble methods. For example, the p2p-GN and Flat-net methods present good accuracy for the landmarks ET, PL, TE and TS but not for the landmarks EG and NPX while on the contrary the SCN method presents good accuracy for the landmarks EG and NPX but not for the landmarks ET, PL, TE and TS. Combining these methods to take advantage of their relative assets may therefore lead to higher accuracy in the localization of vocal tract area landmarks from MRI data together with more robustness. Furthermore, considering the recent rise of real-time MRI for speech production studies 12 , the next steps will be to adapt this technique to real-time MRI data.

Data availability
The source codes of the methods are available at https://github.com/mohaEs. The trained models are available from the author A.S. on reasonable request. Authors have not a permission to share the data.