Improving gait classification in horses by using inertial measurement unit (IMU) generated data and machine learning

For centuries humans have been fascinated by the natural beauty of horses in motion and their different gaits. Gait classification (GC) is commonly performed through visual assessment, and reliable, automated methods for real-time objective GC in horses are warranted. In this study, we used a full-body network of wireless, high sampling-rate sensors combined with machine learning to classify gait fully automatically. Using data from 120 horses of four different domestic breeds, equipped with seven motion sensors, we included 7576 strides from eight different gaits. GC models were trained using several machine-learning approaches, both from feature-extracted data and from raw sensor data. Our best GC model achieved 97% accuracy. Our technique facilitated accurate GC that enables in-depth biomechanical studies and allows for highly accurate phenotyping of gait for genetic research and breeding. Our approach lends itself to potential use in other quadrupedal species without the need for developing gait/animal-specific algorithms.


Scientific Reports | (2020) 10:17785 | https://doi.org/10.1038/s41598-020-73215-9 | www.nature.com/scientificreports/

[...] apparatus of quadrupedal animals. More recently, multidimensional approaches have been used 6,7 , challenging the old dogma. The introduction of sensor technology in motion studies allows easy collection of large amounts of high-resolution, high-sample-rate data 8 that can be used to train models for gait classification. Here, we used sensor-based data to investigate the accuracy of different classification models based on machine learning technology. We have focused on two main methodologies to train classification models. One approach used a previously described algorithm 9 for feature extraction by calculating locomotion parameters from limb-mounted IMU sensors. Using this approach, several models were trained, demonstrating that the most important feature for proper gait classification in this approach is the (complex) interlimb relation. Application of this technique largely confirmed Hildebrand's theory, but also resulted in more accurate gait classification than the original approach, allowing for a refinement of the concept. Further, we have shown that a deep learning approach on raw IMU sensor data (i.e. not based on feature extraction) using a long short-term memory (LSTM) network can also be used to achieve high accuracy in gait classification. This indicates that the time-consuming task of generating animal-specific and gait-specific algorithms can be overcome and opens wide perspectives for the application of this approach in other animal species that are much less researched than the horse.
In this study we aimed at describing a method for accurate and fully automated gait classification in horses using data containing a unique number of gait varieties. We hypothesized that accurate gait classification could only be achieved using higher dimensional models. We further hypothesized that it would be possible to use deep learning techniques not requiring feature extraction, which are hence directly applicable to gait studies in other, less researched species, with similar accuracy.

Results
The footfall pattern and the sequence of footfalls can be defined for each gait (Fig. 1A). Some specific features of the gaits can easily be identified, such as symmetry and laterality. However, for some gaits such as walk, tölt and paso fino, these variables do not fully discriminate between the gait classes. Similarly, other discriminating features, such as the stride temporal variables (Fig. 1B, Table 1), can be differentiating enough for some gaits, for example stance duration for the walk (0.65 ± 0.12 s) and the trot (0.28 ± 0.05 s), but in other gaits some of these features overlap, such as stance duration for paso fino (0.21 ± 0.01 s) and trocha (0.2 ± 0.03 s). This indicates that multidimensional classification models are required for the comprehensive classification of all gaits.
Some features are characteristic for specific breeds (Fig. 1C), although some of these differences might also, to some extent, be attributed to conformation and speed. We have made an overlay of our data with data generated by the original classification formula for symmetrical gaits of Hildebrand (Fig. 1D). Although each of our measured gaits falls grossly within the previously described regions, it is evident that the reality is more complex: the overlap is not perfect and the spectra within each gait are broader and less distinct than depicted by the original two-dimensional scheme. Further, the grouping of the different gaits on the 2D plot, such as between pace and tölt, is not clear from the original drawing; these two gaits appear to have a large region of overlap.
Gait classification based on feature extracted models. For all the different methods applied, the highest accuracy for classification was obtained when all variables were used, achieving a classification accuracy of 96.7% using a fully connected (FC) artificial neural network followed by 96.3% using the support vector machine model (Table 2). If gait classification was based only on stride variables (e.g., stride duration and duty factor), poor classification accuracy was achieved. With the classification based on the two variables of Hildebrand (duty factor and lateral advanced placement), GC achieved a slightly higher accuracy, peaking at 78.7% using a decision tree. The highest confusion between classes was observed between the gaits trot and trocha for all classification models, followed by the confusion between pace and tölt (Fig. 2). Removing the trocha from the models increased the final accuracy of the best performing FC model to 98.6%.
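To make the two Hildebrand variables mentioned above concrete, the following minimal Python sketch computes duty factor and lateral advanced placement for one stride. It is illustrative only and is not the authors' algorithm: the function name, the argument names and the 0.70 s stride duration are hypothetical, and the definitions used are the conventional ones (stance duration divided by stride duration, and the delay between hind and ipsilateral fore hoof-on divided by stride duration).

```python
def hildebrand_variables(hind_on, hind_off, fore_on, stride_duration):
    """Return (duty_factor, lateral_advanced_placement) as fractions of the stride.

    hind_on, hind_off: hoof-on and hoof-off times (s) of one hind limb.
    fore_on: hoof-on time (s) of the forelimb on the same side.
    stride_duration: duration (s) of the stride.
    """
    # Duty factor: fraction of the stride the hind limb spends in stance.
    duty_factor = (hind_off - hind_on) / stride_duration
    # Lateral advanced placement: delay of the ipsilateral fore hoof-on after
    # the hind hoof-on, wrapped into [0, 1) so that a fore hoof-on shortly
    # before the hind one counts as late in the previous cycle.
    lateral_advanced_placement = ((fore_on - hind_on) / stride_duration) % 1.0
    return duty_factor, lateral_advanced_placement


# Hypothetical trot-like stride: 0.28 s stance (the mean trot value reported
# above), an assumed 0.70 s stride, fore hoof-on half a stride after the hind.
df, lap = hildebrand_variables(0.0, 0.28, 0.35, 0.70)
print(round(df, 2), round(lap, 2))
```

A classifier restricted to these two numbers sees each stride as a single point in a plane, which is consistent with the lower accuracy reported for the two-variable models.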
Gait classification based on raw IMU data and LSTMs. Classification using LSTMs on the raw normalized sensor data achieved a high classification accuracy, peaking at 95.5% (Table 3). A longer window length had a negative effect, especially when fewer sensors were used (Table 3). Using bidirectional vs unidirectional LSTMs did not affect the general accuracy of each model, although the highest accuracy was achieved with a bidirectional LSTM model.
Gait classification based on a single sensor yielded poor accuracies, peaking at only 79.9%. Training based on sensors mounted on the upper body of horses, namely head, withers and pelvis, yielded significantly higher accuracies (92.3%), and adding one limb sensor pushed the accuracies only slightly higher (93.3%), achieving similar accuracies to the models relying solely on all four limbs (92.7%). The highest accuracy was observed when the network was trained with the data from all available IMUs, that is, head, withers, pelvis and all four limbs (95.5%). For the best performing models, confusion was highest between the classes trot and trocha, in line with our results from the gait classification models based on feature extraction. Excluding trocha from the data set yielded a classification accuracy of 98.9%.

Discussion
In this study we have demonstrated that accurate gait classification in horses can be achieved using state-of-the-art body-mounted sensor technology in combination with multiple machine learning data analysis approaches. Through this technology we were able to extend Hildebrand's original equine gait paradigm from 1965 5 , showing that reality is more complex and ambiguous, and less straightforward than the original concept, as shown in Fig. 1, and that gaits are in fact separated by multidimensional planes and that accurate classification can be achieved for this uniquely diverse gait data using automated approaches that include minimal preprocessing of the signal. The human eye has thus far served as the 'gold standard' for gait classification. It is clear from the current study, however, that human visual and subjective assessment is not optimal for this purpose. This observation is in line with other studies evaluating human assessment of equine locomotion, mainly in relation to the evaluation of lameness in clinical situations. There too, human subjective assessment proved suboptimal, as it was affected by both the temporal limitations of the human eye 10 and the proneness to bias 11 .
The models used in this study open a new world of possibilities, for example for research into the genetics of gait. Most equine genetic studies focusing on locomotion, either related to gait 3,12 or sports performance 13 , require precise phenotyping in order to discriminate between trends in populations or sub-populations. Gait phenotyping is still performed subjectively in most of these studies and is thus much less accurate than desirable; we therefore believe that our more accurate methods will allow forthcoming studies to understand the genotype-phenotype association of gaits in greater detail.
Our models using raw sensor data (i.e., LSTMs) achieved a slightly lower accuracy than the feature-extracted models. Nevertheless, the difference was marginal and there is great advantage in using models based on raw sensor data. It is extremely challenging and time-consuming to develop specific algorithms for feature extraction 14–16 . These algorithms require validation, and they risk being gait-, surface- and, ultimately, breed-specific. Pre-selecting variables also brings the risk of missing information in the data that can be useful for complex classification tasks. When using raw sensor data, the models can be applied to any gait, horse breed and surface, provided that enough labeled training and validation data exist for the development of such models. Hence, this approach is far more widely applicable and opens new possibilities for the study of all gait spectra, not only in the horse, but also in other quadrupedal species.
Window length has a significant effect on the accuracy of our models, and we see a decrease of accuracy with increasing window length (Table 3). We hypothesize that this is related to the fact that segmenting the data into shorter windows results in a larger number of samples that are used as input for model training. Also, longer windows might include more data points where transitions of gait or incorrect strides (e.g. stumbling) occur, and this will ultimately influence the overall accuracy of correctly classifying the entire segment. In theory, it should be possible for the network to learn from longer windows, but we suspect that this would require a larger number of longer samples. For the longer windows, one input sample could contain multiple strides, due to the cyclical nature of the gait data. It is possible that the network learned to disregard the repeated strides of one window if these did not immediately give more information about the coarse-grained class. This way, features capturing the more subtle variation within the strides of one gait might have been lost.
Despite the large influence speed has on temporal variables, such as step duration 17 , our models were able to achieve a high accuracy without strict control of speed. Hence, we hypothesize that speed might not be a crucial parameter for gait classification. It is, however, questionable whether the speed range for each gait in this study covered the actual variability within each breed. Retraining our models with more data at different speed ranges should address this in the future.
We have found a high degree of confusion between trot and trocha in our study. This may of course be caused by mislabeling of some of the horses used in the training and validation groups. A recent study described the trocha as often being less 'clean' in terms of footfall timing, possibly related to genetic profiles 12 . Another important issue is the close relation between these two gaits (Fig. 1D). Inclusion of more variables in our models might have allowed for a better separation from the trot, but close observation of the distribution of trocha versus trot classifications in Fig. 1D also raises the question of whether what is called trocha is simply part of the spectrum of trot, but with a high stride frequency. This warrants further research.
One of the main limitations of the current study is the narrow band of horse breeds used. However, the breeds included in this study were selected to exhibit a variable spectrum of different gaits, in fact increasing the variation in our population. Overcoming this limitation will be a matter of time, however, because the methods described in this study are adaptive; collection of more data, in other breeds or even in different species, will lead to better trained models and improved generalization. Future exploration of the machine learning models' decision process could lead to invaluable insights into locomotor steering.

Methods
Data set. Data were collected between 2016 and 2019 using seven IMU sensors (Promove-mini, Inertia Technology, The Netherlands) (Fig. 3A). Sensors were attached to the poll, withers and pelvis of all horses, and set to a sampling frequency between 200 and 500 Hz, low-acceleration range of ± 8 g, high-acceleration range of ± 100 g and angular velocity range of 2000 deg/s. Each limb was also equipped with an IMU sensor, attached to the lateral aspect of the metacarpal/metatarsal bone, and set to a sampling frequency between 200 and 500 Hz, low-acceleration range of ± 16 g, high-acceleration range of ± 200 g and angular velocity range of 2000 deg/s. Synchronization between sensors, initial data processing and limb stride parameter calculation were performed as previously described 9,18 .
Data sets (Table 4) were collected for different research purposes, such as studying objective motion analysis methodology in sound speed-dependent motion patterns in warmblood riding horses and Franches-Montagnes horses and studying gaits and phenotype-genotype associations in gaited horse breeds (Icelandic horses and Colombian horse breeds).
For each data set (Table 4) [...] participants were included in this study. Informed consent for publication has been obtained from the rider in Fig. 3A. All data included in the training and validation of our study were from horses whose athletic performance was normal and that were, to the best of the owners'/trainers' knowledge, not lame.
Labeling of the data. For the data sets of the Icelandic horses each measurement was synchronized with a video camera, since each measurement contains several segments of different gaits (walk, trot, pace, tölt and canter). This video was evaluated by a domain expert in the gaits of the Icelandic horse (VG), who selected the segments of data that should be used for training and validation. For the Colombian criollo horses, the segments of data used for the analysis were selected based on visual inspection of the footfall pattern during live observation of the trials by an expert in locomotion of this horse breed (MN). For the remaining trials, selection of the segment of each gait was performed by live observation by an expert in equine biomechanics (FSB).

Preparation of the dataset. Based on the labeled segments of data used for training, a data set was generated. The data set consists of two main parts (Fig. 3B): (1) features extracted from the raw IMU data, consisting of stride parameters (Table 5) calculated based on a previously described algorithm 9 , resulting in 7576 strides; (2) segments of the raw IMU data, prepared for the analysis using the LSTMs. Each segment was further cropped into subsections of one, two or three seconds of IMU data. All data were resampled to 200 Hz to match the temporal resolution among all used data sets. A total of 5344 s of raw IMU data were used.
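The cropping step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' Matlab code: the function name `crop_windows` is hypothetical, windows are assumed to be non-overlapping, any incomplete trailing window is assumed to be discarded, and the segment is assumed to be already resampled to 200 Hz.

```python
def crop_windows(segment, window_s, fs=200):
    """Split one labeled single-gait segment into non-overlapping windows.

    segment: list of samples, each sample holding all sensor channels.
    window_s: window length in seconds (1, 2 or 3 in the study).
    fs: sampling frequency in Hz after resampling.
    """
    size = int(window_s * fs)
    n_windows = len(segment) // size  # assumption: drop the incomplete tail
    return [segment[i * size:(i + 1) * size] for i in range(n_windows)]


# Hypothetical 10 s segment at 200 Hz with six placeholder channels per sample.
segment = [[0.0] * 6 for _ in range(10 * 200)]
counts = {w: len(crop_windows(segment, w)) for w in (1, 2, 3)}
print(counts)  # {1: 10, 2: 5, 3: 3}
```

Under these assumptions the same recording yields roughly two to three times fewer training samples at window lengths of two and three seconds than at one second.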
Data analysis. Data processing, analysis and model training were performed in Matlab 2018b (MathWorks, Natick, Massachusetts, USA). Seven supervised machine learning methods were applied to the gait classification task: linear and quadratic discriminant analysis (LDA and QDA), decision trees, random forest, support vector machine (SVM), a one-layer fully connected (FC) neural network (Fig. 3C2) and a long short-term memory (LSTM) neural network (Fig. 3C1). With an SVM, as well as with LDA and QDA, we try to learn the decision boundaries that will maximally separate the different classes of our classification problem. In LDA and QDA, we model the data as Gaussian distributed. While in LDA models all classes have the same covariance matrix, QDA has a separate covariance matrix for each class and can thus model more complex decision boundaries. Decision trees are a non-parametric method where the model is trained to split the data according to the most distinguishing features for the different classes. A random forest is an ensemble of decision trees. FC and LSTM are artificial neural network methods, highly parametric as such, that are trained to approximate the function mapping between the input data (raw sensor data or features extracted from sensor data) and the gait class. The FC model was composed of an input layer of extracted features (Table 5), connected to a hidden layer of 40 neurons, connected to an output layer with one unit for each of the output gait classes (walk, trot, left canter, right canter, tölt, pace, trocha and paso fino). The LSTM model was built with an input layer consisting of a sequence of 1, 2 or 3 s of IMU data, connected to two LSTM layers with a width of 500, followed by an FC layer, a softmax layer and a classification layer with one output for each of the gait classes (walk, trot, left canter, right canter, tölt, pace, trocha and paso fino).
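As an illustration of the FC architecture described above (input features, one hidden layer of 40 neurons, eight output classes), the following Python sketch runs a single forward pass with random, untrained weights. It is not the authors' Matlab model: the number of input features (12), the tanh activation, the softmax output and the weight initialization are all assumptions, since the text does not specify them.

```python
import math
import random

GAITS = ["walk", "trot", "left canter", "right canter",
         "tölt", "pace", "trocha", "paso fino"]


def dense(x, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (assumed initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fc_forward(features, hidden, output):
    """Features -> 40 hidden units (tanh, assumed) -> class probabilities."""
    h = [math.tanh(v) for v in dense(features, *hidden)]
    return softmax(dense(h, *output))


rng = random.Random(0)
n_features = 12  # hypothetical; the real count is defined by Table 5
hidden = init_layer(40, n_features, rng)
output = init_layer(len(GAITS), 40, rng)
probs = fc_forward([rng.random() for _ in range(n_features)], hidden, output)
predicted = GAITS[probs.index(max(probs))]
```

Because the weights are untrained, `predicted` is arbitrary here; the sketch only shows how a feature vector for one stride maps to one probability per gait class.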
For the LSTM, the gyroscope and accelerometer data were normalized between 0 and 1 to ensure that the network would learn the specific gait pattern, since we had observed gait-specific characteristics in the magnitude of the signals, for example higher peak accelerations at trot than at walk. In addition, samples from gait classes with fewer data were duplicated in the data set to remove any class imbalance prior to training. Training was performed on a single NVIDIA Tesla K80 GPU with 4992 CUDA cores.
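The two preprocessing steps described above can be sketched as follows. This is a hedged Python illustration rather than the authors' implementation: both function names are hypothetical, the normalization is assumed to be applied per channel within each window (the text does not state the scope), and the duplicates are assumed to be drawn at random until every class matches the largest one.

```python
import random


def min_max_normalize(channel):
    """Scale one signal channel to the range [0, 1]."""
    lo, hi = min(channel), max(channel)
    if hi == lo:  # constant channel: avoid division by zero
        return [0.0 for _ in channel]
    return [(v - lo) / (hi - lo) for v in channel]


def oversample_by_duplication(samples, labels, seed=0):
    """Duplicate samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Random duplicates fill the gap to the largest class (assumption).
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


normalized = min_max_normalize([-2.0, 0.0, 6.0])
print(normalized)  # [0.0, 0.25, 1.0]

# Hypothetical window identifiers: three walk windows, one trocha window.
samples, labels = oversample_by_duplication(
    ["w1", "w2", "w3", "t1"], ["walk", "walk", "walk", "trocha"])
print(labels.count("walk"), labels.count("trocha"))  # 3 3
```

One design consideration: if duplication is performed before the data are split, copies of the same window can land on both sides of the split, so a cautious implementation would apply it to the training portion only.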
The entire data set was randomly divided into two sub-data sets, one used for training and validation and one for testing. We ensured that strides of the same horse were never used for both training and testing, with the goal of avoiding overfitting. Each model was cross-validated using 5 folds, and the results presented in Tables 2 and 3 are the mean accuracy and standard deviation of the 5 folds. Based on the best mean validation accuracy, one feature extraction model and one raw data model were selected for testing; the results are presented in Fig. 2A and B.
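A horse-wise partition of the kind described above can be sketched in Python as follows. The function name `horse_grouped_folds` and the horse identifiers are hypothetical, and whether the authors also grouped by horse inside the 5-fold cross-validation is not stated; the sketch simply assigns every horse, with all of its strides, to exactly one fold.

```python
import random


def horse_grouped_folds(horse_ids, n_folds=5, seed=0):
    """Return n_folds lists of sample indices; each horse appears in one fold only.

    horse_ids: one horse identifier per sample (stride or window).
    """
    horses = sorted(set(horse_ids))
    random.Random(seed).shuffle(horses)
    # Deal the shuffled horses out to the folds in turn.
    fold_of_horse = {horse: i % n_folds for i, horse in enumerate(horses)}
    folds = [[] for _ in range(n_folds)]
    for index, horse in enumerate(horse_ids):
        folds[fold_of_horse[horse]].append(index)
    return folds


# Hypothetical data: 50 samples from 10 horses.
horse_ids = [f"horse{i % 10}" for i in range(50)]
folds = horse_grouped_folds(horse_ids)

# Each fold serves once as validation set; the other folds form the training set.
for k, validation_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    assert not ({horse_ids[i] for i in train_idx}
                & {horse_ids[i] for i in validation_idx})
```

Splitting by horse rather than by stride keeps highly similar strides of one animal from appearing on both sides of a split, which would otherwise inflate the measured accuracy.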