Digitally-enhanced dog behavioral testing

Behavioral traits in dogs are assessed for a wide range of purposes, such as determining selection for breeding, chance of being adopted, or prediction of working aptitude. Most methods for assessing behavioral traits are questionnaire- or observation-based, requiring significant amounts of time, effort and expertise. In addition, these methods may also be susceptible to subjectivity and bias, negatively impacting their reliability. In this study, we propose an automated computational approach that may provide a more objective, robust and resource-efficient alternative to current solutions. Using part of a ‘Stranger Test’ protocol, we tested n = 53 dogs for their response to the presence and neutral actions of a stranger. Dog coping styles were scored by three dog behavior experts. Moreover, data were collected from their owners/trainers using the Canine Behavioral Assessment and Research Questionnaire (C-BARQ). An unsupervised clustering of the dogs’ trajectories revealed two main clusters showing a significant difference in the stranger-directed fear C-BARQ category, as well as a good separation between (sufficiently) relaxed dogs and dogs with excessive behaviors towards strangers based on expert scoring. Based on the clustering, we obtained a machine learning classifier for expert scoring of coping styles towards strangers, which reached an accuracy of 78%. We also obtained a regression model predicting C-BARQ scores with varying performance, the best being Owner-Directed Aggression (with a mean absolute error of 0.108) and Excitability (with a mean squared error of 0.032). This case study demonstrates a novel paradigm of ‘machine-based’ dog behavioral assessment, highlighting the value and great promise of AI in this context.


Introduction
Behavioral traits in animals are consistent patterns of behaviors exhibited in similar situations 1,2. They are driven by personality 3, which is a complex combination of genetic, cognitive, and environmental factors 4. The assessment of personality traits in dogs is gaining increasing attention due to its many practical applications in applied behavior [5][6][7][8]. Some examples of such applications include determining the suitability of working dogs 9, identifying problematic behaviors 10, and adoption-related issues for shelter dogs 2.
Measuring behavioral traits of dogs has been an enigmatic challenge in scientific literature for decades. The methods used can be roughly divided into two types. The first type refers to experimental behavioral tests (e.g., observations of the dog's behavior in a controlled novel situation, such as the Strange Situation Test 11). Brady et al. 12 provide a systematic review of the reliability and validity of behavioral tests that assess behavioral characteristics important in working dogs. Jones and Gosling 13 provide another comprehensive review of past research on canine temperament and personality traits. In a complementary manner, Bray et al. 14 reviewed 33 empirical studies assessing the behavior of working dogs. Tests for detection dogs have also been addressed [15][16][17]. Another common assessment method is a questionnaire completed by the owner or handler. Examples include the Monash Canine Personality Questionnaire 18, the Dog Personality Questionnaire 19, VIDOPET 20 and many more. One of the most well-known questionnaires, used in many contexts, is the Canine Behavioral Assessment and Research Questionnaire (C-BARQ). Originally developed in English 21,22, it has been validated in a number of languages, including Dutch 23.
Although questionnaires can reduce the time and effort required for behavioral testing, they have serious limitations: they are susceptible to subjectivity and misinterpretation, and can be biased by the bond with the animal being assessed. Moreover, an individual with sufficient knowledge of the dog to reliably complete the items is not always available 12, especially in the case of working or shelter dogs.
For instance, in the context of owner-observed assessment of stress, Mariti et al. have argued that "The results show that some owners can help in protecting the welfare of their dogs, but that many owners would benefit from educational efforts to improve their ability to interpret their dogs' behavior." 24. Moreover, Mariti et al. have shown that "most dog owners report having a good understanding of the emotional state of their dogs, when they seem to have a low appreciation of the signals that dogs send in the earlier stages of emotional arousal" 25. Even seemingly clear physical observations, such as obesity in dogs, have been shown to lead to frequent disagreements between owners and veterinarians 26. Furthermore, Rayment et al. 27 criticize the "lack of robust assessment of the validity and reliability of many test protocols currently in use 'on the ground'", referring in particular to the use of psychometric instruments that rely on an unambiguous shared understanding of terminology, which is difficult to achieve in a population with different levels of education, knowledge of animal behavior, (first) languages, etc. Moreover, psychological factors of the human observers influence their evaluation of dogs 28, which further complicates the use of psychometric data from a wide variety of participants as a homogeneous dataset of observations.
The goal of this exploratory study is to investigate a novel idea of a digital enhancement for behavioral testing, which in time may be integrated into relevant interspecies information systems 29 to understand animal behavior. Using as a case study a simple behavioral testing protocol of coping with the presence of a stranger, currently implemented to improve breeding of working dogs in Belgium, we ask the following questions:
• Can a machine identify different 'behavioral profiles' in an objective, 'human-free' way, and how do these profiles relate to the scoring of human experts in this test?
• Can a machine predict the scoring of human experts in this test?
• Can a machine predict C-BARQ factors of the participating dogs?
The findings of our study provide positive answers to these questions. Dog trajectories obtained by automated tracking of dog movements were clustered using an unsupervised k-means clustering algorithm, revealing two distinct clusters corresponding to different human expert scoring categories (neutral vs. excessive behavior towards a stranger). Based on the clustering, we obtain a machine learning classifier that predicts expert scores with over 78% accuracy. We further obtain a regression model that predicts C-BARQ factor scores with varying performance, the best being Owner-Directed Aggression (with a mean absolute error of 0.108) and Excitability (with a mean squared error of 0.032). We discuss the potential applications of the proposed novel paradigm of digitally enhanced canine behavioral testing.

Methods
Ethical statement. All experiments were performed in accordance with relevant guidelines and regulations. The experimental procedures and protocols were reviewed by the Ethical Committees of KU Leuven and University of Haifa, both of which waived ethical approval.

Testing Arena
The test was conducted indoors, in a room free from other distractions such as other animals or people (except the test person, the assistant, and the familiar person/owner). The size of the test arena was 6 × 8 m (length × width) with a height of 3 m. An adjacent, separate room was available where the dog and the owner were received and could wait out of sight of the test arena. The room containing the test arena had a separate door, so that the test person could enter the test arena without having to pass by the dog and owner. In this way, the test person was completely new to the dog until the start of the actual test. The test arena was enclosed with metal fences (height of 0.8 m, length × width of 4.7 by 4.7 m). In the middle of the test arena, a square of 60 by 60 centimeters was drawn with tape for positioning the chair of the test person. The test person faced the front of the fenced arena. This was the side that contained the gate and from where the assistant recorded the test on video. In the left corner (frontal view), there was a chair for the familiar person, positioned parallel to the front fence. A second square (3 × 3 m), centered around the middle, was marked with tape on the floor. These lines indicated the track to be followed when the owner or test person walked in the test arena.
Two video cameras were used to record the activity and the behavior of the dog during the test: a top-view and a side-view camera. For the top view, a GoPro video camera was mounted in the middle of the test arena at a height of approximately 3 m, so that the entire test arena was covered (see Figure 1). A side-view camera was held and operated by the assistant, at a distance between 1.5 and 2 m from the middle of the nearest fence.

Test Procedure
The protocol used below is part of a more elaborate testing protocol developed by one of the authors (JM). Its purpose is to assess the reactions of dogs to the presence of an unfamiliar person (during inactivity or during benign actions), both in the presence and absence of a familiar person (i.e., the owner or a regular handler/trainer). The familiar person was instructed about the test and asked not to interact with the dog, except to unleash or put the leash on the dog. For the purposes of this study, only the exploration phase and the first phase were carried out. The unfamiliar person, i.e., the test person, was always the same adult female (JM). The assistant was also always an adult female, but not always the same person.

Exploration Phase
Prior to the first phase of the behavioral test, all dogs were allowed to explore the test arena in the presence of their owner, but without the presence of the test person. An assistant led the familiar person into the test arena with the dog on a leash. While the assistant moved to the location from where the next phase would be filmed, the familiar person entered the test arena with the dog, closed the gate and unleashed the dog. The familiar person then walked the 3 m × 3 m square counterclockwise 1.75 times, arriving at and sitting on the chair positioned on the side. After three minutes, the familiar person walked to the dog and put on the leash. They both left the test arena and went into the adjacent room.

Testing Phase
After the exploration phase, the test person entered the test arena and sat on the chair in the middle, feet in parallel and firmly planted. The test person held a smartphone as a timer. The assistant called in the familiar person and the dog. They entered the test arena, the familiar person closed the gate, unleashed the dog, walked directly to the chair and sat down. After the familiar person sat down, the test person performed three actions: a short, clear cough (at 10 s), a hand running through the hair for 3 seconds (at 20 s), and crossing the right leg over the left (at 30 s). These are actions that can be expected from any human being and that all dogs will encounter when they are around people. An example trial can be found here. Except when running her hand through her hair or when a dog jumped up, the test person held the smartphone in both hands, resting on her lap. The test person did not look at the dog or perform any actions towards it. If a dog jumped up excitedly, the test person protected her face/head with her hands/arms as needed. In this study we analyze this testing phase.

Study Subjects
A total of n=53 dogs were tested in the study. Their owners were recruited through social media in Belgium. The inclusion criteria for the dogs were:
• Age: between 1 and 2 years old.
• Up-to-date vaccinations and no history of health problems.
• Accompanied by an owner or a familiar person.
• Belonging to the modern dog breeds.

Dog Scoring
The original scoring method, an 11-point scale ranging from -5 to +5, was developed by JM. It is based on an adaptation of the concept of coping with potential threats via freeze/flight versus fight 30. The idea is to differentiate between two main tendencies of dogs when reacting to a stressor (in this context, an unfamiliar person; reactions to the assistant or familiar person were ignored): dogs that tend to 'react towards the stressor' (e.g., get very close to the test person, jump up, chew, show offensive aggression...) receive numerically positive scores, and dogs that tend to 'react away from the stressor' (e.g., keep at a distance, avoid, show defensive aggression...) receive numerically negative scores. Score 0 (neutral) indicates good and stable coping with the stressor.
The original scoring method is currently used by the Belgian assistance dog breeding organization, Purpose Dogs vzw, 1 to improve breeding outcomes. In order to facilitate the scoring by the independent experts, the original scale ranging from -5 to +5 was modified into a five-point scale ranging from -2 to +2 (-5 and -4 were scored as -2; -3 and -2 as -1; -1, 0 and +1 as 0; +2 and +3 as +1; +4 and +5 as +2). Because the 0 category also includes -1 and +1 on the 11-point scale, a score of 0 on the five-point scale means that the dog is sufficiently relaxed in the presence of the test person. In the case of Purpose Dogs vzw, for example, dogs with this score would be suitable for selection for future breeding. The analysis for the purpose of this study was further simplified by grouping the negative (-2 and -1) and positive (+2 and +1) scores, respectively, resulting in three groups: "+", "0", and "-".
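The two-step recoding described above can be written down compactly. The following is a minimal sketch for illustration only; the function names `to_five_point` and `to_group` are hypothetical and are not part of the scoring protocol.

```python
def to_five_point(score11: int) -> int:
    """Map an original score in -5..+5 onto the five-point scale -2..+2."""
    if not -5 <= score11 <= 5:
        raise ValueError("score must lie between -5 and +5")
    if score11 <= -4:      # -5, -4
        return -2
    if score11 <= -2:      # -3, -2
        return -1
    if score11 <= 1:       # -1, 0, +1
        return 0
    if score11 <= 3:       # +2, +3
        return 1
    return 2               # +4, +5


def to_group(score5: int) -> str:
    """Collapse a five-point score into the three analysis groups."""
    if score5 < 0:
        return "-"
    if score5 > 0:
        return "+"
    return "0"


# Example: an original score of +3 ends up in the "+" group.
example_group = to_group(to_five_point(3))
```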
The testing phase (phase 1) was evaluated by three experts. Two of the experts were animal behavior researchers (JM and CPHM), and one expert was a veterinarian (EW). The scores (n=53), collapsed into three classes ('-', '0', '+'), reached a percentage of agreement of 85%, with a free-marginal multi-rater kappa of κ=0.77, indicating good inter-rater reliability. To obtain an overall score for each dog, the score given by the majority (two out of three experts) was selected as the final score. Only 3 samples received negative scores, so this category was excluded from our analysis. Our final dataset included 50 samples, of which 32 had a zero/neutral score (26 with full agreement by all coders, 6 by majority) and 18 had a positive/excessive score (12 with full agreement by all coders, 6 by majority).
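As an illustration of the agreement statistics above, the sketch below computes a majority score, the overall percentage of agreement, and a free-marginal kappa. It assumes that the percentage of agreement is the mean pairwise agreement across raters and that chance agreement is 1/q for q categories; the function names are hypothetical. Under these assumptions, an agreement of 0.85 with three categories gives a kappa of about 0.775, in line with the reported 0.77.

```python
from collections import Counter


def majority_score(ratings):
    """Return the label given by more than half of the raters, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None


def percent_agreement(items):
    """Mean pairwise agreement; each item is a list with one label per rater."""
    total = 0.0
    for ratings in items:
        n = len(ratings)
        counts = Counter(ratings)
        # Fraction of rater pairs that agree on this item.
        total += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    return total / len(items)


def free_marginal_kappa(p_observed, n_categories):
    """Chance-corrected agreement with a fixed chance level of 1/q."""
    p_chance = 1.0 / n_categories
    return (p_observed - p_chance) / (1.0 - p_chance)


# Worked example with the agreement level reported in the text.
kappa_reported = free_marginal_kappa(0.85, 3)
```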

C-BARQ Questionnaire
The Canine Behavioral Assessment and Research Questionnaire (C-BARQ) is an instrument originally developed in English 21,22 and validated in multiple languages, including Dutch 23 .
The dog owners were asked to complete a Dutch2 version of the C-BARQ questionnaire.

Computational Approach
Figure 2 provides an overview of our approach for digital enhancement of dog behavioral testing. The purpose of any behavioral test is, eventually, to observe behaviors in response to various stimuli in a controlled and standardized environment. Based on a specific testing protocol, a scoring method is usually developed and evaluated for use by human experts. The practical aim of such scoring is to classify the elicited behaviors into categories (e.g., corresponding to specific behavioral traits or profiles) that can eventually be used for decision support. With the machine entering the scene, we have an alternative, mathematical and completely human-free way of "scoring" behaviors, or dividing them into categories. Since this test focuses on human-directed behavior, we assume that the participants' trajectories contain meaningful behavioral information about their reaction to the stranger. Therefore, we automatically extract and cluster the participants' trajectories, investigate the relationship of the emerging clusters to the experts' scoring, and compare how well they align. In addition, we examine the correlation of the emerging clusters with C-BARQ factors. Finally, we investigate whether we can use the obtained clusters to predict (i) the expert score, and (ii) C-BARQ factors.
In what follows we provide further details on the tracking method, the clustering method and the machine learning models for prediction of the above. The BLYZER system is a self-developed platform that aims to provide flexible automated behavior analysis and has been applied in several studies for analyzing dog behavior [31][32][33][34]. A similar approach was implemented on a smaller portion of the dataset used in this study in 35; however, in contrast to our approach here, manually chosen features were used for clustering.
BLYZER's input is video footage of a dog freely moving in a room and possibly interacting with objects, humans or other animals, while its output is a time series represented in a JSON file with the detected locations of the objects in each frame. Figure 3 shows the pipeline, highlighting the fact that both the tracking method (the models used for detection) and the scene (number of moving and fixed objects) are easily adapted. In the configuration used in this paper, the tracking method was chosen to be a neural network based on the Faster R-CNN architecture 36 pre-trained on the COCO 2017 dataset 37, which we retrained on an additional 106,768 images of two objects: a person and a dog. The images were collected from (1) the Open Images Dataset V6 38, (2) the Pascal VOC dataset 39, (3) the COCO dataset 37, and (4) images from previous studies 33,34. Figure 4 shows example frames from our dataset with dog and test person object detection, and Figure 5 presents examples of participants' trajectories extracted with BLYZER.
Quality of detection. To ensure sufficient tracking, only videos in which the dog and person were correctly detected in at least 80% of the frames were retained, leading to the exclusion of three videos (all three scored with a zero/neutral score). For the remaining 47 videos, we applied post-processing operations available in BLYZER to remove noise and enhance detection quality using smoothing and extrapolation techniques for the dog and test person detection, reaching almost perfect (above 95%) detection.
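The 80% inclusion criterion can be illustrated with a short sketch. The per-frame representation used here (a dictionary with `dog` and `person` entries that are either a position or `None`) is an assumption made for illustration; the actual BLYZER JSON layout is not described in this paper, and the function names are hypothetical.

```python
def detection_rate(frames):
    """Fraction of frames in which both the dog and the person were detected."""
    if not frames:
        return 0.0
    detected = sum(
        1 for frame in frames
        if frame.get("dog") is not None and frame.get("person") is not None
    )
    return detected / len(frames)


def keep_video(frames, threshold=0.80):
    """Apply the inclusion criterion: keep videos at or above the threshold."""
    return detection_rate(frames) >= threshold


# Toy video: both objects detected in 4 of 5 frames, so the rate is 0.8.
toy_video = [{"dog": (1.0, 2.0), "person": (0.0, 0.0)}] * 4 + [
    {"dog": None, "person": (0.0, 0.0)}
]
```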

Clustering Method
The videos from the trials are initially analyzed by the BLYZER tool, which produces for each frame the center of mass of the dog and person in the frame (if detected). To ensure smooth motion capture while standardizing between trials, we set a rate of 24 frames per second (FPS) across all videos. For frames in which the BLYZER tool was not able to detect either the dog or the person (or both), it linearly extrapolates their positions to fill the gap. In addition, since not all videos were of identical duration, we used the duration of the shortest video as the standard duration. As such, each trial (s ∈ R^{2m}) is defined by a time series with a fixed duration between samples, constructed from two vectors, one for the dog's position (d ∈ R^{m}) and the other for the person's position (p ∈ R^{m}). As a result, we obtain a dataset, D ∈ R^{n×m}. This is the time series data depicted in Figure 6, which presents the whole data analysis pipeline.
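The gap filling and length standardization described above might look roughly as follows. This is a sketch under stated assumptions, not BLYZER's implementation: gaps between two detections are filled by linear interpolation, gaps at the start or end of a track reuse the nearest detection, and resampling to 24 FPS is omitted. All names are hypothetical.

```python
def fill_gaps(track):
    """Fill missing positions (None) in a list of (x, y) tuples."""
    known = [i for i, pos in enumerate(track) if pos is not None]
    if not known:
        raise ValueError("track contains no detections")
    filled = list(track)
    for i, pos in enumerate(track):
        if pos is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:            # gap at the start: reuse first detection
            filled[i] = track[nxt]
        elif nxt is None:           # gap at the end: reuse last detection
            filled[i] = track[prev]
        else:                       # interior gap: linear interpolation
            t = (i - prev) / (nxt - prev)
            (x0, y0), (x1, y1) = track[prev], track[nxt]
            filled[i] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return filled


def truncate_to_shortest(tracks):
    """Cut every track to the length of the shortest one."""
    m = min(len(track) for track in tracks)
    return [track[:m] for track in tracks]
```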
For clustering trajectories, we used the time-series K-means clustering algorithm 40 with the elbow-point method 41 to find the optimal number of clusters (k). However, as the raw center of mass is not a very informative space, we decided to first transform the data into a "movement" space. To this end, we trained a small one-dimensional convolutional neural network (CNN) based AutoEncoder model 42 with the following architecture for the encoder: convolution with a window size of 3, dropout with p = 0.1, convolution with a window size of 3, dropout with p = 0.1, and max-pooling with a window size of 2. The decoder's architecture mirrors that of the encoder. We used the mean absolute error as the metric for the optimization process, together with the ADAM optimizer 43. The model's hyperparameters were found using a grid search 44. Using the encoder part of the trained AutoEncoder, we computed the "movement" space representation of each sample for the clustering. Once the clustering was obtained, the clusters were evaluated in two ways: (i) expert scoring metrics, and (ii) a Mann-Whitney U test on C-BARQ factors.
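To make the elbow-point idea concrete, the sketch below runs a plain Euclidean k-means on toy fixed-length vectors and picks the k at which the within-cluster error stops dropping sharply. It is a simplified stand-in, not the pipeline used in the study: the AutoEncoder step is omitted, ordinary k-means replaces the time-series variant, and both the farthest-point initialization and the second-difference elbow rule are assumptions chosen for reproducibility.

```python
import math


def _assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def kmeans(points, k, max_iter=100):
    """Lloyd's k-means; returns (labels, inertia)."""
    # Deterministic farthest-point initialization (assumption).
    centers = [tuple(points[0])]
    while len(centers) < k:
        centers.append(tuple(max(
            points, key=lambda p: min(math.dist(p, c) for c in centers))))
    for _ in range(max_iter):
        labels = _assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centers[j])  # keep an empty cluster's center
        if updated == centers:
            break
        centers = updated
    labels = _assign(points, centers)
    inertia = sum(math.dist(p, centers[lab]) ** 2
                  for p, lab in zip(points, labels))
    return labels, inertia


def elbow_k(points, k_max=6):
    """Pick k with the largest second difference of the inertia curve."""
    inertia = {k: kmeans(points, k)[1] for k in range(1, k_max + 1)}

    def bend(k):
        return (inertia[k - 1] - inertia[k]) - (inertia[k] - inertia[k + 1])

    return max(range(2, k_max), key=bend)


# Toy "movement space" vectors forming two well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
```

On the toy data, `elbow_k(toy_points, k_max=4)` selects two clusters, mirroring the kind of outcome reported for the real trajectories.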

Classification and regression machine learning models
The clustering was further used to obtain classification and regression models for predicting scoring (0/'+') and C-BARQ factors, respectively. We used the Tree-Based Pipeline Optimization Tool (TPOT), a genetic-algorithm-based automatic machine learning library 45. TPOT produces a full machine learning (ML) pipeline, including feature selection and engineering, model selection, model ensembling, and hyperparameter tuning, and has been shown to produce impressive results in a wide range of applications [46][47][48]. Hence, for every configuration of source and target variables investigated, we used TPOT, allowing it to test up to 10,000 ML pipelines. We chose 10,000 to balance the ability of TPOT to converge to an optimal (or at least close to optimal) ML pipeline against the computational burden associated with this task.
The performance of the obtained classification model for expert scoring was evaluated using the commonly used metrics of accuracy, precision, recall, and F1 score. The performance of the obtained regression model for C-BARQ factors was evaluated using Mean Absolute Error 49 (MAE), Mean Squared Error 49 (MSE), and R-squared 50 (R²).
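For readers less familiar with these metrics, the sketch below spells out their standard definitions in plain Python. It is illustrative only; treating '+' as the positive class is an assumption, and the function names are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="+"):
    """Accuracy, precision, recall and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """MAE, MSE and R-squared for real-valued predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    # R-squared is undefined when the targets have no variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "r2": r2}
```

Note that MAE and MSE depend on the scale of the target, whereas R² relates the error to the variance of the target, so a factor can show a small MAE together with a low R².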

Clusters correlation with scoring
Using the elbow method, two clusters emerged, of sizes 26 and 20 respectively. One sample was excluded as an outlier. As shown in Table 1, there is a fairly good separation between zero/neutral scores and positive/excessive scores: in the first cluster, the majority of participants (n=21) scored 0, while only 5 scored '+'. In the second cluster, the majority (n=13) scored '+', while 7 scored 0.
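The counts above can be arranged as a contingency table, from which a simple agreement figure follows. The sketch below rebuilds the table from the reported counts only (the per-dog assignments are not listed here, so the label lists are reconstructed placeholders) and assumes that cluster 1 is read as '0' and cluster 2 as '+'. Under that reading, 34 of 46 dogs (about 74%) fall in the cluster matching their expert score.

```python
from collections import Counter

# Placeholder label lists reconstructed from the reported counts.
clusters = [1] * 26 + [2] * 20
scores = ["0"] * 21 + ["+"] * 5 + ["0"] * 7 + ["+"] * 13

# Contingency table: (cluster, expert score) -> number of dogs.
table = Counter(zip(clusters, scores))

# Share of dogs whose cluster matches the expert score, assuming
# cluster 1 corresponds to '0' and cluster 2 to '+'.
matching = table[(1, "0")] + table[(2, "+")]
agreement = matching / len(clusters)
```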

Clusters correlation with C-BARQ
A significant difference between the two clusters was found with respect to Stranger-Directed Fear (SDF) (median of cluster 1: 0.00; cluster 2: 0.42) (Mann-Whitney U=120.5, z=-2.56, p=0.01). No other factors of the C-BARQ had a significant difference with respect to the clusters.
Table 2 presents the performance of the expert score classification model, reaching an accuracy of above 78%. Our findings revealed varying levels of error across the eight C-BARQ factors presented above. Owner-directed aggression (ODA) yields the lowest MAE of 0.108 and the second-lowest MSE of 0.046. Excitability (EXC) has the second-lowest MAE of 0.122 and the lowest MSE of 0.032. Stranger-directed aggression (SDA) and Separation-related behavior (SRB) demonstrated slightly higher MAE values of 0.275 and 0.257, respectively, with corresponding MSE values of 0.129 and 0.144. Moreover, Pain sensitivity (PS), Stranger-directed fear (SDF), Non-social fear (NSF), and Attachment-seeking behavior (ASB) exhibited MAE values ranging from 0.435 to 0.51, with MSE values ranging from 0.287 to 0.43. Furthermore, the R² values provided insights into the proportion of variance explained by the factors. EXC showed the highest R² value of 0.886, indicating a strong fit between the EXC factor and the time-series data. SDA and PS exhibited moderate R² values of 0.47 and 0.429, respectively, signifying a reasonable level of explanatory power. ODA and ASB demonstrated low R² values of 0.176 and 0.142, while SRB, NSF, and SDF showed the lowest R² values of 0.073, 0.043, and 0.032, indicating a weak association with the time-series data. These outcomes shed light on the predictive performance of the model and the varying impacts of the C-BARQ factors on the outcome.

Discussion
This study is another contribution to the growing field of computer-aided solutions for "soft" questions using data-driven methods [51][52][53][54]. To the best of our knowledge, this study is the first to provide a machine-learning model for objectively scoring a strictly controlled dog behavioral test.
In this study we used as a case study a Stranger Test routinely performed in a working dog organization to ask the following questions:
• Can a machine identify different 'behavioral profiles' in an objective, 'human-free' way, and how do these profiles relate to the scoring of human experts in this test?
• Can a machine predict the scoring of human experts in this test?
• Can a machine predict C-BARQ factors of the participating dogs?
Our results indicate positive answers to all of the above questions. Answering the first question, using unsupervised clustering, two clusters emerged, with a good separation between the two different scores 0/'+'. Answering the second question, we presented a classification model for predicting human scoring reaching 78% accuracy. Answering the third question, we presented a regression model which is able to predict Excitability and Owner-Directed Aggression with minimal error. These results demonstrate the potential of computational approaches in the domain of dog behavioral testing.
It is important to stress that the computational approach to the assessment of dog behavioral testing proposed here is 'human-free'. The agenda for a 'human-free' computational analysis of animal behavior was introduced in Forkosh 55. The author argued that despite the fact that automated tracking of animal movement is well-developed, the interpretation of animal behaviors remains human-dependent and thus inherently anthropomorphic and susceptible to bias. Indeed, in previous works applying computational approaches in the context of dog behavior 31,34,56,57, the features used for machine learning were explicitly selected by human experts.
By using such "human-free" clustering, two clusters emerged, roughly dividing the participants into a cluster of 'neutrally reacting' dogs with the majority scoring 0, and a cluster with a majority of 'excessively reacting' dogs scoring '+'. Interestingly, these clusters showed a significant difference in the Stranger-Directed Fear C-BARQ factor. However, a regression model for predicting this factor did not perform very well; the best performance was obtained for the Owner-Directed Aggression and Excitability factors. The latter could be related to the excessive behaviors typical of the '+' scoring that matched the response of dogs as measured by the C-BARQ "displaying strong reactions to potentially exciting or arousing events" 3. Further research is needed to establish clearer relationships.
The testing protocol used in our study refers to one specific aspect (towards/neutral/away from stressor) of stranger-directed behaviors. This protocol is used in a working dog organization for breeding outcome improvement and has been previously studied in the context of automation of tracking 56, also exploring some preliminary ideas of clustering (unlike the 'human-free' approach presented here). An in-depth exploration and scientific validation of this test is beyond the scope of the current study; we chose to use just one phase of this protocol due to its simplicity for automating tracking. Future research is needed to extend the presented approach to other phases of this protocol, as well as to other types of behavioral testing.
Our approach in this study was to validate the emerging clusters using expert scoring as a gold standard. However, this approach could be reversed in future studies, using mathematical, objective clustering as a 'ground truth' for testing various scoring schemes for behavioral testing protocols. For now we treat the machine as enhancing human capabilities; however, a day may come when this situation will be reversed, with the machine being the more objective and reliable way of analyzing behavioral testing data. It is our hope that this preliminary study will stimulate discussions on the value and great promise of AI in the context of dog behavioral testing.

Figure 1. The testing arena, captured from the top camera; stranger sitting in the middle, handler in the corner.

Figure 2. A conceptual framework for digitally enhanced dog behavioral testing.

Figure 4. Example frames of a participating dog and the test person being tracked by BLYZER.

Table 1. Cluster description in correlation with expert scoring.

Table 2. Evaluation metrics.

Table 3. Regression model metrics per C-BARQ factor, sorted by MAE.