Use of mouse-tracking software to detect faking-good behavior on personality questionnaires: an explorative study

The aim of the present study was to explore whether kinematic indicators could improve the detection of subjects demonstrating faking-good behaviour when responding to personality questionnaires. One hundred and twenty volunteers were randomly assigned to one of four experimental groups (honest unspeeded, faking-good unspeeded, honest speeded, and faking-good speeded). Participants were asked to respond to the MMPI-2 underreporting scales (L, K, S) and the PPI-R Virtuous Responding (VR) scale using a computer mouse. The collected data included T-point scores on the L, K, S, and VR scales; response times on these scales; and several temporal and spatial mouse parameters. These data were used to investigate the presence of significant differences between the two manipulated variables (honest vs. faking-good; speeded vs. unspeeded). The results demonstrated that T-scores were significantly higher in the faking-good condition relative to the honest condition; however, faking-good and honest respondents showed no statistically significant differences between the speeded and unspeeded conditions. Concerning temporal and spatial kinematic parameters, we observed mixed results for different scales and further investigations are required. The most consistent finding, albeit with small observed effects, regards the L scale, in which faking-good respondents took longer to respond to stimuli and outlined wider mouse trajectories to arrive at the given response.

Between September and October 2019, an additional 120 young adult volunteers were recruited as an out-of-sample evaluation group for the model built on the original sample. Participants were recruited in the same way as the prior sample and met the same inclusion/exclusion criteria. They received no reward for their participation. All subjects were aged 18 to 29 years (M = 22.73; SD = 2.84); half were male and half were female; all were Caucasian. Participants were randomly assigned to one of four experimental groups, following the same manipulation of the two factors, instructions (honest vs. faking-good) and time pressure (speeded vs. unspeeded), as the original sample. Group 1 (N = 30; M age = 23.53; SD = 2.70) was an honest-faking-good unspeeded (H-FG/U) group; group 2 (N = 30; M age = 21.97; SD = 2.57) was a faking-good-honest unspeeded (FG-H/U) group; group 3 (N = 30; M age = 22.67; SD = 2.91) was an honest-faking-good speeded (H-FG/S) group; and group 4 (N = 30; M age = 22.77; SD = 3.08) was a faking-good-honest speeded (FG-H/S) group. No statistically significant differences were observed between groups with respect to age. As the statistical analyses on the original sample (see "Results" section) highlighted that honest and faking-good respondents mainly differed in their responses on the L scale, the out-of-sample group was administered only the items of this scale. Moreover, in the second data collection, two methodological shortcomings were fixed: i) the instructions given to the honest and faking-good groups were matched, so that honest respondents were also informed that the test contained features designed to detect faking; and ii) the position of the response labels (true vs. false) was inverted to eliminate possible response biases due to the allocation of these labels.
All participants provided informed consent before the research began. They did not receive any compensation for their participation. The experimental procedure was approved by the local ethics committee (Board of the Department of Human Neuroscience, Faculty of Medicine and Dentistry, Sapienza University of Rome), in accordance with the Declaration of Helsinki.

Materials. Underreporting validity scales (L, K, S) of the MMPI-2. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) 37 is a 51-scale self-report questionnaire that is used to measure personality and psychopathology. It comprises 567 items that each require a dichotomous answer (true vs. false). The MMPI-2 is widely used in forensic and evaluation settings 38-42. The present study used three MMPI-2 validity scales: Lie (L), Correction (K), and Superlative Self-Presentation (S). The L scale, composed of 15 items, was designed to detect the acknowledgment of uncommon virtues and the tendency to offer a more socially acceptable image of oneself (e.g., "I do not always tell the truth"). Most of the items in this scale require respondents to choose "false" in order to answer in a socially desirable way. The K scale, composed of 30 items, was designed to detect defensiveness in a more subtle way, investigating respondents' adjustment and emotional control (e.g., "criticism or scolding hurts me terribly"). The S scale, composed of 50 items, was designed to identify self-presentation as highly virtuous and extremely well adjusted in any context (e.g., "I have never felt better in my life than I do now"). The higher a respondent scores on these scales, the higher the chance that he or she is presenting an overly positive self-image. The Italian version of the MMPI-2 was edited by Pancheri and Sirigatti 43,44.
Virtuous responding (VR) validity scale of the PPI-R. The Psychopathic Personality Inventory-Revised (PPI-R) 45 is a 154-item personality questionnaire, organized into 8 subscales, that assesses traits associated with psychopathy. Respondents must answer each item on a 4-point scale (true vs. true enough vs. false enough vs. false). The present study used the PPI-R Virtuous Responding (VR) validity scale, which is composed of 13 items (e.g., "I've never desired to hurt someone") and was designed to detect underreporting. The Italian version of the PPI-R was edited by La Marca et al. (2008) 46.
Research design. A mixed design was implemented. The two manipulated factors were instructions (H vs. FG) and time pressure (U vs. S). As described above, participants were randomly assigned to one of four experimental groups: H-FG/U, FG-H/U, H-FG/S, and FG-H/S.
In the first group (H-FG/U), subjects completed the tests (the L, K, and S scales of the MMPI-2; and the VR scale of the PPI-R) without time pressure, first with the instruction to respond honestly (1a) and then with the instruction to fake good (1b). Specifically, the instructions were as follows: 1a) We are interested in some characteristics of your personality. We want you to take this test in a totally sincere fashion. After reading each item you should take all the time you need to respond in the best way. 1b) You just completed the test honestly. Now imagine that you are applying for a desired job. In this situation, it would be to your advantage to appear as if you were completely normal and psychologically healthy. Stated differently, we want you to take this test and deliberately fake good. Pay attention, because the questionnaire contains features designed to detect faking, and your intent is to respond in a way that your deception cannot be detected. After reading each item you should take all the time you need to respond in the best way, according to this instruction.
In the second group (FG-H/U), subjects completed the test without time pressure, first with the instruction to fake good (2a) and then with the instruction to respond honestly (2b). Specifically, the instructions were as follows: 2a) We are interested in some characteristics of your personality. Imagine you are applying for a desired job. In this situation, it would be to your advantage to appear as if you were completely normal and psychologically healthy. Stated differently, we want you to take this test and deliberately fake good. Pay attention, because the questionnaire contains features designed to detect faking, and your intent is to respond in a way that your deception cannot be detected. After reading each item you should take all the time you need to respond in the best way, according to this instruction. 2b) You just completed the test dishonestly. Now, we are interested in some real characteristics of your personality. We want you to take this test in a totally sincere fashion. After reading each item you should take all the time you need to respond in the best way.
In the third group (H-FG/S), subjects completed the test with time pressure, first with the instruction to respond honestly (3a) and then with the instruction to fake good (3b). Specifically, the instructions were as follows: 3a) We are interested in some characteristics of your personality. We want you to take this test in a totally honest fashion. After reading each item you should respond as quickly as possible.
3b) You just completed the test honestly. Now imagine that you are applying for a desired job. In this situation it would be to your advantage to appear as if you were completely normal and psychologically healthy. Stated differently, we want you to take this test and deliberately fake good. Pay attention, because the questionnaire contains features designed to detect faking, and your intent is to respond in a way that your deception cannot be detected. After reading each item you should respond as quickly as possible. Short response time is an important factor in this test.
Finally, in the fourth group (FG-H/S), subjects completed the test with time pressure, first with the instruction to fake good (4a) and then with the instruction to respond honestly (4b). Specifically, the instructions were as follows: 4a) We are interested in some characteristics of your personality. Imagine you are applying for a desired job. In this situation it would be to your advantage to appear as if you were completely normal and psychologically healthy. Stated differently, we want you to take this test and deliberately fake good. Pay attention, because the questionnaire contains features designed to detect faking, and your intent is to respond in a way that your deception cannot be detected. After reading each item you should respond as quickly as possible. Short response time is an important factor in this test. 4b) You just completed the test dishonestly. Now, we are interested in some real characteristics of your personality. We want you to take this test in a totally honest fashion. After reading each item you should respond as quickly as possible. Short response time is an important factor in this test.
Procedure and stimuli. The experimental task was completed individually in a neutral, quiet room in the Human Neuroscience Department of Sapienza, University of Rome. Subjects, placed approximately 60 cm from the screen, completed the test on a 15-inch display laptop with a Microsoft Windows operating system. After the initial reception, participants went through the following procedure: a) completion of a consent form, b) completion of a demographic questionnaire, c) assignment to one of the four experimental groups previously described, d) completion of the experimental task (scripts L, K, S, and VR) with their respective group's first instructions (the abovementioned instructions 1a, 2a, 3a, and 4a), e) projection of an unrelated short video, and f) completion of the experimental task (scripts L, K, S, and VR) with their respective group's second instructions (the abovementioned instructions 1b, 2b, 3b, and 4b) (see Table 1).
The experimental task consisted of the 96 stimuli (i.e., items) belonging to the underreporting scales (L, K, S) of the MMPI-2 and the VR scale of the PPI-R (see Table S1). The presentation order of the stimuli reflected the item appearance order in the MMPI-2 protocol, followed by the item appearance order of the VR scale in the PPI-R. Stimuli were presented in the central display of the computer screen. Participants had to initiate the presentation of each question by clicking (with the mouse) a START button located in the centre-lower part of the screen. For items relating to the MMPI-2 validity scales, participants were asked to respond to each question by clicking (with the mouse) one of two alternative response buttons (TRUE vs. FALSE) presented in the upper part of the computer screen: one in the upper-left corner and one in the upper-right corner (see Fig. 1). For items relating to the VR scale of the PPI-R, participants had to choose one of four alternative response buttons (TRUE vs. TRUE ENOUGH vs. FALSE ENOUGH vs. FALSE).
Collected measures. During the experimental task, the MouseTracker software 25 automatically recorded a number of features relating to the response of the mouse in spatial and temporal terms. Mouse parameters that the literature reported to be the most sensitive to deception detection were collected 5 [...] The idealized trajectory represented the virtual straight line connecting the starting point to the endpoint (the response label). For example, if the START button (placed in the centre-lower part of the screen) and the response labels (placed in the upper-left and upper-right corners) formed a triangle, the idealized response trajectory would correspond to the side of the triangle that connected the START button to each response label. Because the recorded trajectories had different lengths, each motor response was time normalized in order to permit trials to be averaged and compared. Using linear interpolation, the software resampled each trajectory into 101 temporal frames. As a result, each trajectory had 101 temporal frames and each time frame had corresponding X and Y coordinates 25. Finally, for each spatial (MD, tAUC) and temporal (RT, MD-time, vel x, vel y) parameter, the average response value on each scale (L, K, S, VR) was computed, generating 24 variables (see Table S2). The T-scores on the underreporting scales (L, K, S) of the MMPI-2 and the VR scale of the PPI-R were also computed. The term T-scores is used to denote "test scores that - within rounding errors - have a mean of 50 and a standard deviation of 10 in the normal group" 48. A T-score is calculated by using the following linear transformation: T = 50 + 10 × (X_i − x̄) / s, in which X_i is the raw score to be converted, x̄ is the mean, and s is the standard deviation of the norm group 49.
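The two computations described above, time normalization by linear interpolation and the linear T-score transformation, can be illustrated with a minimal Python sketch. This is not the MouseTracker implementation: the function names `time_normalize` and `t_score` are hypothetical, and the sketch assumes that the raw mouse samples are equally spaced in time.

```python
def time_normalize(xs, ys, n_frames=101):
    """Resample a raw trajectory to n_frames equally spaced time steps
    using linear interpolation between neighbouring samples.
    Assumption: the raw samples are equally spaced in time."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) samples of equal length")
    last = len(xs) - 1
    out_x, out_y = [], []
    for k in range(n_frames):
        pos = k * last / (n_frames - 1)  # fractional index into the raw samples
        i = min(int(pos), last - 1)      # index of the left neighbour
        w = pos - i                      # weight given to the right neighbour
        out_x.append(xs[i] + w * (xs[i + 1] - xs[i]))
        out_y.append(ys[i] + w * (ys[i + 1] - ys[i]))
    return out_x, out_y


def t_score(raw, norm_mean, norm_sd):
    """Linear transformation T = 50 + 10 * (X_i - mean) / sd of the norm group."""
    if norm_sd <= 0:
        raise ValueError("the norm-group standard deviation must be positive")
    return 50 + 10 * (raw - norm_mean) / norm_sd
```

After resampling, every trial has the same number of frames, so trajectories of different durations can be averaged frame by frame.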
All measures, conditions, data exclusions, and methods used to determine the sample sizes are reported here.
Univariate statistical analysis. Mixed ANOVA models were used to test the six hypotheses (H1-H6) on the original sample of 120 participants. In more detail, comparisons were drawn between the performances obtained by the four experimental groups on each scale (L, K, S, VR), in terms of T-scores (H1 and H2), temporal features (H3 and H4), and spatial features (H5 and H6). Means and standard deviations of all dependent variables are shown in Table S3. The effect sizes of the score differences between groups were recorded; with respect to magnitude, η²_G = 0.02 was considered indicative of a small effect, η²_G = 0.13 a medium effect, and η²_G = 0.26 a large effect 50. All analyses were performed using the "ez" package in the R software 51.

H1 and H2.
A mixed ANOVA was computed on the T-scores of each scale (L, K, S, VR), and the results demonstrated a significant effect of instructions on each scale. In other words, faking-good respondents obtained significantly higher T-scores on the L, K, S, and VR scales relative to honest respondents. Table 2 reports the ANOVA outputs that highlight the statistically significant results.

H3 and H4.
Mouse movements were temporally described by four kinematic features: RT, MD-time, vel x, and vel y. For each feature, a mixed ANOVA was run to compare the temporal responses of the four experimental groups on the L, K, S, and VR scales. To address the multiple testing problem, the Bonferroni correction was applied, dividing the significance level (0.05) by the number of tested features (four) and thus setting it to 0.0125 52. Table 3 reports the output of the temporal features that showed statistically significant effects. It is worth noting that [...] when responding to items on the L, K, and S scales. They were also faster than faking-good respondents in moving along the y-axis (vel y) when responding to items on the K and VR scales.

H5 and H6.
The shape of the mouse trajectories was described by two spatial features: MD and tAUC. Similar to the analysis of temporal features, a mixed ANOVA was run to compare the mouse trajectories of the four experimental groups on the L, K, S, and VR scales. After the Bonferroni correction for the two tested features, the significance level was set to 0.025. The significant outputs are reported in Table 4. The results demonstrate a main effect of instructions on MD and tAUC for the L scale only. In other words, faking-good respondents had wider trajectories than honest respondents on the L scale (see Fig. 3).

Classification models.
To consider the contribution of all dependent variables in a single statistical model and to further investigate the accuracy of mouse tracking parameters in detecting faking-good behaviours when responding to personality inventories, several classification models were built using machine learning (ML) techniques. The ML approach is useful when the focus of the analysis is prediction rather than explanation 53. Moreover, learning algorithms can find patterns in highly complex datasets and can be effective even in the presence of complicated non-linear interactions 54. Using an ML approach, it is possible to build very complex models (e.g., considering a large number of variables) that are difficult to build with traditional statistical methods 55. Furthermore, model evaluation techniques (e.g., k-fold cross-validation) are intended to guarantee that the reported results are not overly optimistic. These machine learning models were implemented using the data mining software WEKA 3.9 56. Model accuracy was evaluated using a 10-fold cross-validation procedure 57, which consisted of repeatedly partitioning the original sample into a training set to train the model and a validation set to evaluate it. The data from the original sample of 120 participants, each of whom performed the task twice (honest vs. faking-good), were randomly partitioned into 10 equal-size subsamples, or folds (10 folds of 24 tasks). Of the 10 subsamples, a single subsample was retained as validation data for testing the model, and the remaining 9 subsamples were used as training data. This process was repeated 10 times, with each of the 10 folds used exactly once as validation data. The results of the 10 folds were then averaged to produce a single estimate of prediction accuracy. For each model, accuracy, recall, precision, F-measure, and ROC area were reported.
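The 10-fold procedure described above can be sketched in a few lines of Python. The study itself used WEKA, so this is only an illustration of the partitioning and averaging logic: the helper names (`k_fold_indices`, `cross_validate`, `fit_midpoint`, `predict_midpoint`) are hypothetical, and the one-feature midpoint classifier is a toy stand-in for the classifiers used in the study.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the item indices and split them into k equal-size folds."""
    if n_items % k != 0:
        raise ValueError("this sketch assumes n_items is divisible by k")
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    size = n_items // k
    return [idx[i * size:(i + 1) * size] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the accuracies."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k


def fit_midpoint(X, y):
    """Toy classifier: threshold halfway between the class means of feature 0."""
    m0 = [x[0] for x, label in zip(X, y) if label == 0]
    m1 = [x[0] for x, label in zip(X, y) if label == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2


def predict_midpoint(threshold, x):
    """Predict class 1 when feature 0 exceeds the learned threshold."""
    return 1 if x[0] > threshold else 0
```

With 240 tasks and k = 10, `k_fold_indices(240)` yields 10 folds of 24 tasks, and each task is used exactly once for validation.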
Finally, the weight of each variable (predictor) was examined by measuring the correlation (r pb ) between each variable and the outcome (honest vs. faking-good).
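As an illustration of the point-biserial correlation mentioned above, the following Python sketch computes r pb between a continuous predictor and a binary outcome. The function name `point_biserial` and the 0/1 coding of the outcome (e.g., 0 = honest, 1 = faking-good) are assumptions made for this example; the formula is the standard one and is numerically equal to the Pearson correlation with a dichotomous variable.

```python
from math import sqrt


def point_biserial(values, labels):
    """Point-biserial correlation between a continuous variable and a 0/1 label."""
    g1 = [v for v, label in zip(values, labels) if label == 1]
    g0 = [v for v, label in zip(values, labels) if label == 0]
    n1, n0, n = len(g1), len(g0), len(values)
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    mean_all = sum(values) / n
    # population standard deviation of the whole sample
    sd = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("the variable is constant")
    mean1, mean0 = sum(g1) / n1, sum(g0) / n0
    return (mean1 - mean0) / sd * sqrt(n1 * n0 / n ** 2)
```

A value near 0 indicates that the predictor barely separates the two classes, whereas values approaching 1 or -1 indicate strong separation.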
In a second phase, to test the generalization of the classifiers' performance on completely new data, some of the models were tested on the additional sample of 120 participants who had been recruited as an out-of-sample evaluation group. Because the models had been built to fit the original data, it was important to test how they would fit new and unseen data 58. The new group of participants (the test set) was collected after the models were built, so the subjects had never been seen by the ML classifiers.
As stated above, classification accuracy was evaluated using ML algorithms. Specifically, several algorithms were compared to investigate whether the results were stable across classifiers or whether they depended on specific model assumptions. In fact, the algorithms that were chosen were representative of different underlying classification strategies, including regression, trees, and Bayesian statistics (i.e., logistic 59, support vector machine (SVM) 60, naïve Bayes 61, and random forest 62 classifiers). For example, naïve Bayes is a probabilistic classifier based on Bayes' theorem. Naïve Bayes is a conditional probability model 51: given a problem instance to be classified, represented by a vector x = (x_1, …, x_n) of n features (independent variables), it assigns to this instance the probabilities p(C_k | x_1, …, x_n) for each of K possible outcomes or classes C_k.
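To make the conditional probability model concrete, the sketch below implements a small naïve Bayes classifier in Python that returns p(C_k | x_1, …, x_n) for each class. It is not the WEKA implementation used in the study: the function names are hypothetical, and modelling each feature with a class-specific Gaussian distribution is an assumption made for this illustration.

```python
from math import exp, log, pi


def fit_gaussian_nb(X, y):
    """Estimate, for each class, its prior and per-feature mean and variance."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            # small constant added to avoid a zero variance
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mu, var))
        model[c] = (len(rows) / len(y), stats)
    return model


def posterior(model, x):
    """Return p(C_k | x) for every class, assuming independent Gaussian features."""
    log_joint = {}
    for c, (prior, stats) in model.items():
        total = log(prior)
        for v, (mu, var) in zip(x, stats):
            total += -0.5 * log(2 * pi * var) - (v - mu) ** 2 / (2 * var)
        log_joint[c] = total
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    unnorm = {c: exp(v - top) for c, v in log_joint.items()}
    norm = sum(unnorm.values())
    return {c: u / norm for c, u in unnorm.items()}
```

The "naïve" part is the sum over features inside `posterior`: each feature contributes its own log-likelihood term, as if the features were independent given the class.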

ML model evaluation on all predictors.
Using a 10-fold cross-validation procedure, the classification models were first built and evaluated on all predictors (i.e. L, K, S, and VR T-scores; L, K, S, and VR MD; L, K, S and VR tAUC; L, K, S, and VR RTs; L, K, S, and VR MD-time; L, K, S, and VR vel x ; L, K, S, and VR vel y ). The results (see Table 6) demonstrated stable classification accuracy across all classifiers, ranging from 76.25% to 80%, with the random forest classifier demonstrating the best performance (80%).
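For readers less familiar with the metrics reported alongside accuracy, the following Python sketch shows how accuracy, precision, recall, and F-measure follow from the counts of a two-class confusion matrix. The function name `classification_metrics` is hypothetical, the faking-good class is arbitrarily treated as the positive class, and the example counts in the comment are invented for illustration and are not results of this study. Tools such as WEKA may additionally report class-weighted averages, which this sketch does not compute.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F-measure for the positive class.

    tp: faking-good tasks classified as faking-good (true positives)
    fp: honest tasks classified as faking-good (false positives)
    fn: faking-good tasks classified as honest (false negatives)
    tn: honest tasks classified as honest (true negatives)
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for 240 tasks (illustration only, not study data):
# classification_metrics(tp=90, fp=30, fn=30, tn=90)
```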
Analysis of the weight of predictors revealed that the variables most correlated with the outcome were the following: T-score S scale (r pb = 0.587), T-score VR scale (r pb = 0.574), T-score L scale (r pb = 0.564), vel x L scale (r pb = 0.564), T-score K scale (r pb = 0.545), vel x S scale (r pb = 0.528), vel x K scale (r pb = 0.519), and RT L scale (r pb = 0.203). For all other variables, the r pb value was less than 0.2. Moreover, using logistic regression we calculated the AUC value of the ROC curve for each independent variable, obtaining the following results: T-score S scale AUC [...]

ML model evaluation on L scale predictors, only.
As the univariate statistical analysis highlighted that honest and faking-good respondents differed on temporal and spatial parameters mainly on the L scale, a second set of ML models was evaluated using only L scale predictors: T-score L scale, RT L scale, MD-time L scale, vel x L scale, vel y L scale, MD L scale, and tAUC L scale. This was also the rationale for administering only the L scale items to the out-of-sample group.
The results are reported in Table 7, for both the 10-fold cross-validation and the test on the out-of-sample group. Classifiers showed accuracies ranging from 72.5% to 75.42% in the cross-validation, whereas they ranged from 78.75% to 81.67% in the out-of-sample group. These results demonstrate that: i) accuracies were stable across all classifiers; and ii) the models showed good generalization on completely new data, as the accuracies on the out-of-sample group outperformed those of the training models. With respect to the weight of predictors, T-score, vel x , and RT provided the most significant contributions to the model, while the others only fine-tuned the already good classification.

Discussion
The main aim of the present research was to explore whether kinematic indicators could improve the detection of subjects implementing faking-good behaviour when answering personality inventories, with and without time pressure.

T-scores on the MMPI-2 underreporting scales (L, K, S) and the PPI-R VR scale. The results supported the first hypothesis (H1), according to which T-scores on the underreporting scales (L, K, S) of the MMPI-2 and the VR scale of the PPI-R were expected to be higher in the faking-good condition compared to the honest condition. Indeed, on all scales, T-scores were significantly higher in the faking-good condition compared to the honest one. This finding is in line with the results of previous studies, indicating that fakers obtain high scores on MMPI underreporting scales 63. In this sense, it is not a startling result, as it simply reflects the fact that the study instructions were correctly understood by participants: subjects instructed to fake good presented themselves in a more positive way by selecting the socially desirable alternative.
Table 6. Results from the ML models evaluated on the entire set of predictors. For each classifier, the following metrics obtained by the 10-fold cross-validation procedure are reported: accuracy, precision, recall, F-measure, and ROC area.
(2020) 10:4835 | https://doi.org/10.1038/s41598-020-61636-5 | www.nature.com/scientificreports
Moreover, contrary to expectations, the second hypothesis (H2) found only partial support. This hypothesis expected T-scores on the selected underreporting scales to be higher in the faking-good speeded condition compared to the faking-good unspeeded condition, and T-scores of honest respondents to show no significant difference between the speeded and unspeeded conditions. In fact, neither faking-good nor honest respondents showed any statistically significant difference between the speeded and unspeeded conditions in terms of T-scores. On the one hand, honest respondents remained honest under the speeded condition, indicating that they were not affected by time pressure; on the other hand, faking-good respondents in the speeded condition did not show the expected significant increase in T-scores. This result does not exactly agree with the findings of previous studies 17,22,23, which demonstrated that a speeded condition induces fakers to significantly improve their self-presentation (as demonstrated by increased T-scores) relative to an unspeeded condition, on both the MMPI-2-RF L-r and K-r scales 17 and the MMPI-2 L and K scales 23. The authors of these studies suggested that time pressure may limit respondents' ability to consider the appropriateness of endorsing particularly virtuous items, and this may lead them to enhance their positive self-presentation and subsequently present less believable profiles. In this study, however, the T-scores of faking-good respondents were higher in the speeded condition than in the unspeeded condition on all scales, albeit not significantly. This lack of significance could be ascribed to the order in which subjects in the third experimental group (H-FG/S) completed the tests: since the administered questionnaires comprised the same items, participants who first completed them honestly and then completed them under faking-good instructions may have been biased by a learning effect.
Specifically, these respondents might have remembered the content of some items from the first administration, and this knowledge might have interfered with the effect of time pressure that has otherwise been observed in the literature. In other words, fakers take longer to respond because they must first identify the answer that could provide the most socially desirable image of themselves and choose this response over a true evaluation of their personality characteristics and mental functioning. Time is also necessary for a third evaluation-one that serves to estimate whether a particular answer will appear "too fake" and should subsequently be discarded for fear of discovery. Carrying out this triple evaluation-relating the questionnaire item to one's own person, identifying the most socially desirable answer, and identifying whether the question might moderate faking-good behaviour-requires time. Temporal pressure reduces the available evaluation time and likely leads fakers to omit the last step of the decision process, making their faking behaviour more easily discovered. However, in the present study, faking-good respondents who already knew the item content because they had previously filled it in honestly (i.e., group 3, H-FG/S) may have been able to save sufficient time to carry out all three of the evaluation steps and therefore lie in a less detectable way. Other subjects, who knew the contents of the questions when filling out the questionnaire for a second time but were instructed to do so honestly (i.e., group 4, FG-H/S) would not have altered their answer: indeed, the response chosen (true vs. false/true vs. true enough vs. false enough vs. false) was used to calculate the T-scores for the scales.

Differences in mouse movements and trajectories between honest respondents and fakers.
In the present study, mouse dynamics were used for the first time to investigate faking-good behaviour with respect to the validity scales of two personality questionnaires (the MMPI-2 and PPI-R). In the literature, mouse dynamics have been shown to provide useful behavioural cues to identify deception 33,64, and the technique has already been successfully applied to detect faking-bad respondents 5,47. In the present research, the results were consistent only for the L scale with the findings reported in previous studies, which have shown that, compared to honest participants, fakers take more time to respond to stimuli 18 and outline wider trajectories when selecting a response 5, albeit with small observed effects. Indeed, the results supported the fourth hypothesis (H4) only in relation to the L scale; on this scale, mouse movements were slower in the faking-good condition relative to the honest condition, since faking-good respondents spent more time than honest respondents in responding. In fact, fakers showed significantly slower RTs and MD-times only on the L scale. Furthermore, only on the L scale did faking-good respondents show wider mouse trajectories than honest respondents (see MD and tAUC parameters), with very small effects. The fifth hypothesis (H5), which expected faking-good respondents' mouse trajectories to be wider than those of honest respondents, was therefore only partially supported.
It is worth noting that the L scale demonstrated the most sensitivity in distinguishing between honest and faking-good respondents on the basis of both temporal and spatial parameters. This higher sensitivity may be due to the fact that this scale has particularly obvious items, with content pertaining to weaknesses and minor flaws that are observable in everyday life situations (e.g., "At times I feel like swearing, " "I do not always tell the truth"). Conversely, on the K scale, which is less transparent and focuses on more complex behaviours, honest respondents may require more time to choose the alternative that they feel best describes them. In the present study, the PPI-R underreporting scale VR was the least sensitive in differentiating honest from faking-good respondents. The lower number of significant scores in the VR statistics may be linked to a feature of the PPI-R test, itself: contrary to the MMPI-2, which offers dichotomous choices (TRUE vs. FALSE), the PPI-R offers four alternatives (TRUE, TRUE ENOUGH, FALSE ENOUGH, FALSE). In line with a previous study by Kiesler (1966), in which subjects selecting from four alternatives took longer than those selecting between two alternatives, the VR scale may have presented both honest and faking-good respondents with a more cognitively demanding decision-making process in both the speeded and unspeeded conditions. Future research should aim at uncovering the influence of the number of alternatives on test items by using exclusively underreporting scales offering four alternatives.
Effect of time pressure on mouse movements and trajectories. Besides exploring the possibility of using mouse trajectories to detect faking-good behaviour, the present study also analysed the effect of time pressure on mouse movements. Results showed that participants in the speeded condition responded to all MMPI-2 items faster than those in the unspeeded condition (see RT and MD-time parameters). This result primarily reflects the fact that the instructions given to participants were effective in creating time pressure on subjects assigned to the speeded experimental condition, giving support for the third hypothesis (H3), which expected mouse movements to be faster in the speeded condition relative to the unspeeded condition. Furthermore, the result suggests that time pressure has an effect on the temporal parameters measured via mouse tracking that, as previously explained, are not necessarily the same as those measured by previous studies (which have typically used RTs). A more interesting finding is that time pressure affected the shape of the trajectories outlined by participants. Indeed, the statistical analysis found support for the sixth hypothesis (H6), which expected mouse trajectories for subjects in the speeded condition to be wider than those in the unspeeded condition. Specifically, MD was particularly sensitive to time pressure, with participants in the speeded condition showing greater MD than those in the unspeeded condition, even if with small effects. This result seems to corroborate the hypothesis that it is difficult to control more than one movement parameter, so the performance of participants decreases in spatial terms when time is limited.
Finally, as no significant interaction was found between the time pressure and instruction conditions, it can be concluded that time pressure is not a critical factor for the detection of faking-good behaviour when kinematic parameters are available.
Predictive value of the technique. To investigate the predictive value of the abovementioned variables (T-scores and mouse dynamics) in detecting faking-good respondents in relation to the validity scales of personality questionnaires (i.e., the MMPI-2 L, K, and S scales and the PPI-R VR scale), ML models were trained and validated. The results can be summarized as follows:
• The models achieved accuracies ranging from 72% to 80%.
• Different ML algorithms based on different classification strategies produced similar accuracies, confirming that the results were not dependent on the model assumptions.
• The most significant contribution to prediction accuracy was provided by T-scores and velocity along the x-axis; this was true for all scales, except for vel x of the VR scale. RT of the L scale seemed to fine-tune the models.
• Entering only L scale predictors in the models produced similar results as entering all variables as predictors.
• The models that were built on the original sample performed with similar accuracy when tested on the out-of-sample group, showing very good generalization to new data.
To conclude, this exploratory study suggests that some parameters of mouse dynamics, especially velocity on the x-axis, could be useful for detecting subjects who fake good when completing the validity scales of the MMPI-2 and PPI-R personality questionnaires, regardless of whether a time pressure condition is imposed. However, upon comparing the accuracy obtained in this study (72-80%) with the accuracies reported in previous studies, it seems that mouse parameters may be less accurate than simple RT analysis (with reported accuracy ranging from 75% to 95%) 22 for this task. The present findings are still preliminary and confirmatory studies are needed. Future research would benefit from studies situated in an ecological context; for example, studies with actual job applicants or child custody litigants. This would help to generalize previous results obtained with instructed participants and experimental/manipulated designs, with the aim of including behavioural features for faking detection in personnel and forensic real-life settings 55. Moreover, future studies could focus on improving convergent validity by applying additional behavioural and implicit parameters and measuring these with eye-tracking 65 and face-reading techniques 66.

Data availability
The dataset that was generated and analyzed for the current study is publicly available, along with the source code of the task, the analysis code, and the instructions: https://doi.org/10.5281/zenodo.3529450.