Introduction

In the modern era, the hectic lifestyle of most people makes surveys an easy and rapid way of collecting feedback data without any need for direct human-to-human interaction. Although conducting a survey is a simple solution to data collection in many cases, it also brings some drawbacks. Respondents with different backgrounds and mentalities respond to surveys differently: some recognize the importance of completing the survey and addressing each item, while others take the easy road and complete it carelessly. Careless responding is categorized into several types: random respondents, who select answers at random; midpoint respondents, who are more likely to select middle categories; and fixed-pattern respondents (e.g. 1, 2, 1, 2...)1,2.

The proportion of so-called careless respondents determines the quality of the collected survey data. As reported by Credé, careless responding rates as low as 5% can significantly affect observed correlations, comparably to important study artifacts such as range restriction, dichotomization of continuous variables, and score unreliability3. As a result, the identification of careless respondents is essential for further improvements in this field of research.

Previous studies proposed various methods to detect careless respondents. According to Johnson, Jackson recommended in 1976 an index now known as the even-odd consistency score4. The even-odd consistency score examines the relationship between scores obtained from the odd and even halves of different subscales4. Maniaci and Rogge reported that a value below 0.3 indicates careless responding5. Johnson also stated that Goldberg suggested a method based on psychometric antonyms, i.e. item pairs that are highly negatively correlated4. Based on that method, Meade and Craig developed the psychometric synonyms index, which, on the other hand, relies on item pairs that are highly positively correlated1. Maniaci and Rogge observed that the most substantial average increase in power was achieved when the threshold values for indices of careless responding, specifically psychometric antonyms and psychometric synonyms, were below \(-0.65\) and \(-0.03\), respectively5. In the same study, they reported that the suggested cutoff value for psychometric antonyms (\(<{-0.03}\)4,6) resulted in a 4–9% drop in power. According to the study of Nielsen et al., the even-odd consistency score and the psychometric indices are of limited usefulness for questionnaires with a small number of items and scales (e.g. thirty subscales)7.

Another index, the longstring index, was introduced in the study conducted by Johnson4. It is the length of the longest sequence of consecutive identical answers provided by an individual. While Maniaci and Rogge reported that a longstring index greater than 7 is a suitable cutoff value for detecting careless responding5, Johnson points out that the longstring index is sensitive to responses that exhibit extreme consistency and that a cutoff value is difficult to determine4. Costa and McCrae offered recommendations for the maximum length of longstrings using the NEO-PI-R as a basis8. Another index of careless responding, the Mahalanobis distance, has been demonstrated to be efficient at detecting careless respondents9. It assumes that responses significantly deviating from the sample norm (resulting in a larger Mahalanobis distance) may indicate careless responding. Its drawback is that it is efficient only when respondents generate genuinely random responses1. In 2018, Dunn et al.10 introduced an indicator named intra-individual response variability; other, more advanced methods followed, such as the systematic approach of floodlight detection of careless respondents11. The drawbacks of intra-individual response variability include the necessity to calculate it across multiple constructs and reversely coded items. Furthermore, both low and high variability could potentially indicate careless responding2,10,12. Goldammer et al.13 found response time per item, personal reliability, psychometric synonyms/antonyms and Mahalanobis distance to be effective methods for detecting carelessness, whereas longstring and intra-individual response variability were not significantly related to the detection of careless respondents. In 2022, Wind and Wang14 used Mokken scale analysis to detect carelessness in surveys and showed the robustness of Mokken scale analysis indicators of item quality to the presence of careless respondents. It remains unknown how Mokken scale analysis performs on data with missing responses and with carelessness patterns other than random responses and overly consistent responding14. Arias et al. employed a factor mixture model specifically created to identify discrepancies in the way individuals respond to items that have varying semantic polarity15. Another study put forth a model based on item response theory that aims to identify and model careless responding at a detailed level, considering both the respondent and the specific item. Their model can detect various patterns of careless responding and contributes to a more comprehensive understanding of the item characteristics associated with its occurrence16. Ulitzsch et al. introduced a model-based approach that utilizes response time data from computer-administered questionnaires to simultaneously detect different manifestations of careless responding17. Their approach considers the characteristics of attentive response behavior on questionnaires by incorporating the distance-difficulty hypothesis. It acknowledges that attentiveness can vary at the screen-by-respondent level and accommodates individuals with different traits who may exhibit distinct levels of attentiveness. At the same time, it addresses a wide range of response patterns that emerge due to careless responding17.

Regarding interpretability, Effrosynidis and Arampatzis18 compared 12 variable selection methods and showed that the ensemble-based method Reciprocal Ranking is the most effective, while the best individual method appeared to be SHapley Additive exPlanations (SHAP). SHAP is a method used to explain the decision-making process of models, specifically focusing on understanding sample-level decisions19. Interpretability of the decisions made by a prediction model plays a major role in understanding machine learning (ML) algorithms and can be presented in various ways. To date, well-known approaches such as global interpretability and local interpretability exist and have been applied in several studies20,21,22,23,24, alongside alternative distinctions such as model-specific and model-agnostic approaches25. While global interpretability focuses on decisions made at a population level, local interpretability goes further by emphasizing decisions made at an individual level26. Since this article focuses on eliminating careless respondents based on the characteristics of an individual, the latter approach is applicable to our study.

In terms of the interpretability of the method, Liu et al.27 proposed a framework to predict and interpret patient satisfaction with a Random Forest and a local explanation method. However, we have not encountered any studies that include local interpretability of methods dealing with the detection of careless respondents.

Schroeders et al.2 introduced a novel response time-based approach using a gradient boosting machine (GBM) model to identify careless respondents. Their study compared the proposed model against traditional methods for identifying careless respondents.

The purpose of this paper was to examine prediction model interpretability techniques to evaluate decisions made by the proposed model at the single participant level and to identify factors that mislead the model, which consequently results in misclassifications.

Materials and methods

The data used in our study is publicly accessible through the Open Science Framework (OSF) repository28; the link is provided in the “Data availability” section. It was collected by Schroeders et al.2 as part of their study on the detection of careless respondents. More specifically, they conducted a 3-month web-based survey study in the first third of 2020. Participants were randomly assigned to two groups. According to the study of Schroeders et al.2, participants in the first group answered the given questions after carefully considering all given options (regular respondents), while participants in the second group were required to respond to the same questions in a speedy, careless manner (careless respondents). Therefore, careless responding in this study refers to participants who were instructed to respond to survey questions in a speedy and careless manner (for details, see Schroeders et al.2).

In parallel, the authors collected demographic data (such as age, gender and profession) and data on participants’ personality traits, namely honesty-humility, emotionality, extraversion, agreeableness, conscientiousness and openness to experience. They additionally tracked and stored response times for each section of the questionnaire (six personality trait sections of 10 items each).

In this study, we examined the option of using heterogeneous data to explain the characteristics of careless respondents. For that purpose, we compared the performance of the prediction models built under the following scenarios:

  • Raw data as the responses (some of them reversed) to the questionnaire questions (resp, 60 variables)

  • Data that provide information about the time of answering each section of questions (rt, six variables)

  • Only indices for careless responding (Careless, 13 variables)

  • Demographic data (dem, four variables)

  • Combination pairs of resp, rt and dem (resp_dem, rt_dem, resp_rt)

  • Data consisting of all three sources of data in a single dataset (all)

  • Data combining all three sources with the extracted indices of careless responding (i.e. variables calculated from the raw data using data-driven detection mechanisms for careless respondents) (all_extracted)

In the study by Schroeders et al.2, the highest average balanced accuracy was achieved by the GBM prediction model built on the combination pair resp_rt (0.66 ± 0.06). The same combination pair resp_rt was used in model training and served as the baseline model in the first part of this study. In contrast to the original study by Schroeders et al., we also used extracted indices of careless responding (i.e. variables calculated from the available raw data by applying the established indices of careless responding). It should be noted that the original study mentions the possibility of using indices of careless responding to build prediction models, but also reports that the prediction performance gains from these additional indices were insignificant.

In this study, however, we use prediction model interpretability approaches to analyze to what extent the indices of careless responding can help us understand the characteristics of careless respondents and why some predictions of the prediction model are wrong.

Experimental setup

Initially, the source code for data cleaning was obtained from the repository accompanying the study conducted by Schroeders et al.2. Additionally, we added code to create a subset of demographic data, a subset combining response time and demographic data, and a subset with responses and demographic data. Similarly, a dataset with all variables and the additional indices of careless responding was created, as sketched below.
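A minimal sketch of this subset construction follows; the demographic column names and the outcome column group are assumptions for illustration, while the HE01_ and time_p prefixes follow the variable names used later in this paper.

```r
# Sketch of subset construction; the demographic column names and the
# outcome column "group" are illustrative assumptions, not the exact
# names used in the OSF dataset. "dat" is the cleaned full dataset.
dem_cols  <- c("age", "gender", "profession", "education")  # hypothetical
resp_cols <- grep("^HE01_", names(dat), value = TRUE)       # 60 response items
rt_cols   <- grep("^time_p", names(dat), value = TRUE)      # 6 section times

resp     <- dat[, c(resp_cols, "group")]
rt       <- dat[, c(rt_cols, "group")]
dem      <- dat[, c(dem_cols, "group")]
resp_rt  <- dat[, c(resp_cols, rt_cols, "group")]
all_data <- dat[, c(resp_cols, rt_cols, dem_cols, "group")]
```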

According to Schroeders et al., the online survey was conducted using the SoSci-Survey tool from February 2020 to April 20202. Schroeders et al. reported that a total of 605 respondents took part, with 361 participants under normal conditions and 244 participants under careless conditions. Among the respondents, approximately two-thirds were female, around one-third were male, and a small percentage identified as diverse. The average age of the participants was 43.1 years (± 17.8). The composition of the sample, combining both regular and careless conditions, consisted of 28.1% students, 2.8% manual workers, 39.3% employees, 5% self-employed individuals, 16.5% retired individuals, and 8.3% belonging to other categories2.

The GBM model was evaluated using tenfold cross-validation repeated a hundred times; each repetition was performed with a different seed number, which ensures the reproducibility of results. Training (n = 420) and test (n = 185) sets were built by sampling without replacement, each time with a different seed number. During the validation process, the following hyper-parameters were evaluated: interaction depth, the minimum number of observations in trees at the leaf level, the number of trees and the shrinkage (learning rate). Performance was reported as balanced accuracy. For the purpose of the case study, the GBM model was trained on a subset consisting of all data and indices of careless responding (all_extracted) with the seed number set to one. The model fit was stored, as well as the training and test sets. A DALEX explainer was trained on the training set and later utilized within a supplementary web application.
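The evaluation loop can be sketched as follows; the hyper-parameter values shown and the 0/1 outcome column group are illustrative assumptions, not the tuned values of the original study.

```r
library(gbm)

# Balanced accuracy from true labels and predicted classes (1 = careless).
bal_acc <- function(truth, pred) {
  sens <- mean(pred[truth == 1] == 1)   # sensitivity on careless respondents
  spec <- mean(pred[truth == 0] == 0)   # specificity on regular respondents
  (sens + spec) / 2
}

res <- sapply(1:100, function(seed) {
  set.seed(seed)                                  # a different seed per repetition
  idx   <- sample(nrow(dat), 420)                 # training set, sampled without replacement
  train <- dat[idx, ]
  test  <- dat[-idx, ]
  fit   <- gbm(group ~ ., data = train, distribution = "bernoulli",
               n.trees = 300, interaction.depth = 3,  # illustrative values; the
               n.minobsinnode = 10, shrinkage = 0.1,  # actual grid was tuned in CV
               cv.folds = 10)
  best  <- gbm.perf(fit, method = "cv", plot.it = FALSE)  # tree count at minimal CV error
  prob  <- predict(fit, test, n.trees = best, type = "response")
  bal_acc(test$group, as.numeric(prob > 0.5))
})
c(mean = mean(res), sd = sd(res))
```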

Evaluation metrics

The following evaluation metrics were included in this study: balanced accuracy for comparing prediction models, and others, such as the area under the curve (AUC), F1-score, sensitivity, and specificity, that are included in the web application. The majority of them can be calculated from a confusion matrix (Table 1).

Table 1 Confusion matrix.

Sensitivity is a metric that measures the proportion of actual positive samples (true positive—TP) that are correctly predicted by the prediction model.

$$\begin{aligned} Sensitivity=\frac{TP}{TP+FN}. \end{aligned}$$

On the contrary, specificity is a metric that measures the proportion of actual negative samples (true negative—TN) that are correctly predicted.

$$\begin{aligned} Specificity=\frac{TN}{TN+FP}. \end{aligned}$$

Balanced accuracy is the metric that averages sensitivity and specificity.

$$\begin{aligned} \text {Balanced accuracy}=\frac{Sensitivity+Specificity}{2} \end{aligned}$$

F1-score is the harmonic mean of precision and sensitivity.

$$\begin{aligned} F1=\frac{2*Precision*Sensitivity}{Precision+Sensitivity}. \end{aligned}$$

The area under the curve is calculated as the area below the Receiver Operating Characteristic (ROC) curve. The ROC curve is displayed on a two-dimensional graph, where data points are determined by sensitivity (y-axis) and (1 − specificity) (x-axis) for all possible cutoff values29.
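For completeness, a small sketch computing these metrics directly from confusion matrix counts (the counts shown are purely illustrative):

```r
# Evaluation metrics derived from confusion matrix counts (TP, FP, TN, FN).
metrics <- function(TP, FP, TN, FN) {
  sens <- TP / (TP + FN)      # sensitivity (recall on careless respondents)
  spec <- TN / (TN + FP)      # specificity
  prec <- TP / (TP + FP)      # precision
  list(sensitivity       = sens,
       specificity       = spec,
       balanced_accuracy = (sens + spec) / 2,
       F1                = 2 * prec * sens / (prec + sens))
}
metrics(TP = 80, FP = 15, TN = 70, FN = 20)  # illustrative counts
```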

Gradient boosting machine

Gradient boosting machine (GBM) is an ensemble machine learning method that combines the knowledge of several weak prediction models. The ensemble of models is built sequentially, where each subsequent model corrects the mistakes made by the previous models, thereby minimizing a loss function30,31.
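The principle can be illustrated with a hand-rolled boosting loop for squared loss, where each regression stump is fitted to the residuals (the negative gradient) of the ensemble built so far; the data and settings are purely illustrative, not part of the study.

```r
library(rpart)

# Toy illustration of gradient boosting for squared loss.
set.seed(1)
toy   <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.2)

pred <- rep(mean(toy$y), 200)   # initial constant model
eta  <- 0.1                     # shrinkage (learning rate)
for (m in 1:100) {
  toy$res <- toy$y - pred                               # residuals = negative gradient of squared loss
  stump   <- rpart(res ~ x, data = toy, maxdepth = 1)   # weak learner (stump)
  pred    <- pred + eta * predict(stump, toy)           # sequential correction
}
```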

Indices of careless responding

The Careless package32 for the statistical language R, which implements indices of careless responding, was used to calculate the indices needed for generating the all_extracted dataset. This library provides different data-driven detection mechanisms to detect careless respondents (denoted as indices of careless responding in this paper): psychometric synonym (res_psycsyn), psychometric antonym (res_psychant), longstring (str), average longstring (avgstr), intra-individual response variability (irv), Mahalanobis distance (res_mahad) and even-odd consistency index (res_evenodd).

The following indices of careless responding were used as additional variables in the dataset. The psychometric synonym (psychsyn) score is a measure based on strongly positively correlated item pairs; each respondent’s score is calculated as the within-person correlation between corresponding item pairs. In contrast, the psychometric antonym (psychant) score is based on strongly negatively correlated item pairs, with the respondent-level score determined in an identical way1. The longstring index represents the number of consecutive identical responses made by a respondent. Its variation, the average longstring, represents the average number of consecutive identical responses for each respondent4. Another metric, intra-individual response variability (irv), is the standard deviation of a respondent’s responses across all items. It is considered an extension of the longstring index10. The Mahalanobis distance (res_mahad) measures the “multivariate distance between respondent’s response vector and the vector of sample means”1. The even-odd consistency index (res_evenodd) is the within-person product-moment correlation between the even-numbered and odd-numbered half-scale scores4.
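These indices can be computed along the following lines (a sketch assuming resp is a data frame of the 60 questionnaire items, ordered so that each of the six subscales spans 10 consecutive items; the critical values are left at the package defaults):

```r
library(careless)

res_psycsyn  <- psychsyn(resp, critval = .60)           # psychometric synonym score
res_psychant <- psychant(resp, critval = -.60)          # psychometric antonym score
ls           <- longstring(resp, avg = TRUE)            # longstring (str) and average longstring (avgstr)
irv_scores   <- irv(resp, split = TRUE, num.split = 6)  # total and per-section irv (irvTotal, irv1, ...)
res_mahad    <- mahad(resp, plot = FALSE)               # Mahalanobis distance
res_evenodd  <- evenodd(resp, factors = rep(10, 6))     # even-odd consistency index
```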

Shapley additive explanations

In this subsection, we describe a prediction model interpretability technique that was used to obtain an explanation of the model predictions.

Shapley additive explanations, also known as SHAP, is a technique for explaining a model’s decisions at the level of an individual sample. It applies principles of game theory to compute Shapley values, which measure the variables’ influence on the model’s prediction. A Shapley value is the result of averaging a variable’s marginal contributions over every possible subset of the remaining variables and is determined with the following formulation19,33:

$$\begin{aligned} \phi _i=\sum _{S\subseteq F\backslash \{i\}} \frac{|S|!(|F|-|S|-1)!}{|F|!}\left[ f_{S\cup \{i\}}(x_{S\cup \{i\}})-f_S(x_S)\right] , \end{aligned}$$

where \(F\) is the set of all variables, \(f_{S\cup \{i\}}\) is the model trained with the variable \(i\) included, \(f_S\) is the model with that variable withheld and \(x_S\) are the values of the variables within the subset \(S\). For a detailed explanation, please see the publication of Lundberg et al.34. A higher positive Shapley value suggests that a variable has greater importance in determining a positive outcome, while a more negative Shapley value indicates greater importance in determining a negative outcome. When a Shapley value is equal to zero, the variable contributes nothing and is not directly connected to the decision35.
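In practice, we obtained such explanations through the DALEX package rather than by retraining models on variable subsets. A sketch of the call follows; gbm_fit, train_x, train_y, test_x and best stand for the stored model fit, data and tuned tree count from the experimental setup and are assumptions for illustration.

```r
library(DALEX)

explainer <- explain(gbm_fit,
                     data  = train_x,                 # predictors of the training set
                     y     = train_y,                 # 1 = careless, 0 = regular
                     predict_function = function(m, x)
                       predict(m, x, n.trees = best, type = "response"),
                     label = "GBM")

# SHAP decomposition for a single respondent; B is the number of random
# variable orderings averaged over when estimating the Shapley values.
shap_one <- predict_parts(explainer,
                          new_observation = test_x[1, ],
                          type = "shap", B = 25)
plot(shap_one)
```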

Supplementary web application

Additionally, we built a Shiny application36 that allows comparison of the proposed GBM model, built on data consisting of responses, demographic information, and indices of careless responding. In a separate tab of the web application, we offer an insight into the local interpretability of the GBM model built on the all_extracted data, including response times as well as the other indices of careless responding.

In the comparison tab, GBM performance is compared against the following indices of careless responding: psychometric synonym score (psychsyn), psychometric antonym score (psychant), longstring and average longstring. In addition, we provided adjustable decision thresholds that serve as boundaries between predicted careless respondents and regular respondents. These thresholds can be interactively modified for each careless metric to obtain the highest possible performance. Performance can be compared with evaluation metrics such as the area under the curve (auc), F1-score (F1), sensitivity (sens) and specificity (spec).

Moreover, we provided additional explanations of the decisions made by the GBM model. The second tab of the web application offers a visualization of the SHAP (Shapley) values, which provide an overview of local interpretability34. Local interpretability can be displayed per response on demand. We included a filtering option that narrows the selection to responses that were either correctly predicted careless responses (true positives—TP), incorrectly predicted careless responses (false positives—FP), correctly predicted regular responses (true negatives—TN) or incorrectly predicted regular responses (false negatives—FN).
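The filtering logic amounts to labelling each test-set prediction with its confusion matrix cell; a minimal sketch, reusing test and prob from the evaluation sketch in “Experimental setup”:

```r
# Label each test-set prediction as TP, FP, TN or FN (positive class = careless);
# "test" and "prob" are assumed from the earlier evaluation sketch.
outcome <- function(truth, pred) {
  ifelse(truth == 1 & pred == 1, "TP",
  ifelse(truth == 0 & pred == 1, "FP",
  ifelse(truth == 0 & pred == 0, "TN", "FN")))
}

test$status <- outcome(test$group, as.numeric(prob > 0.5))
subset(test, status == "FN")  # e.g. careless respondents misclassified as regular
```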

The web application is available at: https://lkopitar.shinyapps.io/CarelessGBMShap/.

Statistical analysis

All experiments were conducted in the programming language R37, using four main packages: gbm30,38, Careless39, Shiny40,41 and DALEX42. The classification performance of the prediction models was measured as the area under the ROC curve with corresponding standard deviations.

Quartile representation was used in the figures produced by the DALEX package (see Supplementary Figs. S1, S2) and in the figure displaying a comparison of model performance between models built on different sets of data. DALEX uses boxplots by default, offering no option to change the representation to any other type. The figure comparing the models’ performances contains individual performance points; consequently, we found the boxplot to be the most suitable method for presenting such information. All remaining figures are based on average contributions with confidence intervals (95% CI), with an emphasis on demonstrating variable importance.

The source code for all experiments and supplementary web application is documented and publicly available at the following URL: https://github.com/lkopitar/Careless_SHAP.

Ethics declarations

Informed consent was not needed, since the data used in this study is publicly available and was previously published.

Results

Based on balanced accuracy, the GBM model built solely on response time data (0.637 ± 0.044) performed slightly better than the model built on responses alone (0.604 ± 0.042) (Fig. 1). Demographic data did not provide enough useful information for adequate prediction performance (0.518 ± 0.046), nor did the subset containing only indices of careless responding (0.589 ± 0.061). Furthermore, fusing demographic and response data brought no significant difference in model performance (0.601 ± 0.038) compared with the performance reached with response-only data. Similar observations can be made when the GBM model built on a combination of response time and demographic data (0.631 ± 0.042) is compared with response time data alone (0.637 ± 0.044). While adding demographic data to the previous pair combinations did not contribute to overall performance, merging response and response time data (baseline model) caused a slightly larger, yet still insignificant, shift toward higher performance (0.645 ± 0.045). Prediction models built on all types of data, including or excluding extracted information, did not result in any additional significant increase in performance (Fig. 1).

Figure 1
figure 1

Comparison of models’ performances (balanced accuracy) of different detection methods (dem-demographic; resp-response; rt-response time; and others as described in “Materials and methods”).

Since the performance of the baseline GBM model could not be significantly improved, we questioned whether indices of careless responding can provide any useful information and reveal any hidden patterns in the decisions made by the baseline model. For that reason, we decided to examine the explainability of the prediction model built on all_extracted, which reveals the characteristics of careless respondents in survey data.

More information on the interpretability of the careless respondents’ prediction model used in our case study can be found in the Supplementary material.

Comparison of SHAP contributions

Comparing SHAP contributions under different circumstances can reveal the contribution of variables in distinguishing careless respondents from regular respondents. Furthermore, it indicates a path for reducing the cost incurred by false negatives. Three comparisons were conducted: a comparison of the contributions of correctly predicted careless respondents and correctly predicted regular respondents, a comparison of the contributions of correctly and incorrectly predicted careless respondents, and finally, a comparison of the Shapley values of models built on data with and without indices of careless responding.

Comparison 1: Correct decisions of GBM model among careless and regular respondents

An opposite average contribution signifies a positive contribution of a particular variable for regular respondents and, at the same time, a negative contribution of that variable for careless respondents, or the other way around. Variables with opposite average contributions among correctly predicted careless respondents and correctly predicted regular respondents were the psychometric synonym score (res_psycsyn), “I would never take a bribe, even if it was a lot” (HE01_36), time spent on section 5 (time_p5) and others, such as the total intra-individual response variability (irvTotal), average length of consecutive identical responses (avgstr), “I feel strong emotions when someone close to me leaves for an extended period of time.” (HE01_47), time spent on section 4 (time_p4), “I prefer to do whatever comes to my mind than stick to a plan.” (reversed) (HE01_56_r), and time spent on section 3 (time_p3). The largest absolute differences in average contributions were shown by res_psycsyn (\(\Delta _{contribution}=0.112\)), time_p5 (\(\Delta _{contribution}=0.081\)), HE01_36 (\(\Delta _{contribution}=0.054\)) and time_p3 (\(\Delta _{contribution}=0.030\)).

In contrast, variables such as time spent on section 6 (time_p6), “If I had the opportunity, I would love to attend a classical music concert.” (HE01_25) and “I would be tempted to use counterfeit money if I could be sure of getting away with it. (reversed)” (HE01_60_r) have some impact on prediction performance, including a significant average difference between correctly predicted careless respondents and correctly predicted regular respondents (Figs. 2, 3), but their contributions carry the same sign (either positive or negative) in both groups. Figure 2 demonstrates the comparison of variables’ positive contributions among correctly predicted careless respondents and regular respondents, whereas Fig. 3 demonstrates the comparison of variables’ negative contributions among correctly predicted careless respondents and regular respondents.

Figure 2
figure 2

Comparison of variables’ contributions among correctly predicted careless respondents (in red) and regular respondents (in green). Variables are ordered by average contributions of correctly predicted careless respondents (Ascending order). Only the top 10 are displayed.

Figure 3
figure 3

Comparison of variables’ contributions among correctly predicted careless respondents (in red) and regular respondents (in green). Variables are ordered by average contributions of correctly predicted careless respondents (Descending order). Only the bottom 10 are displayed.

Comparison 2: Correct/incorrect decisions of GBM model in careless respondents

The five most influential variables were “I would never take a bribe, even if it was a lot” (HE01_36) (Fig. 4), time_p5, res_psycsyn, time_p4 and time_p6 (Fig. 5). Among these, only res_psycsyn and time_p4 resulted in a significant average difference in contribution to the final prediction. The variable time_p4 was related to a negative contribution in the majority of cases, whereas res_psycsyn contributed positively in cases where careless respondents were correctly classified and negatively in cases where careless respondents were misclassified (Fig. 5).

Figure 4
figure 4

Comparison of correctly (in red) and incorrectly (in green) predicted careless respondents. Variables are ordered by average contributions of incorrectly predicted careless respondents (Ascending order). Only the top 10 are displayed.

Figure 5
figure 5

Comparison of correctly (in red) and incorrectly (in green) predicted careless respondents. Variables are ordered by average contributions of incorrectly predicted careless respondents (descending order). Only the bottom 10 are displayed.

Comparison 3: Shapley values of the first respondent with and without indices of careless responding (all, all_extracted)

The aim of this experiment was to examine the explicability of the model when additional indices of careless responding are available to explain the predictions of the GBM model. The first respondent examined belonged to the group of regular respondents and was correctly classified by the GBM model. In this use case example, we observed a female respondent, 65 years old, whose average response time was 65820.04 ms (\(95\%\) CI 56700.90–74939.18). Across the entire dataset, the average response time spent on one of the six sections was 57755.95 ms (\(95\%\) CI 55776.48–59735.43) for regular respondents, whereas for careless respondents it was approximately 14 s lower, at 43271.06 ms (\(95\%\) CI 40566.27–45975.84).

As mentioned earlier, we compared the prediction model interpretability results for this use case respondent before and after the addition of indices of careless responding. Among the ten most influential variables, even after the addition of indices of careless responding, time_p5 remained the most influential variable. The most influential variable from the set of indices of careless responding, res_psycsyn, took over the position of the second most influential variable, while the statement “I would be tempted to use counterfeit money if I could be sure of getting away with it. (reversed)” (HE01_60_r) (raw answers to question number 60) dropped to seventh place, just after time_p1, time_p6, time_p4 and the careless metric Mahalanobis distance (res_mahad). In addition, another marginal decrease was observed for the variable “If I had the opportunity, I would love to attend a classical music concert” (HE01_25), which dropped by two positions and was ranked as the eighth most important variable.

Generally speaking, considering the changes (before and after the addition of indices of careless responding) in the average contributions of specific variables among all respondents, only time_p6, “I would be tempted to use counterfeit money if I could be sure of getting away with it. (reversed)” (HE01_60_r) and “If I had the opportunity, I would love to attend a classical music concert” (HE01_25) experienced a significant decrease in contribution, while time_p3 was the only variable whose average contribution increased (Table 2). Indices of careless responding, such as res_psycsyn and irv1 (not included in Table 2, since it is present only in all_extracted), contributed positively in favour of careless respondents, whereas the time variables (time_p1, time_p4, time_p5, as well as time_p6 and time_p3) leaned towards the decision of regular respondents.

Table 2 Comparison of average Shapley values with and without indices of careless responding. Average contributions are provided with 95% CI (in square brackets). Variables with a significant difference are marked in bold.

Summary of SHAP contribution comparison

The characteristics of careless respondents and the appropriateness of variables for inclusion in the prediction model are assessed in the following paragraphs. The table of the appropriateness of variables for fitting a prediction model according to Shapley values is provided below (Table 3). Variables were ranked with plus and minus signs, where a plus sign signifies an appropriate variable and a minus sign an inappropriate one. The signs and their number are assigned depending on the topic of comparison, significance and level of contribution. The number of assigned signs is included in parentheses after the grading of each variable is explained, and in Table 3. The figures of the first two comparisons (Comparison 1 and Comparison 2) display the top and bottom 10 variables according to the average contribution. Based on that, if a variable appears among the five most influential (top or bottom) variables and the average difference between groups is not significant, such a variable is assigned three plus/minus signs. If a variable, based on the average SHAP contribution, appears among the five most influential variables but the difference between groups is significant, it is assigned two signs, while in other scenarios only a single sign. The last criterion is based on the results of the average contribution of models with and without indices of careless responding (Table 2). For that criterion, a single plus sign is assigned to a variable that displays stability, and a minus sign to a variable that displays instability after including indices of careless responding. Here we chose an inclusion criterion requiring the average contribution of either group to be at least 0.01, rounded to three decimal places. The overall level of appropriateness is then determined by merging the signs from all three comparisons, where merging one plus and one minus sign yields a neutral level of variable appropriateness. The maximum number of signs is limited to three. The overall level of appropriateness was not calculated for variables with only one rated criterion.

The study of Gramegna and Giudici43 has shown that the use of Shapley values can provide more accurate information on variable selection. In our study, the average SHAP contributions of variables, where correct decisions of the GBM model among careless respondents and regular respondents were observed (Comparison 1), showed that the more opposite the contributions of a variable are between these two groups (regular respondents and careless respondents), the more appropriate the variable might be for fitting the GBM model when it comes to distinguishing between careless respondents and regular respondents. Attributes that fall into that category are: res_psycsyn (+++), “I would never take a bribe, even if it was a lot” (HE01_36) (+++), time_p3 (+++), time_p4 (+++), time_p5 (+++), “I prefer to do whatever comes to my mind than stick to a plan (reversed).” (HE01_56_r) (+++), avgstr (+++) and irvTotal (+++), and potentially “I feel strong emotions when someone close to me leaves for an extended period of time.” (HE01_47) (+). Variables whose average contribution is among the five most influential variables and shows a significant difference between these two groups, but whose contributions are simultaneously both negative or both positive, are “If I had the opportunity, I would love to attend a classical music concert” (HE01_25) and time_p6. The remaining variables are placed below the five most influential (top/bottom) variables: res_mahad (−), “I make a lot of mistakes because I don’t think before I act (reversed)” (HE01_44_r) (−), “When it comes to physical dangers, I am very anxious.” (HE01_29) (−), “I plan ahead and organize so that there is no time pressure at the last minute.” (HE01_02) (−), “I would be tempted to use counterfeit money if I could be sure of getting away with it. (reversed)” (HE01_60_r) (−), irv1 (−), “I often push myself very hard when I am trying to achieve a goal.” (HE01_08) (−), “If I knew I would never get caught, I would be willing to steal a million.” (HE01_12_r) (−) and time_p2 (−).

The comparison of the average contributions of correct and incorrect decisions of the GBM model among careless respondents (Comparison 2) revealed attributes that the GBM depends on and that therefore potentially distort the robustness of the model. These variables can be recognized by comparing their average contributions among correctly predicted and incorrectly predicted careless respondents: the more opposite the contributions, the higher the chance that we are dealing with an inappropriate variable. Attributes that potentially fall into that category are irv1 (\(- - -\)), “I would be tempted to use counterfeit money if I could be sure of getting away with it. (reversed)” (HE01_60_r) (\(- - -\)) (Fig. 4) and res_psycsyn (\(- - -\)) (Fig. 5), potentially even res_mahad (−), “I am of the opinion that I am not popular. (reversed)” (HE01_28_r) (−), “I prefer to do whatever comes to my mind than stick to a plan (reversed).” (HE01_56_r) (−), and irvDiff (−). It is even more probable that HE01_60_r and especially irv1 are the main candidates for inappropriate variables, due to the weaker (negatively oriented) average contribution displayed in the comparison of correctly predicted careless respondents and regular respondents (see Comparison 1). The variable “I would never take a bribe, even if it was a lot” (HE01_36) (+++), as well as “If I had the opportunity, I would love to attend a classical music concert” (HE01_25) (+++), time_p5 (+++), time_p6 (+++), avgstr (++), time_p4 (++) and time_p3 (++), and less markedly “When it comes to physical dangers, I am very anxious.” (HE01_29) (+), longDiff (+), HE01_02 (+), irvTotal (+), res_evenodd (+) and time_p1 (+), displayed a stable (average) contribution over both groups of correctly and incorrectly classified careless respondents (see Comparison 2). This brings us a step closer to confirming the appropriateness of such variables for fitting a model for detecting careless respondents. The explanation is that an opposite average contribution of a specific variable might present a cause for Type II error. While res_psycsyn and HE01_56_r might be suitable for separating careless respondents from regular respondents (according to Comparison 1; see Figs. 2 and 3), it is possible that both variables increase the chance that actual careless respondents are classified as regular respondents (Comparison 2; see Figs. 4 and 5); similar logic also applies to HE01_25.

Table 3 Appropriateness of variables for fitting a prediction model according to Shapley values. The signs used are the following: (−) inappropriate, (\(+\)) appropriate, (0) unsure, where the levels can be: −−−, −−, −, 0, \(+\), \(++\) and \(+++\). More signs indicate a higher chance of a variable being appropriate/inappropriate.

Further, the impact of indices of careless responding was evaluated. On average, the addition of indices of careless responding, specifically res_psycsyn, res_mahad and irv1, did not cause any obvious inverted changes in the contributions of the existing most influential attributes (Table 2). On top of that, half of the time-related variables (time_p1 (+), time_p4 (+), time_p5 (+)) remained stable even afterward (insignificant difference in contributions); thus, relating this to Comparison 2 (Fig. 5), we can confirm the appropriateness of time_p5 and time_p4, and potentially even time_p1.

Discussion

A prediction model based on response time performed with a higher average accuracy than models built on demographic data or responses alone; however, the differences among them were not significant. While we expected that data fusion would boost prediction performance, evaluating models built on fused data did not bring any significant improvements.

Our study incorporates the results of various approaches, including the utilization of indices of careless responding as variables. The distinctions and comparisons among these approaches are detailed in Schroeders et al.2.

Variables such as the questions “I would never take a bribe, even if it was a lot” (HE01_36) and “I prefer to do whatever comes to my mind than stick to a plan.” (HE01_56), the majority of time-related variables, and indices of careless responding such as the average longstring (avgstr) and the total intra-individual response variability (irvTotal) were ranked as medium to highly appropriate for inclusion. The reason for HE01_36 and HE01_56 being ranked among these variables might lie in the content of the questions, which requires regular respondents to first read and understand them. Time-related variables capture response times, where careless respondents are expected to proceed through the questions faster than regular respondents; however, this indicator can sometimes fail, since a participant might lose concentration, get distracted, stop for a moment during the survey, or pause the survey44.

On the other hand, variables that might not bring much value to the decisions of the GBM model are “I would be tempted to use counterfeit money if I could be sure of getting away with it (reversed).” (HE01_60_r) and irv1, which, as a per-section component of irvTotal, should be avoided in favour of the total intra-individual response variability; this partially applies to the Mahalanobis distance (res_mahad) as well. While HE01_36 and HE01_56 present uncertainty behind the decision, HE01_60_r presents assurance and certainty of success, without the possibility of encountering additional problems. The ranking of res_mahad at a medium level of inappropriateness is surprising, since it is normally used for outlier detection44, but apparently the other variables contribute enough to the GBM model that the contribution of res_mahad becomes negligible.

In the case of the first respondent, as described in Comparison 3, after the addition of indices of careless responding, res_psycsyn immediately became the second most contributive variable. In that example, its value was negative and close to zero (res_psycsyn = −0.1554), which indicates a low within-person correlation between the identified item pairs. Accordingly, this index value points towards a regular respondent, yet SHAP recognized it as a variable increasing the chance of being a careless respondent (due to its positive contribution). While the same variable is considered one of the most influential variables in the comparison of average contributions among correctly predicted careless respondents and regular respondents (Comparison 1; Fig. 2), the GBM model should place more emphasis on other variables instead.

To confirm our speculations, in future work we might utilize frameworks such as Multiperturbation Shapley Analysis, which relies on game theory to estimate usefulness45, include Shapley values in state-of-the-art integrated approaches for variable selection43, or implement more complex approaches, such as the ensemble method Reciprocal Ranking18, the adaptive variable selection approach ShapHT+46 and others.

Our technique was applied to data from a single database, which is one of the study’s limitations. Consequently, the generalizability of the results to broader populations or different datasets may be limited. In addition, our analysis relied on a single prediction model, GBM. It would be interesting to apply our method using alternative models, such as logistic regression or random forest, the latter utilizing bagging ensemble learning techniques.

Conclusion

Completed questionnaires, prior to further analysis, should also undergo the process of eliminating careless respondents. In comparison with the reference study of Schroeders et al., data fusion did not bring any significant improvements in the performance of the GBM model. The use case in this study further demonstrated that even though the psychometric synonym score demonstrates an immediate impact and is built with the intention of discovering careless respondents, in combination with other variables it is not always the ideal choice to fit into a GBM model. Moreover, the variable storing the answers to the question “I would never take a bribe, even if it was a lot” (HE01_36), the average longstring (avgstr), the total intra-individual response variability (irvTotal) as well as most response times (time_p3, time_p4, time_p5) are appropriate for detecting careless respondents. On the contrary, the intra-individual response variability (irv1) and the question “I would be tempted to use counterfeit money if I could be sure of getting away with it (reversed).” (HE01_60_r) should rather be avoided.

The main contribution of this paper is the interpretation of the decisions of the prediction model using Shapley values. We also showed that although additional variables do not bring better classification performance, they can contribute to much more interpretable prediction models.