Enhancing investigative interview training using a child avatar system: a comparative study of interactive environments

The impact of investigative interviews by police and Child Protective Services (CPS) on abused children can be profound, making effective training vital. Quality in these interviews often falls short, and current training programs are insufficient in enabling adherence to best practice. We present a system for simulating an interactive environment with alleged abuse victims using a child avatar. The purpose of the system is to improve the quality of investigative interviewing by providing a realistic and engaging training experience for police and CPS personnel. We conducted a user study to assess the efficacy of four interactive platforms: VR, 2D desktop, audio, and text chat. CPS workers and child welfare students rated the quality of experience (QoE), realism, responsiveness, immersion, and flow. We also evaluated perceived learning impact, engagement in learning, self-efficacy, and alignment with best practice guidelines. Our findings indicate that VR is superior in four out of five quality aspects, with 66% of participants favoring it for immersive, realistic training. The quality of the questions posed is crucial to these interviews. Our question classification model, which distinguishes appropriate from inappropriate questions, achieved 87% balanced accuracy in providing effective feedback. Furthermore, CPS professionals demonstrated superior interview quality compared to non-professionals, independent of the platform.

sensitive context. For instance, experts may have a more nuanced understanding of the non-verbal cues and behavioral patterns of children, which can influence their assessment of the realism and effectiveness of virtual child avatars for interviews. It is essential to conduct further research to explore the feasibility and the potential benefits and risks of using VR for investigative interviews with artificial children. This motivated our comparative study to assess the QoE of expert participants in various interactive environments. The aim of our research is to investigate the possibility and the potential advantages and drawbacks of utilizing VR as a means of conducting investigative interviews with artificial children, in comparison to other interactive mediums such as 2D, audio, and text. Our study involved participation from CPS workers and child welfare psychology students, referred to as non-CPS in our paper.
The first part of the study design involves participants engaging with a child avatar for 90 seconds in various environments and sharing their feedback on the experience. ITU-T, the International Telecommunication Union's Telecommunication Standardization Sector, provides recommendations through ITU-T P.809 for conducting subjective experiments aimed at assessing the QoE of gaming services 50 . According to these recommendations, short interactive sessions are suitable for measuring overall QoE, but may not be sufficient for assessing flow and immersion in sessions lasting less than 10 to 12 minutes. We argue, however, that they could be adequate for comparing different stimuli within the same setup. In the second part of the study, we also ask participants to interact with the child avatar in just the VR environment, without any time constraints. The motivation for the second part came from the positive feedback we received in the pilot study 32 , where participants gave high ratings for the quality metrics and QoE in the VR environment. Our aim was to gather more data on the VR experience in investigative interview training and investigate any disparities between short and long interactive sessions in VR. Furthermore, we sought to confirm our hypothesis that brief interactive sessions are sufficient for comparing stimuli in different environments. Both groups, CPS and non-CPS, took part in both the first and second parts of the study, and their data were analyzed collectively.
The system architecture uses the dialogue model described in 32 with a new child persona. In addition to the VR environment in the current system architecture, for this study we added three further interactive environments (2D desktop, audio, and text). We focused on identifying the optimal interactive environment to enhance the learning outcomes of the investigative interview training system. We do not evaluate the dialogue model or the participants' ability to conduct investigative interviews during the user study. The findings of this study have implications for the training of police and CPS personnel in investigative interviewing. We aim to provide a more engaging and realistic training experience, which can lead to better learning outcomes and higher motivation for practice.
In investigative interview training, it is crucial to provide trainees with feedback that can help them improve their questioning techniques. By accurately distinguishing between recommended and not recommended questions, the training program can provide trainees with targeted feedback, enabling them to learn from their mistakes and improve their questioning techniques 51 . So, apart from conducting the user study, we introduce and validate our classification model, which can assess the quality of interviews by categorizing questions as either open-ended or closed-ended. This model will be integrated into the system and provide feedback to participants on how they performed. In this paper, we focus on validating the performance of our classification model and conducting a comparative analysis of interview quality between two groups: CPS and non-CPS. To the best of our knowledge, Haginoya et al.'s model 52 is the only relevant work with which we can compare our classification model's performance. They present an automated question classification system for simulated CSA interviews using an avatar. The system was found to work well in classifying interviewer questions, and automated interventions (feedback and modeling) were provided to improve interview quality. They achieved an accuracy of 70% with their classifier. In their study, the intervention groups showed improvement in their second interview while the no-intervention group did not. In their research, the authors utilize N-gram features in combination with an XGBoost classifier. An N-gram is a grouping of N consecutive characters, words, or phrases within a given text. For instance, if we apply bigrams to the question "tell me everything", we extract two combinations: "tell me" and "me everything", each of which represents a pair of adjacent words. In this case, we assign a frequency value of "1" to each of these bigrams to indicate their occurrence within the text and use them as features for training a machine learning model. XGBoost, an abbreviation for Extreme Gradient Boosting, is a powerful machine learning algorithm employed in both classification and regression tasks. It operates by aggregating the predictions of numerous weak classifiers to make decisions. Their methodology is constrained by its limited contextual window around a given word. This approach falls short in its ability to comprehensively understand and contextualize entire sentences.
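To make the featurization concrete, the following pure-Python sketch counts bigram occurrences in a question. This is a simplified illustration of the N-gram approach, not the implementation used by Haginoya et al.; a real pipeline would build a shared bigram vocabulary across the whole corpus before training the classifier.

```python
from collections import Counter

def bigram_features(text):
    """Count adjacent word pairs (bigrams) in a question.

    Each bigram's count serves as one feature value; real systems
    map these counts onto a fixed, corpus-wide vocabulary vector.
    """
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

# "tell me everything" yields the bigrams ("tell", "me") and
# ("me", "everything"), each with frequency 1.
features = bigram_features("tell me everything")
```

The resulting counts would then be vectorized and fed to a classifier such as XGBoost.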
While addressing the limitations of previous work, our paper aims to answer the following questions:
• How do different environments influence the users' QoE, presence, flow, realism, responsiveness, and learning experience?
• Do different environments affect the quality of the interview?
• Are 90 seconds enough to compare different stimuli and to capture the user experience for different system modalities?
In this paper, we build upon our earlier work 53 , in which we provided preliminary findings for the first part of the comparative study that we conducted. Specifically, we delve deeper into the qualitative feedback received from the study participants to gain a better understanding of their experiences. Furthermore, we introduce a feedback classification model and analyze its performance to better understand its potential applications. The main contributions of the work presented in this paper are:
• The development of a new dialogue model with a persona resembling a child that was exposed to sexual abuse.
• Extending the prototype of the child avatar in VR to three more environments: 2D, audio, and text.
• Conducting a comparative study of the four interactive environments and evaluating them on multiple quality metrics that capture user experience.
• A binary classification model that learns to recognize patterns in interview questions and categorizes them as either closed-ended or open-ended to assess the quality of interviews.

Results
In this section, we present the results of part 1 of our study, in which we compared the user experience in four different interactive environments, namely VR, 2D, audio, and text, over a 90-second period. We analyzed the participants' subjective feedback about each environment, as well as the quality of the questions asked in each environment. Furthermore, in part 2 of our study, we evaluated the participants' experience in the VR environment during an unrestricted session. Additionally, we assessed a classification model's ability to distinguish between open-ended and closed-ended questions.

Study Part-1
We use five quality assessment metrics, namely QoE, realism, responsiveness, presence, and flow, to compare the user experience in each environment. We employ the Mean Opinion Score (MOS) to gauge the comprehensive quality assessment of a system. MOS is derived by averaging the individual ratings provided by participants for each quality metric. Figure 1 shows that VR has higher average ratings for every quality metric except responsiveness. We discuss the detailed results for each metric in this section.

Quality of experience
QoE is the degree of delight or annoyance of the user of an application or service, which results from the fulfillment of the user's expectations with respect to the utility and/or enjoyment of the service 43 . Figure 1a shows participants' overall QoE ratings averaged across conditions. We observe that VR created a slightly higher overall QoE (M = 4.50, SD = 1.40), followed by 2D (M = 4.25, SD = 1.52), Text (M = 4.10, SD = 0.85) and Audio (M = 3.95, SD = 1.32). However, a one-way repeated measures analysis of variance (ANOVA), with a significance level of α = 0.05, finds no significant differences (F(3.0, 17.0) = 1.115, p = 0.371) between the MOS of overall QoE across the different environments.
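As a concrete illustration of the analysis pipeline used throughout this section, the sketch below computes per-condition MOS values and the one-way repeated-measures ANOVA F statistic in pure Python. The ratings in the usage example are hypothetical, not the study data, and statistical packages (e.g. statsmodels' AnovaRM) would normally be used instead.

```python
def mos(ratings):
    """Mean Opinion Score: the average of participants' ratings."""
    return sum(ratings) / len(ratings)

def rm_anova_f(data):
    """One-way repeated-measures ANOVA F statistic.

    `data` has one row per participant, each row holding that
    participant's rating under every condition. Between-subject
    variability is removed from the error term, as in a
    repeated-measures design. Returns (F, df_conditions, df_error).
    """
    n, k = len(data), len(data[0])        # participants, conditions
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [mos(col) for col in zip(*data)]
    subj_means = [mos(row) for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_cond - ss_subj   # residual after removing subjects
    df_cond, df_error = k - 1, (n - 1) * (k - 1)
    return (ss_cond / df_cond) / (ss_error / df_error), df_cond, df_error
```

Each column of `data` corresponds to one environment, so `cond_means` are the per-environment MOS values reported above.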

Presence
The quality of presence refers to a psychological state in which an individual feels fully immersed and engaged in an environment, as if they were physically present in that space 44,45 . To evaluate any significant effects of the environments on the sense of presence, a one-way repeated measures ANOVA is used. The result of F(3.0, 18.0) = 5.528, p = 0.007 indicates a significant effect. The post-hoc tests followed by a Bonferroni correction indicate that, as we expected, VR (M = 5.08, SD = 1.35) creates a significantly higher presence than 2D (M = 4.06, SD = 1.49, p = 0.048) and Audio (M = 3.78, SD = 1.64, p = 0.005), but surprisingly not compared to Text (M = 3.87, SD = 1.59, p = 0.070). No other pairwise comparisons were found to be significant.
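The Bonferroni correction applied in these post-hoc tests simply multiplies each raw pairwise p-value by the number of comparisons, capping the result at 1.0. A minimal sketch (the p-values below are illustrative, not those reported above):

```python
def bonferroni_adjust(pvals):
    """Bonferroni-adjust a dict of raw pairwise p-values.

    Each p-value is multiplied by the number of comparisons m and
    capped at 1.0, controlling the family-wise error rate.
    """
    m = len(pvals)
    return {pair: min(1.0, p * m) for pair, p in pvals.items()}

adjusted = bonferroni_adjust(
    {("VR", "2D"): 0.01, ("VR", "Audio"): 0.2, ("VR", "Text"): 0.5}
)
```

A pair remains significant after correction only if its adjusted p-value is still below α = 0.05.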

Realism
The realism score is the mean of three questions regarding the realism of the appearance, talking, and lip-sync of the avatar, mentioned in the methods section. Distinct questionnaires were used to assess the realism of VR and 2D in comparison to audio and text, primarily due to the visual element involved. The impact of different environments on the perceived realism of the avatar was investigated using a one-way repeated measures ANOVA; VR was rated highest (M = 4.57, SD = 1.20). The post-hoc test followed by a Bonferroni correction indicates that there is a significant difference for the VR-Audio pair (p = 0.033), but not for any other pairs.

Flow
Flow refers to a state of optimal experience in which a player is fully immersed and engaged in an interactive environment. It is a key indicator of a positive experience, as it indicates that the user is fully engaged and enjoying the experience 54 . For flow, a one-way repeated measures ANOVA indicates a significant impact of the environments (F(3.0, 17.0) = 9.020, p = 0.001). Text had a MOS of M = 2.60, SD = 1.33, Audio M = 2.78, SD = 1.28, 2D M = 2.88, SD = 1.13, and VR the highest score of M = 4.00, SD = 1.40. The post-hoc tests followed by a Bonferroni correction indicate that VR was rated significantly higher than 2D (p = 0.03), Audio (p = 0.001), and Text (p = 0.008), but there were no significant differences between any other pairs.

Responsiveness
Responsiveness describes the temporal aspects of the feedback a player receives after performing an action 55 . The quicker the response time, the more responsive the environment is considered. Moreover, the fluidity of the visual and audio feedback also contributes to responsiveness, impacting the overall experience 56 . A one-way repeated measures ANOVA of responsiveness for the avatar in each environment does not show a significant difference among them (F(3.0, 17.0) = 2.893, p = 0.064). The highest MOS was achieved by Text (M = 4.63, SD = 1.19), followed by Audio (M = 4.06, SD = 0.96), VR (M = 3.79, SD = 1.10) and 2D (M = 3.76, SD = 1.01). This can be attributed to the STT and TTS synthesis in the environments with an audio element involved.

Learning experience
Apart from questions relating to user experience, we also incorporated three questions pertaining to the learning experience, which can be found in the methods section. Figure 1b displays the average rating for these three questions. To investigate whether diverse environments could result in varying learning experiences, a one-way repeated measures ANOVA was conducted. Table 1 reveals that the different environments did not have any significant impact on communication and self-efficacy. Nevertheless, there was a noteworthy disparity in perceived user engagement during learning across the various environments. A pairwise post-hoc comparison followed by a Bonferroni correction demonstrated that VR led to significantly higher engagement compared to Audio (p = 0.022) and Text (p = 0.020), but not when compared to 2D (p = 0.85). No significant differences were detected between any other pairs.

Sensitivity analysis and effect size comparisons
We further scrutinized our user study's data through sensitivity analysis to gauge the robustness of our findings.
The sensitivity analysis underscored that, with our sample size of 21 and a desired power of 0.8, the smallest effect size (f) our study could reliably detect when comparing four different environments was approximately 0.81. Table 2 shows the observed effects for each metric, derived from the ANOVA test.
Comparing these observed effect sizes to our desired effect highlighted potential limitations of our study. The observed effect sizes for QoE and self-efficacy were considerably below the desired effect. Metrics such as presence, responsiveness, and engagement displayed effect sizes that, while meaningful, were below the threshold of what our study was optimally powered to detect. In contrast, flow demonstrated stronger effects, suggesting that our sample size was better aligned with capturing such effects. An observed effect size that is less than the desired effect size does not invalidate our findings; it does, however, suggest a higher risk of Type II errors, that is, failing to detect an effect that is present. Conducting a future study with a larger sample size can reduce this risk.
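Assuming the observed effect sizes in Table 2 follow the standard conversion from ANOVA sums of squares, Cohen's f can be derived from partial eta-squared as sketched below. This is the textbook formula, not necessarily the exact computation used for Table 2.

```python
import math

def cohens_f(ss_effect, ss_error):
    """Cohen's f from ANOVA sums of squares.

    Computes partial eta-squared (the proportion of variance
    attributable to the effect) and converts it to Cohen's f:
    f = sqrt(eta^2 / (1 - eta^2)).
    """
    eta_sq = ss_effect / (ss_effect + ss_error)
    return math.sqrt(eta_sq / (1 - eta_sq))
```

With this convention, the minimal detectable f of 0.81 corresponds to a large effect (Cohen's benchmarks place "large" at f = 0.40), which is consistent with the power limitations discussed above.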

Study Part-2
The motivation behind this part of the study was to assess the overall user experience of participants in the absence of time constraints. Additionally, it aimed to validate our design decision that a 90-second duration is sufficient for users to engage effectively with an interactive environment and provide reliable feedback. Figure 2 shows that there is no significant difference in participants' ratings of the quality aspects between the 90-second VR session and the unconstrained session in VR. Table 3 shows the results of the repeated-measures t-test between the long and short VR experience across the different quality metrics, namely QoE, realism, responsiveness, presence, and flow. No significant differences were observed in the assessment metrics when comparing the short stimulus tests to the longer VR sessions.

Performance of the classification model
We compare the performance of our proposed classification model against two baselines: a dummy model and the model of Haginoya et al. 52 . The metrics used to evaluate the models include macro F1-score, weighted F1-score, balanced accuracy, and Matthews correlation coefficient (MCC).
From Table 4a, we can observe that the dummy model, which randomly predicts labels, has the lowest performance on all metrics. Haginoya et al.'s model performs better than the dummy model, but it is not clear from the reported data how it performs relative to our proposed model, since they only reported an accuracy score, which we assume is not balanced accuracy, and the rest of the metrics are missing. Our model still outperforms Haginoya et al.'s model on the basic accuracy metric.
The performance of the proposed classification model was further evaluated for each class, as shown in Table 4b. Overall, the proposed model's performance is good, but there is room for improvement in predicting the closed class. We are convinced that transforming it into a multi-label task will enhance generalization and reduce false predictions. According to Lamb et al. 51 , the inter-coder reliability rate among experts for coding investigative interviews is approximately 85%. While our classification model outperforms other automated coding methodologies, as stated in Haginoya et al.'s research 52 , our goal is to achieve expert-level precision. The model's performance, as indicated by the metrics we used, is quite promising, with a significant improvement over the dummy model and potentially better performance than Haginoya et al.'s model. Despite the stepped appearance of the precision-recall curve (PRC) in Figure 3a, which suggests some limitations in the model's ability to distinguish between classes or a lack of diverse confidence score distributions, the overall metrics demonstrate good performance. The model has high precision, recall, and F1-scores for both classes, indicating a good balance between correctly identifying true positives and minimizing false positives. Additionally, the balanced accuracy and MCC values further support the model's ability to classify both the positive and negative classes, which is particularly relevant given the class imbalance in the dataset.
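The two summary metrics emphasized here, balanced accuracy and MCC, can be computed directly from the binary confusion counts, as the pure-Python sketch below shows. The counts in the test are hypothetical, not those of our model; in practice one would use sklearn.metrics.

```python
import math

def balanced_accuracy(tp, fp, tn, fn):
    """Mean of the per-class recalls, robust to class imbalance."""
    sensitivity = tp / (tp + fn)   # recall on the open-ended class
    specificity = tn / (tn + fp)   # recall on the closed-ended class
    return (sensitivity + specificity) / 2

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for binary classification.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction); defined as 0 when the denominator
    vanishes.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

Because both metrics weight the minority (closed) class as heavily as the majority class, they penalize the per-class weaknesses that plain accuracy hides.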

Quality of interviews
Table 5 presents the average number of questions asked in each environment. The average number of turns was only 4.5 in the text environment and 29.7 in the long VR session; as outliers, both were excluded from the following analyses, which compare the quality of the interviewers' questions across environments. Figure 4a shows the percentage of open-ended questions asked by CPS and non-CPS participants in each environment. Overall, 64% of the questions asked by CPS participants were open-ended, compared to 48% for non-CPS participants. As expected, this shows that experts conduct higher-quality interviews. However, the proportion of open-ended questions among CPS participants could still have been higher.
Furthermore, there was no difference in interview quality between environments, even when split by CPS versus non-CPS, which suggests that the higher cognitive load created in environments with visual elements does not distract users from conducting interviews. Figure 4b shows no significant impact of the order in which participants interacted with the environments on the quality of the interviews.
A chi-square test of independence was performed to evaluate the relationship between the training environment (2D, audio, VR short, and VR long) and question type (open-ended versus closed-ended questions). The relationship between these variables was not significant, χ²(3) = 0.143, p = 0.986. Thus, the training environment did not affect the quality of the interviews.
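The chi-square statistic reported above can be computed from the environment-by-question-type contingency table; a minimal pure-Python sketch follows (the tables in the test are hypothetical; scipy.stats.chi2_contingency would normally be used and also returns the p-value).

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an
    r x c contingency table (rows: environments, columns: question
    types). Assumes every expected count is positive.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # Sum of (observed - expected)^2 / expected over every cell,
    # where expected = row_total * column_total / grand_total.
    stat = sum(
        (obs - r * c / total) ** 2 / (r * c / total)
        for row, r in zip(table, row_sums)
        for obs, c in zip(row, col_sums)
    )
    df = (len(table) - 1) * (len(col_sums) - 1)
    return stat, df
```

With four environments and two question types, df = (4 - 1)(2 - 1) = 3, matching the χ²(3) reported above.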

Qualitative results
A 4-item post-test questionnaire was utilised, as shown in Table 9, followed by an open question asking participants to provide the reasons behind their selection. A top-down coding approach was utilised to analyse the post-test questionnaire, as shown in Table 6. Many of the participants selected VR as their preferred environment: 76% of the participants perceived VR as the most realistic environment, 66% preferred VR over the other environments, and 61% agreed that VR created the best experience for them. Participants with previous VR experience showed a preference for VR, with 80% rating VR favourably in the post-test questionnaire. However, 75% of the participants with no VR experience also preferred VR over the other environments, and 60% rated VR as the environment that they liked, preferred, and that created the best experience. Sentiment towards environment addressed the participants' feelings about using the training program in the 2D, audio, text, and VR environments. Sentiments were categorized as positive or negative and were examined by whether the participant worked within CPS or not. When discussing participants' feedback, we add extra text in '[ ]' to their responses to adjust the tense and make the feedback grammatically correct. CPS participants expressed only negative sentiments regarding the audio environment. Participants noted that audio "felt strange as [they] mostly talk to children face-to-face", that "it [was] difficult to read body language" and that it "miss[es] the non-verbal communication". Regarding the text environment, CPS participants were generally positive, stating that it felt familiar and that it "can be a good way to practice, as there will be many times when you also work on chat". However, a non-CPS participant felt that "it [was] difficult to interact with the avatar in text". A few participants expressed that they preferred the 2D environment, as it presents both the visuals and the body language of the child, and they are used to working with PCs. CPS participants had positive experiences with 2D, with one participant expressing that "2D felt more familiar" and was "even better than audio, because it's better to see the child [they were] talking to". The VR environment generated the most positive responses, with many participants expressing that the realism of the avatar made it feel easier to talk to the child. However, one CPS participant stated that this environment "[did] not really reflect [their] experience with having these kinds of conversations...because it seem[ed] a bit scripted...like you have to have the right kind of question to get a reply".
Participants also reflected on the Realism of the interactive training program. Most participants stated that VR created the most realistic environment compared to the other environments. One participant stated that VR "made [them] feel like [they were] in the same room as Hillary" and that the realism made it "easier to talk to the child". However, one participant noted that despite the realism, they "didn't connect with Hillary". These views were similar with regard to the 2D environment, with participants agreeing that it felt realistic, but not as realistic as VR. Regarding the audio and text environments, participants stated that these "environments did not feel as real" compared to the other environments. A few participants expressed that although text is not as realistic as VR, it can be a better environment for training as they are used to this way of communication. One participant noted that although "text only was not very realistic, [it] can work when training". Overall, the participants' descriptions gave the impression that realism is an important aspect of an interactive training program.
Overall, the participants responded quite positively to their experiences in this study. One participant "believe[d] this way to practice is a good option [for] training in the school setting". However, some participants had difficulties using the program; one participant stated that "it would have been much easier if everything was in Norwegian" and that "it took some time to formulate and write in English". Only one participant reported a negative overall experience, stating that it was "frustrating all together, the system did not give me any responses that [were] useful in VR, audio, and 2D".

Discussion
Our study aimed to compare the quality and learning aspects experienced by experts in four different environments. Interactions lasted 90 seconds in each environment. We observed that audio-based environments had more conversational turns than the text environment, as participants typed more slowly than they spoke. Despite a consistent dialogue model being used in all environments, participants' experiences were influenced by the responses they received from the model. Environments with fewer responses limited the information available to participants, potentially affecting the richness and depth of interactions and, consequently, the results of the study. Responses from the avatar were shaped by the questions asked, and because the questions themselves could vary, they also had the potential to influence participants' experiences across the different environments. It is worth noting that although the conversation quality may differ, we did not evaluate the dialogue model in this study.
Our findings indicate that VR was rated higher in four out of the five quality assessments used for comparative analysis. However, there was no significant difference in the overall rated QoE across environments. As anticipated, participants rated VR highest in terms of presence, indicating a greater sense of being part of the environment compared to the other options. VR also generated a significantly greater flow for users, creating an immersive environment that caused participants to lose track of time and forget their surroundings. These quality aspects may enhance users' willingness to engage, potentially leading to longer practice sessions with the child avatar system. When examining the quantitative analysis for realism, the only noteworthy variation was found between the VR-audio pair. The text environment was rated highest in responsiveness due to the lack of delay from TTS and STT synthesis and because participants were in control of when to send their messages. In addition, as shown in Figure 1b, all three metrics measuring the learning experience have a relatively higher mean for VR. Of particular significance is the higher user engagement for VR compared to the audio and text environments. Mulqueeny et al. 57 highlight the importance of engagement in achieving long-term learning and outcomes. With significantly greater engagement and the highest average values across all learning experience metrics, VR may lead to better learning and knowledge transfer. The majority of participants preferred VR as their interactive environment for the interview training system, which was likely influenced by the fact that most users were using VR for the first time and found it fascinating. Nevertheless, it is premature to assert that VR is the superior environment at this stage. While VR offers better levels of realism, presence, and flow, about 40% of the participants preferred other environments due to familiarity, accessibility, and regularity of use. Moreover, it is important to note the composition of our study sample, which comprised two different groups: child welfare students and CPS professionals. We found both groups to be well-suited for participation in our study due to their shared knowledge, training, and experience in conducting interviews with children. Recruiting participants with relevant experience was a crucial and difficult task. A closer look at our results through sensitivity analysis revealed an important caveat: while our findings are significant and provide valuable insights into the various metrics, the study might not have been optimally powered to detect certain nuanced effects, especially for specific metrics. This is why we aim to continue our research with more participants to get a clearer picture. Additionally, future studies should look more closely at each group, which could reveal whether different levels of expertise lead to different results. As we continue our research, these insights will be essential to improving our future studies.
The quality of the questions plays a crucial role in determining the effectiveness of an interview. Open-ended questions tend to elicit more informative responses, while closed-ended questions can yield potentially inaccurate answers. A system that accurately classifies questions can significantly impact the quality of interview training by providing effective feedback. Distinguishing between recommended and not recommended questions is essential for delivering high-quality feedback.
Achieving a balanced accuracy of 87% in classifying these two categories is a notable accomplishment, surpassing the state-of-the-art results of 72% reported by Haginoya et al. 52 . Although the model has a weighted average F1-score of 0.87, there is room for improvement in its ability to predict closed questions correctly. Reducing the original 15 question types to two fundamental categories when converting the multi-class task to a binary classification task may have affected the results. Future work could focus on refining the model's performance in classifying closed questions to further enhance the interview training experience. Furthermore, to accurately measure the learning impact of the system, a more extensive study is needed in which each participant conducts multiple interviews with a child. This approach would allow us to assess the quality of the interviews over time, similar to the pattern observed in the studies by Pomedda et al. 28,30 .
When the classification model was applied to the conversational data from the study, the results revealed that the interviews conducted by both CPS professionals and non-CPS participants were of satisfactory quality, with a rather high percentage of open-ended questions. However, the results also indicate that the interviewers relied on a high frequency of closed-ended questions, which runs counter to best practice standards. Closed-ended questions should be minimized, as children often respond to them with little information. In addition, these questions tend to be suggestive, prone to response biases, and problematic to use in investigative interviews 58 . Nevertheless, after examining the data, we observed that the quality of interviews conducted by CPS professionals was comparatively better than those conducted by non-CPS participants. Participants performed consistently across the different settings, indicating that the interview environment may not have a significant impact on their overall performance. Further research with more participants is necessary to determine the most suitable environment for interview training, particularly in terms of QoE and learning effect.
Future research should prioritize improving the various quality aspects of our child avatar system to provide a more effective learning platform for interview trainees. Our objective is to develop a flexible and efficient interview training system that is conducive to effective learning. Our future work will mainly focus on enhancing the dialogue model to generate more realistic responses by leveraging the advances made in large language models (LLMs), in conjunction with the use of more realistic-looking avatars generated using neural radiance fields (NeRF) or generative adversarial networks (GANs). In addition, we will also focus on improving the quality of feedback and exploring how feedback can be effectively utilized to enhance the overall learning experience.

System architecture
In this section, we discuss the overall architecture of the interactive child avatar system used for conducting a comparative study with CPS professionals and students, shown in Figure 5. The system consists of three main components: (i) the language component, which comprises a dialogue model that we developed using the RASA59 framework and a classification model that enables us to evaluate the quality of interviews; (ii) the speech synthesis component, which utilizes IBM Watson services to provide speech-to-text (STT) and text-to-speech (TTS) synthesis60; and (iii) the front-end of the system, which offers four different interactive environments: a VR environment created using Unity and the Oculus Quest 2, 2D and audio environments also developed in Unity, and a text-based environment created using the RasaX platform.
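At a high level, a single interaction turn flows through the three components in sequence: STT, dialogue model, TTS. The sketch below illustrates that control flow only; `transcribe`, `generate_response`, and `synthesize` are hypothetical stubs standing in for the IBM Watson STT, RASA dialogue, and Watson TTS calls, not the actual system code:

```python
def transcribe(audio: bytes) -> str:
    """Stub standing in for the IBM Watson speech-to-text service."""
    return "what happened at janet's house?"

def generate_response(question: str) -> str:
    """Stub standing in for the RASA dialogue model."""
    return "we played in the garden with pop."

def synthesize(text: str) -> bytes:
    """Stub standing in for the IBM Watson text-to-speech service."""
    return text.encode("utf-8")

def interaction_turn(interviewer_audio: bytes) -> bytes:
    # (i) interviewer speech -> text, (ii) dialogue model reply,
    # (iii) reply text -> child avatar speech
    question = transcribe(interviewer_audio)
    reply = generate_response(question)
    return synthesize(reply)
```

Regardless of the front-end (VR, 2D, or audio), the same turn loop runs against the same back-end; the text-based environment simply bypasses steps (i) and (iii).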

Language
The language component of the system is designed to analyze a child's conversation and produce realistic and suitable responses to an interviewer's questions. Furthermore, a classification model offers valuable feedback on the quality of interviews and assists in enhancing interview skills.

Dataset
The dataset consists of transcripts from interviews conducted by professionals according to specific guidelines, as part of a training program at the Centre for Investigative Interviewing at Griffith University, Australia61. The transcripts consist of simulated dialogues between professional actors portraying 5-7-year-old children and an interviewer, who could be from CPS or law enforcement. These conversations aim to replicate an investigative interview with a child who may have been a victim of abuse.

Dialogue model
The current dialogue model is specifically designed to apply the dialogical communication method (DCM)62, an investigative interviewing methodology currently employed in Norway. The Norwegian model is based on the research-based protocol developed by Michael Lamb and his colleagues at the National Institute of Child Health and Human Development (NICHD)13. The dialogue model was developed using the open-source RASA framework, which runs on a TensorFlow back-end. It has a modular structure in which deep learning models or rule-based techniques are employed to set up a complete dialogue model. Each module accepts data in a predefined format, which requires considerable pre-processing and data augmentation work for training. The natural-language understanding (NLU) module uses the Dual Intent and Entity Transformer (DIET)63 architecture to model intent and entity recognition as a joint task. It processes the input text and predicts the best possible intent from the training data along with the entities in that input. The rule module is used to model fallback strategies or to control the behaviour of the dialogue model where necessary. According to the rules defined, a specific action is performed when a certain intent is recognized; these actions are defined in the action module. The story module controls the evolution and flow of the conversation based on the questions asked, using slot values defined in the NLU module or specific intents. If a user asks a leading question, the child avatar answers with a vague or unproductive reply with a certain probability. This is designed to reinforce the need to ask open-ended questions and to communicate negative feedback to the user. Table 7 presents an excerpt of a participant's conversation during the study, with the participant's name removed for privacy. The conversation shows that the interviewer started with an introduction and rapport building and then transitioned into the substantive phase. Table 7 does not include the complete substantive phase, in which the child later starts talking about the alleged abuse.
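The probabilistic fallback for leading questions described above can be sketched as follows. This is a simplified illustration, not the actual RASA custom action; the probability value and the reply strings are hypothetical:

```python
import random

VAGUE_REPLIES = ["i don't know.", "hmm.", "maybe."]
PRODUCTIVE_REPLY = "pop touched me when we were in the pool."

def answer_leading_question(rng: random.Random, p_vague: float = 0.7) -> str:
    """With probability p_vague the avatar gives an unproductive reply,
    signalling to the trainee that leading questions are ineffective."""
    if rng.random() < p_vague:
        return rng.choice(VAGUE_REPLIES)
    return PRODUCTIVE_REPLY
```

Because the reply is stochastic rather than always unproductive, trainees experience the same diminishing returns from leading questions that best practice guidelines describe for real interviews.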

Front-end
We used the system developed for our pilot study as the baseline32. The VR environment runs through an Oculus Quest 2 as a front-end. Three more interactive environments have been added to the original VR-based system, so users can interact with the child avatar in four different environments. Regardless of the environment (front-end), the system's back-end uses the same dialogue model, which means that the responses generated by the virtual child remain consistent across environments. The front-end of the child avatar system was developed in the Unity game engine for all environments except the text-only one. RASA-X is used to provide a web interface for the text-only environment, where users have a textual conversation with the child. The 2D environment runs on a PC with a 24-inch desktop monitor, and the audio-only environment is the same as the 2D environment except that no visual elements are involved. The same child avatar is used in all environments.
• We affirm our ethical commitment to ensure legality, fairness, and transparency (Article 5.1 a) by providing clear information and obtaining consent for data processing. Before beginning, participants were informed about the nature of their conversation with the virtual child and the potential adverse effects of prolonged VR use, such as dizziness or cyber-sickness. They signed a consent form allowing their feedback to be stored for research purposes, with the option to withdraw consent at any time in the future. The entire study took approximately 40 minutes for participants to complete. Participants were given a 200 Norwegian Kroner gift card for their voluntary participation.
Study participants were recruited in April 2022. The recruitment specifically targeted two groups: students specializing in child welfare psychology at Oslo Metropolitan University (OsloMet), and professionals currently employed in Child Protective Services (CPS). Information about the study was disseminated to these demographics to ensure a comprehensive representation of individuals with relevant knowledge and experience in child welfare. The study was executed from April to May 2022, specifically during weeks 17 through 19. All participants were invited to the SimulaMet research facilities, where they took part in the study. The data collected from the user study underwent an initial processing phase in July 2022, contributing to our preliminary work53. An additional round of data processing was performed in December 2022 and January 2023 to facilitate the current extension of our work.

Storyline
To choose a storyline for the experiment, the interview transcripts were categorized into different personas. The persona selected for this study was determined by the abundance of transcripts and the extent of information that could be used to generate dialogues of sufficient length. The chosen persona was named Hillary and had the following background: Hillary is a 6-year-old girl whose cognitive, emotional, social, and physical development appears to be typical for her age. She has several close friends at school, including one named Janet. Hillary lives with her biological parents and a younger brother. She has just started her first year of school and appears to have adapted well. Occasionally, she visits Janet at her home for play dates. Janet lives with her biological parents and her grandfather, whom the children refer to as "Pop." If Janet's parents are still at work, Pop usually takes care of the children. During these visits, Hillary and Janet usually play indoors and outdoors in the garden until Hillary's parents come to pick her up and take her home. A few days ago, while Hillary was in the car with her mother after a visit with Janet, she told her mother that Pop did something inappropriate to her and touched her private parts while they were in the pool in the garden. As a result, Hillary has come to the interview to discuss this incident.

Demographic of the participants
In this research, a total of 21 participants took part: 20 females and one male. Their ages ranged from 20 to 59 years, with a mean age of 31.2 years (SD = 9.7). To begin, they were asked to rate their familiarity with interviewing children and with VR. Approximately 60% of the participants had one or more years of experience as CPS professionals. Nonetheless, despite their expertise in CPS, most of them (75%) had no prior experience with VR. Each participant interacted with the child avatar in all environments, and none of them reported experiencing any form of motion sickness while using the prototype.

Questionnaire
In addition to a question measuring the overall user QoE, we included three questions regarding responsiveness from the GIQS68, three questions regarding the realism of the avatar inspired by the work of Wilson et al.69, two questions measuring flow from the GEQ70, three questions on presence from the IPQ71, and three questions measuring learning effects, engagement in learning, and self-efficacy, as follows.
• Communication skills: using this system aids me in acquiring knowledge and skills in interviewing maltreated children.
• Engagement: using this system would increase my engagement in the learning process.
• Self-efficacy: using this system can improve my self-efficacy.
The questionnaire was answered on a 7-point Likert scale. The post-questionnaire, shown in Table 9, measures the post-experience and the users' preferred environment.
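Ratings on the 7-point scale were aggregated per environment as a mean opinion score (MOS) with its standard deviation, as reported in the Results. A minimal sketch using Python's standard library (the ratings shown are illustrative, not study data):

```python
from statistics import mean, stdev

def mos(ratings):
    """Mean opinion score and sample standard deviation of Likert ratings (1-7)."""
    return mean(ratings), stdev(ratings)

# Illustrative ratings from five hypothetical participants
m, sd = mos([4, 5, 3, 6, 4])
print(round(m, 2), round(sd, 2))  # 4.4 1.14
```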
The result F(3.0, 17.0) = 3.532, p = 0.037 indicates a significant difference. The Text environment had a mean opinion score (MOS) of M = 4.10 (SD = 1.07), Audio had M = 3.75 (SD = 1.45), 2D had M = 4.35 (SD = 1.18), and VR scored the highest.
(a) Bar-plot (95% confidence interval) of means showing quality assessments for each environment. (b) Bar-plot (95% confidence interval) of means showing learning experience assessments for each environment.
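The F statistic reported above comes from an ANOVA comparing mean scores across the four environments. As a simplified illustration of how such an F value arises, the sketch below computes a one-way between-groups F statistic in pure Python (the data are illustrative, and the study's actual analysis used a repeated-measures design):

```python
def one_way_anova_f(groups):
    """F = MS_between / MS_within for k independent groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-groups sum of squares: group sizes times squared mean deviations
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: deviations from each group's own mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Illustrative ratings for three environments
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```

The resulting F is compared against the F distribution with (k - 1, n - k) degrees of freedom to obtain the p value.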
Table 4b presents a per-class analysis of our model's performance on the test data. The classes are closed and open questions, and the metrics used to evaluate performance are precision, recall, and F1-score. From Table 4b, we can observe that the proposed model performs better for the open class than the closed class, as indicated by higher precision, recall, and F1-score values for the open class. The model's precision for the closed class is lower than that for the open class, which means there are more false positive predictions for the closed class; its recall for the closed class is also lower, which means there are more false negative predictions for the closed class. Figure 3b illustrates that false negatives (FN) are more frequent for closed-ended questions than open-ended ones.
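Per-class precision, recall, and F1 follow directly from the confusion matrix in Figure 3b. A sketch with an illustrative 2x2 matrix (the counts are hypothetical, not our test-set numbers):

```python
def per_class_metrics(cm, labels):
    """cm[i][j] = count of true class i predicted as class j."""
    out = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(len(labels))) - tp  # predicted i, wrongly
        fn = sum(cm[i]) - tp                                 # true i, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[label] = (precision, recall, f1)
    return out

# Rows/columns ordered [open, closed]; hypothetical counts
metrics = per_class_metrics([[90, 10], [15, 35]], ["open", "closed"])
```

With these counts, recall is 0.9 for open but only 0.7 for closed, mirroring the pattern of more false negatives on closed-ended questions described above.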

(a) The Precision-Recall (PR) curve: representation of the trade-off between the precision and recall of the classification model. (b) Confusion matrix: a visual representation of the performance of a classification model, showing the distribution of predicted and actual class labels.

Figure 3. Qualitative analysis of classification model performance.

(a) Quality of questions asked for each environment for the two groups. (b) Distribution of open and closed questions asked.

Figure 5. Architecture of the child avatar system for our comparative study with CPS professionals and students. (i) The language component incorporates a dialogue model that we developed using the RASA framework, which was trained on mock interviews, and a classification model for evaluating the quality of the interviews. (ii) The speech synthesis component utilizes IBM Watson services for speech-to-text and text-to-speech synthesis. (iii) The front-end of the system provides four interactive environments: a VR environment created using Unity and the Oculus Quest 2, a 2D environment with audio also developed using Unity, and a text-based environment created using the RasaX platform.

Table 1. The results of one-way ANOVA tests comparing the means of learning experience aspects in different environments. The mean difference is significant at the .05 level.

Table 2. Partial eta squared (η²p) and Cohen's f score calculated using the sum of squares of effects (SS_effect) and sum of squares of error (SS_error) from the ANOVA tests.
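Both effect sizes in Table 2 are simple functions of the ANOVA sums of squares: partial eta squared is η²p = SS_effect / (SS_effect + SS_error), and Cohen's f = sqrt(η²p / (1 - η²p)). A minimal sketch (the SS values below are illustrative):

```python
import math

def partial_eta_squared(ss_effect, ss_error):
    """Proportion of variance attributable to the effect."""
    return ss_effect / (ss_effect + ss_error)

def cohens_f(ss_effect, ss_error):
    """Cohen's f derived from partial eta squared."""
    eta2 = partial_eta_squared(ss_effect, ss_error)
    return math.sqrt(eta2 / (1 - eta2))

print(partial_eta_squared(1.0, 3.0))  # 0.25
print(round(cohens_f(1.0, 3.0), 3))   # 0.577
```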

Table 3. The results of one-way ANOVA tests comparing the means of learning experience aspects in different environments. The mean difference is significant at the .05 level.

Table 4. Evaluation of the classification model: test data classification report. Best results are highlighted in bold. "?" indicates that the information is not available.

Table 5. The average number of questions asked in the 90-second interviews for each environment.

Table 6. The results of the user study, showing the number of votes in response to each question.

Table 9. List of post-questionnaire items. Q1: Which environment felt more realistic for you? Q2: Which environment do you like the most? Q3: Which environment do you prefer to use? Q4: Which environment creates the best experience for you?