The generation and utilization of research knowledge can play a vital role in addressing inequities within the education system (Farley-Ripple et al., 2018; Honig & Coburn, 2008; Denaro et al., 2022); however, for many reasons, it often does not play this role (Malin et al., 2020). This realization has resulted in an expanding “knowledge field” which seeks to better understand how research evidence could have a greater bearing on policy and practice decisions (Lockton et al., 2022; Rycroft-Smith, 2022). Researchers in the “knowledge field” have used many terms to describe the set of functions and processes through which research evidence is produced, shared, and used by members of the research, practice, and policy communities. For this project, we draw on the concept paper by Shaxson et al. (2012), which introduced the term K* to describe the “set of functions and processes at the interfaces between knowledge, practice, and policy” (p. 2). In other words, K* is focused on connecting researchers and their work to organizations and communities outside of academia so that research is useful, useable, and utilized. K* is part of a broader semantic cluster that includes the ideas of ‘knowledge brokering,’ ‘boundary spanning,’ ‘knowledge mobilization,’ ‘knowledge translation,’ ‘knowledge exchange,’ ‘knowledge extension,’ ‘engaged scholarship,’ and ‘dissemination and implementation.’ Definitions of these terms are provided in Supplemental File A. Shaxson and colleagues described K* as a ‘catch-all’ term, intended to represent the wide range of concepts within the larger cluster of terms.

Work is being done in multiple areas to create an environment conducive to K*, including funding research on the topic, supporting interactions between researchers and research users, developing policies that mandate open-access publishing of research findings, developing networks to serve the research needs of practice-based organizations, and building the capacity of individuals to promote and enable K*. Although it is beyond the scope of this paper to provide a detailed summation of the approaches being taken to improve the production, sharing, and use of research knowledge, we refer readers to the reviews by Fahim et al. (2023) and Walter et al. (2005), which provide a more thorough description of the topic.

Findings from Fahim et al. (2023) and Walter et al. (2005) suggest multi-pronged approaches are needed to promote research use. One important strategy is the development of individuals’ knowledge, skills, and confidence to promote and enable K* (Holmes et al., 2014; Mishra et al., 2011; Tabak et al., 2017; Halsall et al., 2022), as capacity building may play a crucial role in predisposing change more broadly (Davis & Davis, 2009; Golhasany & Harvey, 2023). As such, more funding has been called for (Cooper et al., 2018; Phipps et al., 2016; Georgalakis & Rose, 2021) and invested to strengthen individual capacity in this area (e.g., Holmes et al., 2012; Garritzmann et al., 2023). CREATEd—Collaboration, Research Equity, and Action Together—is a program designed to prepare individuals to promote strong, equitable relationships among the research and practice communities. In part, our work consists of offering a year-long fellowship centered around developing individual capacity to facilitate the exchange of knowledge among researchers and research users to support evidence-informed and equity-centered policy and practice. The fellowship consists of online modules, live workshops, and opportunities for fellows to apply their learning.

Along with investment in K* arises a need for evaluations to provide information for funders and program managers to determine if a program should continue, improve, end, or scale up, thereby ensuring the efficient and effective allocation of resources (Rycroft-Smith, 2022; Hartling et al., 2021). As such, CREATEd has embedded evaluation into all our activities with the goal of capturing lessons to continuously improve our work and to document and measure progress toward achieving our goals. Using evaluation methods (Matthews & Simpson, 2020) and metrics (Barwick et al., 2020) can indicate whether training goals were achieved. However, there is no consensus regarding which outcome indicators or evaluation methods to use. To help inform our evaluation, as well as contribute to the literature, we conducted a scoping review to synthesize information on the methods and outcome indicators used to evaluate related training programs and to identify areas for improvement in current training evaluation approaches.


This review was based on Arksey and O’Malley’s (2005) systematic scoping review methodological framework, which consists of five stages: (1) identifying the research question, (2) identifying relevant studies, (3) study selection, (4) charting the data, and (5) collating, summarizing, and reporting the results.

Identifying the research question

The population of interest for this review was K* professionals who work to connect people and ideas across research, practice, and policy communities. We included practitioners, policymakers, researchers (and current graduate students), community members, and knowledge brokers in our definition of K* professionals. The intervention of interest was any training or capacity building activity related to K*. As the purpose of this study is to identify the methods and outcomes used for training evaluations, study design and outcomes of the included studies were left intentionally broad.

Identifying relevant studies

We searched for articles published before August 2022 in eight multidisciplinary electronic databases: ProQuest, ScienceDirect, JSTOR, EBSCO, PubMed, Web of Science, Academic OneFile, and Scopus. In addition, we searched for relevant gray literature in Google Scholar. Finally, we reviewed reference lists of included studies and relevant reviews to identify additional articles. The search terms are listed in Supplemental File B.

Study selection

All citations were imported into Excel and duplicate citations were removed manually. A two-stage screening process for eligibility was conducted. Articles were eligible for inclusion if they met each component of the inclusion criteria and did not have any criteria for exclusion (Table 1). Studies were not excluded based on year of publication, country of publication, type of publication, field of publication, or quality of publication.

Table 1 Inclusion and exclusion criteria for scoping review.

We retrieved a total of 1297 citations from the systematic searches of the eight multidisciplinary databases, the Google Scholar search engine, and review of reference lists. After removal of the duplicates, 824 articles were independently screened by three researchers based on their titles and abstracts. Results from the researchers were compared and discrepancies were discussed to arrive at agreement. In instances where the researchers disagreed about whether a specific article should advance to the next stage, the majority decision prevailed. After this first round of screening, a total of 127 resources proceeded to full-text review. Disagreements among reviewers in the full-text screening phase were reconciled by discussion and consensus. Resources that could not be obtained for full-text review through online databases, library searches, or direct contact with the study’s first author were excluded from the final analysis. In addition, as described in Table 1 above, literature reviews were excluded from the scoping review; however, the reference lists of reviews were scanned to identify eligible studies. Furthermore, the reference lists of all eligible studies were reviewed to identify additional sources for inclusion. In total, 47 documents met our criteria for inclusion in the scoping review (Fig. 1). For a full list of the included publications, see Supplemental File C.

Fig. 1: PRISMA diagram.

The PRISMA diagram depicts the flow of information through the distinct phases of the scoping review. It maps out the number of records identified, included and excluded, and the reasons for exclusions.
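The majority rule used in the title and abstract screening stage can be illustrated with a minimal sketch. The record identifiers and votes below are hypothetical and are not drawn from the review's screening data; the sketch only assumes that each of the three reviewers records a binary include/exclude decision per record.

```python
from collections import Counter

def majority_decision(votes):
    """Return the decision chosen by most reviewers.

    Assumes an odd number of binary votes so that a tie cannot occur.
    """
    if len(votes) % 2 == 0:
        raise ValueError("an odd number of reviewers is required to avoid ties")
    decision, _count = Counter(votes).most_common(1)[0]
    return decision

# Hypothetical title/abstract votes from three reviewers (illustrative only).
votes_by_record = {
    "record_a": ["include", "include", "exclude"],
    "record_b": ["exclude", "exclude", "exclude"],
    "record_c": ["include", "exclude", "include"],
}

# Records that would advance to full-text review under the majority rule.
advanced = [record for record, votes in votes_by_record.items()
            if majority_decision(votes) == "include"]
print(advanced)
```

With three reviewers and two possible decisions, a majority always exists, which is why this rule was workable at the first stage; the full-text stage relied on discussion and consensus instead.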

Extracting and analyzing data

We employed scientometric and content analysis to examine the included resources. The scientometric analysis provides a comprehensive overview of the included studies by visualizing the relationships among articles, journals, keywords, citations, and co-citation networks (Chen & Song, 2019). VOSviewer version 1.6.19 was used to conduct the scientometric analysis (Van Eck & Waltman, 2010). Several parameters, including publication evolution over time, citation analysis for core publications, co-authorship analysis, bibliographic coupling analysis, and co-occurrence analysis, were used to map the K* evaluation literature included in this review (Chen & Song, 2019). To view the settings for each of the analyses run in VOSviewer, see Supplemental File D.

We also conducted content analysis to provide more in-depth insights related to the methodological characteristics, outcomes assessed, and future evaluation approaches proposed by the authors of the included studies. We extracted relevant information from resources included in the final analysis using a standardized data extraction template. Supplemental File E summarizes the data extracted and the definitions used for categorizing data. Data extraction was performed by two researchers, and a third researcher checked the workflow for completeness and accuracy. Disagreements were resolved by consensus. Methodological information extracted included the type of evaluation conducted, methodological approach used, type of design (one- versus two-group), sample size, data collection techniques, and timeline of data collection. We categorized training outcomes based on the Kirkpatrick model. The Kirkpatrick model was first developed in the 1950s to evaluate the effectiveness of training and educational programs and is still the most widely applied model to date (Alsalamah & Callinan, 2021b). It can be applied to any style of training, both formal and informal, to determine the efficacy of a training program based on four levels. The levels, in order, are reaction, learning, behavior, and results. Each successive level of the model represents a more precise measure of the effectiveness of a training program. In the first level, the reactions of trainees are explored. This includes measuring the extent to which learners found the training to be relevant, engaging, useful, and enjoyable. In level two, the learning of participants is examined to understand whether learners acquired the intended knowledge, skills, attitude, confidence, and commitment because of the training. Level three focuses on the behavior of trainees after completion of the training. In this level, evaluators focus on measuring whether learners change their behaviors because of the training. Level four of the model looks at the downstream results or impacts that occur because of the training (Kirkpatrick & Kirkpatrick, 2006). Finally, future evaluation approaches proposed by the authors of the included studies were extracted and inductively analyzed to identify themes and characteristics.
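To make the categorization step concrete, the following sketch shows one way outcome indicators could be coded against the four Kirkpatrick levels. It is an illustration only: the indicator labels and the `INDICATOR_LEVELS` mapping are hypothetical and do not reproduce the extraction template in Supplemental File E.

```python
from enum import IntEnum

class KirkpatrickLevel(IntEnum):
    """The four levels of the Kirkpatrick model, in order."""
    REACTION = 1
    LEARNING = 2
    BEHAVIOR = 3
    RESULTS = 4

# Hypothetical coding scheme (assumption): indicator label -> level.
INDICATOR_LEVELS = {
    "satisfaction": KirkpatrickLevel.REACTION,
    "relevance": KirkpatrickLevel.REACTION,
    "knowledge": KirkpatrickLevel.LEARNING,
    "confidence": KirkpatrickLevel.LEARNING,
    "behavior change": KirkpatrickLevel.BEHAVIOR,
    "organizational impact": KirkpatrickLevel.RESULTS,
}

def levels_assessed(indicators):
    """Return the sorted levels covered by a study's coded indicators.

    Indicators missing from the coding scheme are ignored.
    """
    return sorted({INDICATOR_LEVELS[i] for i in indicators if i in INDICATOR_LEVELS})

def highest_level(indicators):
    """Return the highest level a study assessed, or None if none was coded."""
    levels = levels_assessed(indicators)
    return max(levels) if levels else None

# A hypothetical study that measured satisfaction and knowledge gains.
print(levels_assessed(["knowledge", "satisfaction"]))
```

Because a single evaluation can report indicators at several levels, a study is represented here by a set of levels rather than by one label.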

Collating, summarizing, and reporting the results

It is important to note that “scoping reviews do not aim to produce a critically appraised and synthesized answer to a particular question, rather they aim to provide an overview or map of the evidence” (Munn et al., 2018). Therefore, the results of the included sources are described in the context of the overall aim of the review. Also, the aggregated findings provide an overview of the research rather than an assessment of the quality of individual studies.


To conduct a broad search of the published literature, we searched eight multidisciplinary databases and Google Scholar and scanned the reference lists of included articles. We recognize that we may have missed some K* training evaluations if the studies were not published or accessible online. In addition, as Shaxson et al. (2012) and other researchers have noted, there are many different terms used to refer to the processes and functions of connecting research and practice; this study may have omitted some terms from the search criteria and may therefore have excluded relevant studies. However, other reviews of K* (and related terms) have used search criteria similar to ours (e.g., Golhasany & Harvey, 2023; Mallidou et al., 2018; Murunga et al., 2020; Tait & Williamson, 2019). Therefore, while this review may not be exhaustive, it provides a comprehensive overview of the literature on K* training evaluations. As described above, 47 documents were included in the review. Due to the small sample size, caution must be exercised when drawing generalizations and inferences from the data. A further limitation of our study is that we did not analyze information about the structure of the training programs themselves or how the evaluations were used to improve the programs under study. We note that other scoping reviews have been conducted to investigate these aspects (e.g., Golhasany & Harvey, 2023; Mallidou et al., 2018). While such studies are useful to identify strategies to increase the evidence base in this area, the field currently lacks consensus on what outcome indicators or evaluation methods to use. Our review addresses this issue and provides a roadmap for methodological improvement of K* training evaluations.


Scientometric mapping of included studies

In this section, we report on the findings from the scientometric analysis. Different parameters, including publication evolution over time, citation analysis for core publications, co-authorship analysis, bibliographic coupling of documents, and keyword co-occurrence analysis are presented to map the bibliographic information from the included studies.

Publication evolution

Figures 2 and 3 illustrate the publication trends of the included studies. Most articles (i.e., 43 of 47) were published after 2012, accounting for 91% of the data sample. The top five journals contain 25 out of 47 items, representing 53% of the included publications. Implementation Science, a journal devoted to publishing articles on the implementation of evidence-based practices and programs in healthcare, has published the most articles focused on evaluating K* training programs, constituting approximately 36% of the publications. Worldviews on Evidence-Based Nursing, the Journal of Continuing Education in the Health Professions, The Pan African Medical Journal, and the International Development Research Centre round out the top five journals, each with two published articles.

Fig. 2: Publication evolution: percentage of articles over time.

Figure 2 depicts the percentage of articles published over time. The y-axis has a percentage scale (0–50%), while the x-axis shows the years in which articles were published (2006–2022).

Fig. 3: Publication evolution: leading journals.

Figure 3 depicts the percentage of articles that were published in different journals. The y-axis shows the journals in which articles were published, while the x-axis has a percentage scale (0–50%).

Citation analysis for core publications

To identify the most influential publications on K* training program evaluations, we examined the total citation counts (as of August 2023) for each article. The five most cited articles within our dataset are shown in Table 2. The most cited paper is Meissner and colleagues’ article The US Training Institute for Dissemination and Implementation Research in Health, with 147 citations. As shown in the table, the five most cited articles were all published in Implementation Science.

Table 2 The top five highly cited articles included in the review.

Co-authorship analysis: authors, institutions, and countries

Co-authorship analysis is used as a proxy for collaboration (Newman, 2004). The most collaborative authors, organizations, and countries on K* training evaluations are illustrated in Figs. 4–6. In Figs. 4–6, the larger a node (circle), the more documents the corresponding author, institution, or country has. In addition, the thicker the link between two nodes, the more collaboration has occurred between them. The minimum number of documents of an author, organization, and country was set to two (see Supplemental File D for more information on VOSviewer settings). Of the 241 authors in the sample, 30 met the thresholds set in VOSviewer. The author co-authorship map has 30 nodes, five clusters, 117 links, and a total link strength of 220. Ross Brownson from Washington University collaborated on the highest number of publications (n = 7), followed by Enola Proctor from Washington University (n = 5), Karen Emmons from Harvard University (n = 4), and Sharon Straus from the University of Toronto (n = 4). Figure 4 shows the time-based overlay visualization of collaborative relationships amongst authors based on the number of author publications and average-publication-year. Figure 4a (left) shows the visualization for the entire collaboration network (n = 30), while Fig. 4b (right) shows a ‘zoomed-in’ view of the largest set (n = 18) of connected items.

Fig. 4: Collaboration network between authors.

Figure 4 shows the time-based overlay visualization of collaborative relationships amongst authors based on the number of author publications and average-publication-year. a (left) shows the visualization for the entire collaboration network, while b (right) zooms in to show the largest set of connected items.

Fig. 5: Collaboration network between organizations.

Figure 5 shows the time-based overlay visualization of collaborative relationships between organizations based on the number of publications per organization and average-publication-year. a (left) shows the visualization for the entire collaboration network, while b (right) zooms in to show the largest set of connected items.

Fig. 6: Collaboration network between countries.

Figure 6 shows the time-based overlay visualization of collaborative relationships between countries based on the number of publications per country and average-publication-year. a (left) shows the visualization for the entire collaboration network, while b (right) zooms in to show the largest set of connected items.

Of the 100 organizations in the sample, 15 met the thresholds set in VOSviewer. The institution co-authorship map has seven nodes, six clusters, 21 links, and a total link strength of 33. Washington University collaborated on the highest number of publications (n = 8), followed by the National Cancer Institute (n = 4), McMaster University (n = 4), and St. Michael’s Hospital (n = 4). Figure 5 shows the time-based overlay visualization of collaborative relationships amongst organizations based on the number of publications by organization and average-publication-year. Figure 5a (left) shows the visualization for the entire collaboration network (n = 15), while Fig. 5b (right) shows a ‘zoomed-in’ view of the largest set (n = 12) of connected items.

Of the 27 countries in the sample, seven met the thresholds set in VOSviewer. The country co-authorship map has seven nodes, five clusters, three links, and a total link strength of five. The United States collaborated on the highest number of publications (n = 18), followed by Canada (n = 13). Figure 6 shows the time-based overlay visualization of collaborative relationships amongst countries based on the number of publications by country and average-publication-year. Figure 6a (left) shows the visualization for the entire collaboration network (n = 7), while Fig. 6b (right) shows a ‘zoomed-in’ view of the largest set (n = 4) of connected items.
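The quantities reported for the co-authorship maps (nodes, links, and link strength) can be illustrated with a small sketch. This is a simplified approximation of full-counting co-authorship analysis rather than a reproduction of VOSviewer's implementation or of our settings; the author names and papers are hypothetical. We assume that the strength of a link equals the number of documents two authors share and that an item's total link strength is the sum of the strengths of its links.

```python
from collections import Counter
from itertools import combinations

def coauthorship_network(papers, min_docs=2):
    """Build co-authorship links among authors with at least `min_docs` documents.

    Returns (doc_counts, links, total_strength), where `links` maps an
    alphabetically ordered author pair to the number of shared documents.
    """
    doc_counts = Counter(a for authors in papers for a in set(authors))
    kept = {a for a, n in doc_counts.items() if n >= min_docs}

    links = Counter()
    for authors in papers:
        for pair in combinations(sorted(set(authors) & kept), 2):
            links[pair] += 1

    total_strength = Counter()
    for (a, b), weight in links.items():
        total_strength[a] += weight
        total_strength[b] += weight
    return doc_counts, links, total_strength

# Hypothetical author lists, one list per document (illustrative only).
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author B", "Author D"],
    ["Author C", "Author E"],
]

doc_counts, links, total_strength = coauthorship_network(papers)
print(len(links), sum(links.values()))  # number of links, network-level link strength
```

In this toy example, Authors D and E each appear on one document only, so they fall below the two-document threshold and do not appear as nodes, which mirrors how the threshold reduced 241 authors to 30 in our map.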

Bibliographic coupling of documents

To better understand the extent to which the 47 documents in our sample shared references (Van Eck & Waltman, 2014), we constructed a bibliographic coupling network. The bibliographic coupling map has 47 nodes, 17 clusters, 247 links, and a total link strength of 569. The three studies with the highest link strength (i.e., articles with the highest number of references in common with other articles) are Moore et al. (2018), Padek et al. (2018), and Brownson et al. (2017), with total link strengths of 99, 90, and 89, respectively. By contrast, the three studies with the highest number of citations, Meissner et al. (2013), Straus et al. (2011), and Stamatakis et al. (2013), had total link strengths of 34, 20, and 24, respectively. The time-based overlay visualization of the bibliographic coupling analysis is presented in Fig. 7. The visualization reveals that a major cluster of coupling strength exists, which is predominantly composed of articles published between 2009 and 2015. Figure 7a shows the visualization for the entire bibliographic coupling network (n = 47), while Fig. 7b shows a ‘zoomed-in’ view of the largest set (n = 35) of connected items.

Fig. 7: Bibliographic coupling network.

Figure 7 shows the time-based overlay visualization of the bibliographic coupling network based on the number of documents with shared references in common and average-publication-year. a (left) shows the visualization for the entire bibliographic coupling network, while b (right) zooms in to show the largest set of connected items.
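As a rough illustration of how bibliographic coupling is computed, the sketch below counts the cited references that each pair of documents has in common. It is a simplified stand-in for the VOSviewer analysis, and the document and reference identifiers are hypothetical. We assume that the strength of a coupling link is the number of shared references and that a document's total link strength is the sum over its links.

```python
from itertools import combinations

def bibliographic_coupling(references):
    """Map each document pair to the number of cited references they share.

    `references` maps a document id to the list of references it cites.
    Pairs with no shared references are omitted (no link).
    """
    links = {}
    for doc_a, doc_b in combinations(sorted(references), 2):
        shared = len(set(references[doc_a]) & set(references[doc_b]))
        if shared > 0:
            links[(doc_a, doc_b)] = shared
    return links

def total_link_strength(links):
    """Sum the strengths of all links attached to each document."""
    totals = {}
    for (doc_a, doc_b), weight in links.items():
        totals[doc_a] = totals.get(doc_a, 0) + weight
        totals[doc_b] = totals.get(doc_b, 0) + weight
    return totals

# Hypothetical reference lists (illustrative only).
references = {
    "doc_1": ["ref_a", "ref_b", "ref_c"],
    "doc_2": ["ref_b", "ref_c", "ref_d"],
    "doc_3": ["ref_c", "ref_e"],
    "doc_4": ["ref_f"],
}

links = bibliographic_coupling(references)
print(total_link_strength(links))
```

Here `doc_4` shares no references with the other documents and therefore remains an isolated node, analogous to the documents that fall outside the largest connected set in Fig. 7b. The sketch also shows why a highly cited document need not have a high coupling strength: coupling depends on what a document cites, not on how often it is cited.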

Keyword co-occurrence analysis

Keyword co-occurrence analysis provides a description of the principal areas of focus in the research field (Chen & Song, 2019). In total, 227 keywords were used to describe the documents included in the review. Keywords that were used more than once were selected to map the network. The co-occurrence network of keywords is mapped in Fig. 8. In total, 56 keywords were mapped. The top ten keywords with the highest total link strength were knowledge translation, implementation, dissemination, capacity building, science, training, evaluation, education, mixed methods, and implementation science.

Fig. 8: Keyword co-occurrence network.

Figure 8 shows the time-based overlay visualization of the keyword co-occurrence network based on the keywords used and average-publication-year.

Findings from qualitative content analysis

In this section, we report the methodological characteristics and outcomes assessed in the included evaluation studies. In addition, we summarize the future evaluation approaches proposed by the authors of the included studies.

Methodological characteristics of evaluations

We were interested in capturing the methodological details of the evaluation studies included in the review (Table 3). Most evaluations were designed to examine process and outcome variables. Authors also used other terminology to describe their evaluations, including comparative evaluation, narrative evaluation, pluralistic evaluation, participatory evaluation, collaborative autoethnography, and environmental scan. When considering the benefits of the training programs, costs of training were not assessed. Evaluations more often used one-group designs than two-group designs. Most of the included studies either did not report their sample size or had a small sample, which limits statistical power to detect true treatment effects. A wide range of data collection techniques were used, including surveys, interviews, document analysis, focus groups, participant feedback, and critical reflection. Researchers used several types of surveys to obtain data from trainees, including reaction surveys, pre- and post-competency surveys, and network surveys. Data collection most often occurred before the intervention and within 1 month after the intervention. Long-term data collection (more than 1 month post-training) was less likely to occur across the included studies.

Table 3 Methodological characteristics of studies.

Outcomes assessed in the included studies

Authors were most likely to report measuring trainees’ learning (n = 38, 80.9%) and reactions (n = 37, 78.7%). Many studies also measured the behavior change (n = 32, 68.1%) of trainees after they finished the training and returned to their jobs. Less common were studies that examined the downstream results (n = 20, 42.6%) that occurred because of the training. In the following sub-sections, we summarize the types of data collection techniques and outcome indicators used in relation to each level of the Kirkpatrick Framework.
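The percentages above are each study count divided by the 47 included documents. The short check below reproduces them; the dictionary keys are our own labels, while the counts are those reported in the text.

```python
TOTAL_STUDIES = 47

# Number of included studies reporting outcomes at each Kirkpatrick level.
level_counts = {"learning": 38, "reaction": 37, "behavior": 32, "results": 20}

def share(count, total=TOTAL_STUDIES):
    """Percentage of included studies, rounded to one decimal place."""
    return round(100 * count / total, 1)

shares = {level: share(n) for level, n in level_counts.items()}
print(shares)

# The counts sum to more than 47 because a single study can assess
# outcomes at more than one level.
print(sum(level_counts.values()))
```

Because the categories overlap, the four percentages should not be expected to sum to 100%.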


While evaluators typically used self-report surveys (utilizing both Likert and open-ended questions) to collect data on trainees’ reactions to the training, qualitative approaches such as interviews and focus groups were also utilized. Authors used several approaches for reporting these data: survey responses were analyzed using descriptive statistics such as mean, mode, and range, while qualitative data were grouped and analyzed thematically. To gauge trainees’ reactions to the training, evaluators examined the level of satisfaction with the training program and its specific components (Oronje et al., 2022; Salloum et al., 2022), the extent to which the curriculum was clear and well organized (Astle et al., 2020; Lo Hog Tian et al., 2022; Morrato et al., 2015), the format of the training program (Froese & Montgomery, 2014; Gaid et al., 2022; Greenhalgh & Russell, 2006; Hess et al., 2013), the competence of trainers (Brownson et al., 2021; Cunningham-Erves et al., 2021; Dagenais et al., 2015), the value of cohort-based learning (Brownson et al., 2017), trainees’ level of engagement (Brownson et al., 2017), and the usefulness and relevance of the training to the trainees’ actual job performance (Jones et al., 2015; Meissner et al., 2013; Olejniczak, 2017; Provvidenza et al., 2020; Vinson et al., 2019). In addition, trainees were asked to describe the key strengths of the program and provide suggestions for improvement (Moore et al., 2018; Rakhra et al., 2022).


When examining ‘learning,’ evaluators explored trainees’ changes in knowledge or skills and/or changes in confidence and commitment to apply new K* knowledge and skills. Data collection approaches included the use of interviews, focus groups, observation, student data, and self-report surveys. Survey data were analyzed using mean ratings and tests of significance (e.g., Mbuagbaw et al., 2014). Depending on the evaluation, items were analyzed individually or were grouped into subscales to assess the underlying constructs (e.g., Proctor et al., 2019). While pre- and post-competency surveys were typically used to measure the change in participants’ knowledge and skills, some evaluations only used post-competency assessments. Post-competency assessments typically occurred immediately following the completion of the training program. However, to measure trainees’ sustained competence, some evaluations re-surveyed trainees at a later point in time. For example, Park et al. (2018) conducted interviews and focus groups and surveyed individuals at baseline (pre-training), during training, and 6 and 12 months post-training to capture participants’ sustained knowledge of K* and self-efficacy in performing new K* skills.


Several different techniques were used to collect data on trainees’ changes in behavior. For example, Santacroce et al. (2017) used student data; Marriott et al. (2015), Morrato et al. (2015), Ndalameta-Theo et al. (2021), Vinson et al. (2019), and Meissner et al. (2013) used self-report surveys; and Hilbig et al. (2013) used interviews to gather data on participants’ activities after taking part in their training program. Evaluators examined the extent to which trainees accessed resources, engaged in K*-focused activities, and/or influenced the thinking of colleagues. Some evaluators (Brownson et al., 2021; Luke et al., 2016; Morrato et al., 2015) were also interested in examining the development of collaborations and partnerships between trainees. In these instances, evaluators used social network surveys to collect data on different types of relationships.


As we previously noted, evaluators were less likely to report on downstream outcomes and impacts that occurred because of the training. Of those who did, many continued to rely on self-report surveys to capture results data (e.g., Carlfjord et al., 2017). However, other approaches were also utilized. For example, Baumann et al. (2020) used bibliometric analysis to understand the extent to which trainees had increased publications and grant funding compared to a control group. Kho et al. (2009) utilized participant feedback to understand how training affected participants’ employment. In addition, Luke et al. (2016) used social network analysis to examine the extent to which post-training collaborations were sustained over time. Finally, evaluators also used qualitative approaches to capture perceived changes to organizational processes, structure, and culture and the attainment of organizational goals (Clark et al., 2022; Provvidenza et al., 2020; Vinson et al., 2019).

Future evaluation approaches proposed by authors

Common limitations noted by authors included the use of simple evaluative designs, small cohorts/sample sizes, only evaluating short-term outcomes, and lack of curriculum evaluation activities. Of the included studies, 33 (70.2%) proposed future evaluation approaches for overcoming the current challenges associated with evaluating K* training programs.

Several authors (Breen et al., 2018; Brownson et al., 2017; Clark et al., 2022; Dagenais et al., 2015; Goodenough et al., 2017; Jacob et al., 2020; Jessani et al., 2019; Morrato et al., 2015; Norton, 2014; Rakhra et al., 2022; Straus et al., 2011; Uneke et al., 2018; Wahabi & Al-Ansary, 2011) reported that stronger evaluative designs are needed. Goodenough et al. (2017) call for multi-variate repeated measure designs, Clark et al. (2022) suggest the use of a control group, and Jacob et al. (2020) suggest the use of combined evaluation approaches to fully understand the impact of program activities. Norton (2014) and Jessani et al. (2019) also reported the need to use both pre- and post-measures to examine training outcomes. Relatedly, the need for more rigorous and standardized measures to evaluate the outcomes of training programs was highlighted by Jacob et al. (2020) and Wahabi and Al-Ansary (2011). Goodenough et al. (2017) and Stamatakis et al. (2013) also report that sufficiently large sample sizes are needed to ensure statistical power. Finally, Dagenais et al. (2015) argued that every component of a training program’s theory of action and/or logic model should be evaluated to explain the effects obtained.

Authors noted that there was a need for future evaluative activities to examine the longer-term impact of training activities (Baumann et al., 2020; Clark et al., 2022; Froese & Montgomery, 2014; Gerrish & Piercy, 2014; Luke et al., 2016; Moore et al., 2018; Murong & Nsangi, 2019; Padek et al., 2018; Park et al., 2018; Provvidenza et al., 2020; Ramaswamy et al., 2019; Salloum et al., 2022; Uneke et al., 2017; Uneke et al., 2018). It was suggested that longitudinal (Moore et al., 2018; Park et al., 2018; Provvidenza et al., 2020), time series (Clark et al., 2022), or stepped wedge (Clark et al., 2022) designs may be appropriate approaches for measuring long-term impact and behavior change. Park et al. (2018) recommended that future evaluations expand outcome assessments to consider ‘spillover’ effects of participants engaging in additional training opportunities outside of the training program being studied. Similarly, Baumann et al. (2020) suggested that evaluators consider opportunities for behavior change within participants’ local contexts. Other authors suggested conducting longer-term evaluation activities that examine training outcomes by participants’ discipline/field, changes in collaboration with stakeholders (through social network analysis), and the effects training had on participants’ employment or position. The use of case studies and qualitative analysis was suggested by Padek et al. (2018) as a potential way to provide more robust feedback on the overall impact of the training program on individual participants.

The authors also highlighted the fact that their current evaluations did not measure the extent to which the various components of the training program produced desired results (Baumann et al., 2020; Goodenough et al., 2017; Olejniczak, 2017). As such, it was suggested that future evaluations assess the relative effectiveness of different training components. Further, Goodenough et al. (2017) suggested that future evaluations examine which individuals might be the best target of training.


An increasing number of institutions offer K* training programs to researchers, practitioners, and other stakeholders, thereby potentially providing them the opportunity to ensure findings from research are useful, useable, and utilized. Given the investment in these programs, evaluations have been conducted to assess the effectiveness of K* training programs. To inform our own evaluation of a K* training program, we aimed to understand how other K* training programs were being evaluated. In this section, we provide a summary of the scientometric and content analysis findings, followed by practical implications for evaluators and staff of K* training programs.

Summary of scientometric analysis findings

The findings from the scientometric analysis suggest that the concept of K* training is still quite young, and that literature regarding the evaluation of K* training programs started to appear in the mid-to-late 2000s. As such, the number of documents included in this review is small. We note that the literature has grown since 2012 and, given the development of the K* field and increasing calls for capacity development in this area, it is reasonable to expect that this growth will continue. At present, most publications come from the fields of health and implementation science. However, as the field begins to mature, we expect to see researchers from different disciplines (e.g., education and other social science and humanities disciplines) contribute to building the literature base. As the concept of K* training is relatively new, it is understandable that collaboration amongst authors was not widespread. However, limited collaboration may result in a lack of sharing of knowledge and resources, leaving K* training program developers and evaluators at risk of “re-creating the wheel.” As the field continues to develop, we encourage K* program staff and evaluators to connect and collaborate with others engaging in similar types of initiatives.

Summary of content analysis findings

Findings from the content analysis revealed that process and outcome evaluations were the most commonly applied evaluation designs, while commonly used data collection techniques included surveys and interviews. Many of the authors of the included studies recognized the inherent limitations of their evaluations and pointed out issues with small sample sizes, lack of long-term follow-up, and difficulties in measuring long-term impact.

Most studies assessed the ‘reactions’ of trainees, with evaluators using surveys to obtain feedback from participants. This is likely because surveys can be conducted easily after training sessions. Surveys typically consisted of Likert-style questions, coupled with open-ended items to better understand why trainees might value different program dimensions. The reaction level was typically measured using multiple dimensions (Alsalamah & Callinan, 2021a), such as quality of training content, delivery methods, cohort development opportunities, quality of the trainer, and flexibility and accessibility of training approach. Measuring trainees’ reactions is important as “both positive and negative comments can be used to modify the program and to ensure…support for the training program” (Reio et al., 2017). In addition, understanding data captured at Level 1 can form the basis for analyzing subsequent levels of training evaluation. For example, Level 1 reaction data may reveal barriers that impede trainees’ learning (Level 2). However, Reio et al. (2017) go on to explain that “favorable reactions to the training do not, by themselves, guarantee that learning (Level 2) or improved performance (Level 3) has occurred,” and as such, evaluators must also capture data on trainees’ learning and behavior, as well as the downstream impacts that occurred because of the training.

A substantial proportion of studies also assessed the ‘learning’ of trainees, with many evaluations using non-experimental (i.e., no control group) pre- and post-intervention designs. Pre- and post-intervention questionnaires were often self-report instruments rather than direct measures of capacity. However, we also found several evaluations in which only post-program surveys were conducted to assess participants’ skill and knowledge development. Our findings revealed that over two-thirds of studies assessed behavior change. Behavior change was often measured between 6 and 12 months post-intervention through self-report methods, such as surveys and interviews. Downstream results were less likely to be evaluated in K* training programs. Where they were evaluated, evaluators assessed changes in trainees’ outputs (e.g., increased publications or grant funding) or broader organizational changes that occurred because of trainees participating in the training intervention.

Recommendations for future K* training evaluations

Based on our experience reviewing the evaluation of K* training programs, we offer several recommendations to future evaluators and program staff who choose to pursue this line of work.

Increase overall rigor of evaluations

To maximize the rigor of evaluation studies, we recommend the use of a logic model or theory of action to guide the development and conduct of studies (Dagenais et al., 2015). Relatedly, evaluators and program staff should clearly identify and evaluate the linkages between the training program components and its outcomes. Evaluators and program staff should also include both process and outcome (i.e., impact) components in the evaluation design. Additional evaluative components, such as cost-benefit analyses and curriculum evaluations, can also be included to justify programmatic action.

Improve the soundness of research designs

Future evaluations of K* training programs can improve on their research designs by utilizing pre- and post-intervention designs. We note that evaluators can use pre- and post-tests in two ways: through self-report surveys or through competence assessments. Self-report surveys require participants to state their perceived level of competence in a domain. Self-report surveys can accurately assess self-efficacy and aptitude and are usually inexpensive and easy to use. However, we note that they may not be the best method to assess dynamic processes such as comprehension, and they may suffer from self-reporting and recall bias (Paulhus & Vazire, 2007). Competence assessments require, as a minimum, that trainees can show evidence of competence in K* (e.g., by answering questions on a test). Competence assessments provide a more objective and relevant measure of performance; however, participants in varying settings may have different amounts of time available to study for the test, and there may be differences in professional exposure to concepts covered on the test (Grissom et al., 2019). We recommend that evaluators and program staff weigh the benefits and challenges related to self-report and competence assessments and use the type of pre/post-test that works best for their own evaluations. Future evaluations of K* training programs can also improve on their research designs by triangulating data (i.e., collecting multiple indicators of the same outcomes). In the current review, self-report measurements of learning and behavior changes were predominantly used by evaluators. While this method is simple, it also presents issues related to self-reporting and recall bias. Additionally, if trainees perceive the self-assessment as being linked to performance management, results may be skewed.
As such, self-report measures are best used in conjunction with other methods to reliably measure changes in behavior and learning (Hagger et al., 2020). For example, evaluators can use 360° feedback, whereby a small number of behaviors are assessed by trainees, direct supervisors, and other stakeholders to examine trainees’ performance (Kanaslan & Iyem, 2016). We also recommend that future evaluations choose a sample size large enough to detect true treatment effects; however, we recognize that obtaining an adequate sample size can be challenging for training programs due to resource limits and potential issues that may arise related to recruiting, enrolling, and retaining participants (Avellar et al., 2017). Finally, where possible, we recommend that future evaluations include two-group designs (i.e., treatment and control groups).
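To illustrate the sample size recommendation, the following is a minimal sketch of how an evaluator might estimate the number of participants needed per group in a two-group design. It is not drawn from any of the reviewed studies. It assumes a two-sided comparison of two independent group means, a standardized effect size (Cohen's d) chosen in advance by the evaluator, and the normal approximation, which tends to give slightly smaller numbers than an exact t-based calculation. The function name `n_per_group` and its default values are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed in EACH group (treatment and control).

    Assumes a two-sided test comparing two independent means, equal group
    sizes, and the normal approximation. `effect_size` is the expected
    standardized mean difference (Cohen's d), an assumption the evaluator
    must justify from prior work or pilot data.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Conventional small, medium, and large effect sizes, used only as examples.
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: about {n_per_group(d)} trainees per group")
```

Under these assumptions, a medium effect (d = 0.5) calls for roughly 63 trainees per group, which many training cohorts will not reach. This is one reason the triangulation and contribution-style approaches discussed in this section may be more practical for small programs.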

Evaluate impact using contribution analysis

Future studies are needed that include and evaluate Level 4 Kirkpatrick evaluation criteria (i.e., downstream results). However, it should be noted that attributing changes in downstream results, such as improved relationships between the research and practice communities, to a training program is challenging because such changes are multifactorial and complex. Other factors, such as national and state legislation on research use, additional training opportunities attended by trainees, and trainees’ organizational contexts, may also contribute to better K* outcomes. We point to contribution-style approaches (e.g., Kok & Schuit, 2012; Morton, 2015) as a potential way in which evaluators can address issues of attribution in future evaluation studies. Contribution analysis is a theory-based evaluation approach that provides a systematic way to arrive at credible causal claims about a program’s contribution to change (Mayne, 2008, 2012). The approach involves developing and assessing the evidence for a logic model to explore the program’s contribution to observed outcomes. The approach is particularly useful in situations where an experimental (i.e., two-group) design is not feasible (Mayne, 2008, 2012). The findings from a contribution analysis do not provide definitive proof that a program caused the observed outcomes, but they allow evaluators to draw a plausible conclusion that the program has contributed to documented results (Mayne, 2008, 2012).

If possible, longitudinal data should be collected

We recommend that longitudinal data be collected on K* trainees to examine the effects of K* training programs over time. For example, evaluators can follow up with trainees at pre-determined time periods (e.g., 6, 12, 18, and 24 months post-training) to examine Level 3 (behavior) and Level 4 (results) outcomes.
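As a small planning aid, the sketch below shows one way an evaluation team might generate follow-up dates from a cohort's training end date. The 6, 12, 18, and 24 month offsets come from the example above; the function names and the sample end date are hypothetical, and a real schedule would also need to account for survey windows and reminders.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` after `start`, clamping to the month's last day."""
    total = start.month - 1 + months
    year = start.year + total // 12
    month = total % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def follow_up_schedule(training_end: date, offsets=(6, 12, 18, 24)) -> dict:
    """Map each follow-up offset (in months) to its target data collection date."""
    return {months: add_months(training_end, months) for months in offsets}


if __name__ == "__main__":
    # Hypothetical cohort end date, for illustration only.
    for months, due in follow_up_schedule(date(2024, 8, 31)).items():
        print(f"{months:>2}-month follow-up: {due.isoformat()}")
```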

How the scoping review has informed CREATEd’s evaluation

As argued by Dagenais et al. (2015), a key component of good evaluation planning is the use of a theory of action (ToA) to anchor the implementation of the program to its objectives and intended outcomes, and to provide a basis for formulating the questions and evaluation needs to be addressed. As such, over the course of several meetings, the CREATEd team collaboratively developed a ToA that we use to shape the work of CREATEd and guide evaluation activities. We use an evaluation framework to organize and link relevant ToA outcomes to evaluation measures, evaluation questions, data collection tools, data sources, data analysis procedures, and the year(s) in which data collection will occur. This framework is updated and refined on a yearly basis to ensure that the evaluation team continues to gather data that reflects CREATEd’s priorities. To evaluate CREATEd initiatives, the evaluation team collects data corresponding to its initiatives and its short-, intermediate-, and long-term outcomes. We provide a summary of each of the tools and/or approaches that we use to collect data in relation to the Fellowship.

We use multiple tools and/or approaches to collect data on the Fellowship. For example:

  • Project Data Records: we collect data on the number of applications, number of fellows that participate, and fellows’ demographic information.

  • Module/Workshop Reaction Surveys: this survey asks about fellows’ thoughts and opinions about the online modules and workshops they complete during the Fellowship.

  • Pre- and Post-Training Surveys: this survey asks about the extent to which fellows have key knowledge and skills across key competency areas. Fellows complete the survey both pre- and post-engagement in the curriculum.

  • Interviews: in a series of interviews, the CREATEd evaluation team asks about fellows’ experiences and opinions on engaging in the fellowship.

  • Observations: during the fellowship program, fellows are asked to facilitate event(s) that include diverse stakeholders in the education community. A member of the CREATEd evaluation team observes the event(s) and records notes on what they see and hear.

  • Annual Alumni Follow-Up Survey: this survey asks fellows about their professional experiences after completing the fellowship program and includes social network items to assess the development and sustainment of fellows’ social ties. Fellows complete the survey once a year for up to three years following their graduation from the Fellowship.

CREATEd program staff are committed to using evaluation data to foster program improvement. As such, we regularly engage in reflection exercises to map the evidence we have collected onto the components of the ToA to assess the extent to which our program has contributed to outcomes following the intervention.


The scoping review presents a comprehensive assessment of the K* training evaluation literature. The review was conducted with the rigor and transparency advocated by Arksey and O’Malley (2005). The review comprises two parts. First, a thorough scientometric analysis was conducted to present the scholarly networks and research trends, identifying influential articles, authors, and collaboration networks. This was followed by a qualitative content analysis to synthesize the methods applied and outcomes assessed within the respective articles. The evidence presented in this review contributes to discussions about how the K* training evaluation literature has grown and is changing over time. We believe the findings from this scoping review will be of interest to evaluators and program designers and will help inform the design of future evaluations of K* training programs.