Executive functions (EFs) refer to a set of cognitive processes that allow for goal-directed behavior through the regulation of various cognitive subprocesses. Since EFs permeate behavior, they also impact daily activities as well as social and personal development, including school or job success1. The importance and pervasiveness of EFs has led different fields of study to investigate these control mechanisms with the goal of differentiating the various subdomains of EFs. This, in turn, has resulted in a number of different conceptualizations based on different approaches, all attempting to subdivide EFs into different domains. While a consensus does not yet exist about how exactly to subdivide and name EFs, there is general agreement that there are three core EFs: (1) cognitive flexibility, (2) working memory and (3) inhibition1 (but see Karr et al.2,3). Higher-order EFs, such as reasoning, planning and problem solving, are then built on the basis of these subdomains.

The various sub-domains of EFs have been shown to be impaired in a number of neurological and psychiatric diseases, such as attention-deficit/hyperactivity disorder (ADHD)4, Parkinson’s disease5, depression6 and schizophrenia7. Different diseases present their own typical EF deficits and clinical diagnosis attempts to assess the specific patterns of the disease. For example, in the case of Parkinson’s disease, patients suffer from difficulties in dual-tasking which is reflected in the deficient combination of memorizing and manipulation of thoughts and tasks8 but also in impaired speech characterized by semantic paraphasias and reduced word fluency due to a lack of EFs5,9. To assess these symptoms different test batteries have been developed. These batteries, which include tests tapping into the different EF sub-domains, are used for neuropsychological assessment in both clinical settings and lab-based environments. Commonly used batteries are the Delis-Kaplan Executive Function System (D-KEFS) and the Vienna Test System, both of which offer a wide range of tests probing each of the EF sub-domains and have been independently validated10,11,12. Commonly used tasks tapping into the different sub-domains of EFs comprise the Wisconsin Card Sorting test (WCST), Tower of London (ToL) and Trail-Making test (TMT) to assess cognitive flexibility5,13,14, n-back tasks and the Corsi block tapping test to cover the sub-domain of working memory15,16 and the Stop-signal task or the Stroop test (color-word interference) to probe the sub-domain of inhibition. All of these are commonly used in the clinical5,17 as well as in the scientific context14,18. Importantly, due to the overlap of the different domains of EFs, these tests cannot be assumed to target one specific domain of EFs only.

Beside subdomain-specific EF tests, clinical and research test batteries also include a speech-based task, namely the verbal fluency (VF) task but the explicit involvement of different EF subdomains in the VF task is reported controversially especially when considering inter-individual differences19. The well-established and often-used VF test assesses the number of words generated in a given time (usually 60 seconds) and has been found to be a sensitive measurement for testing EFs in both non-clinical groups19 as well as in neurological patients20. VF tests mainly comprise two types of tasks: The phonemic/lexical VF, requiring the generation of as many words as possible beginning with a specific letter (e.g. C, F); and semantic VF, in which the examinee is asked to produce words that belong to a specific semantic category (e.g. fruits, animals). Additionally, most VF tests include a switching task in which words from two different categories are produced in an alternating order21.

The relationship between VF and the various subdomains of EFs has frequently been investigated in both healthy controls19,22,23,24 and patients17,25. Concerning the relationship between VF and working memory, some studies showed that better working memory performance leads to less perseveration errors26 and a higher total score of produced words in the VF task22,27. However, a clear link between working memory and VF performance has so far not been found19,28. Similarly, the relationship between VF and response inhibition is not clear yet. While some studies report lower scores in VF concomitating with a decline of inhibition performance22 other studies failed to find a link between VF and inhibition20,29. Finally, regarding the relationship between VF and cognitive flexibility, studies report a positive correlation of switching between categories in VF tasks and cognitive flexibility performance21. However, there are also findings which indicate that there is no relationship between EFs and VF performance30.

Altogether, results concerning the relationship of EFs and VF are ambiguous. This might, at least partly, be based on inter-individual variability of both EFs and VF. For example, a pronounced effect of age was identified by multiple studies showing a significant negative correlation between age and the different aspects of EFs as well as VF performance31,32,33,34. Furthermore, fluid intelligence has been found to be related to the performance in EFs tasks tapping into the subdomains of planning and reasoning and35. In contrast, inhibition was shown to be independent of intelligence in children with problems performing attention tests36.

The complex involvement of EFs in VF performance has also been shown to be modulated by the influence of inter-individual variability like dynamically varying hormonal levels. Especially sex hormones like estradiol and progesterone have been shown to influence performance in EFs tasks37,38. It was shown that cognitive performance varies during the different phases of the menstrual cycle with high progesterone and estradiol levels leading to faster reaction times and better accuracy37,39. Moreover, cortisol, which is mostly associated with stress, appears to impact EFs, but literature addressing this topic is ambiguous. On the one hand, studies found a positive relationship e.g. between cortisol level and working memory40 or cortisol level and performance in cognitive flexibility tasks41 in men. On the other hand an inverse relationship was found in cognitive flexibility tasks in women41 and in working memory performance42. Additionally to the influence on EFs varying hormonal levels could be also linked to VF performance43. To investigate the role of varying hormonal levels studies implement different procedures: While some inject specific hormones and assess the change of cognitive functions due to this injection44,45 other studies analyze intra-individual differences measuring the naturally varying hormonal level at different points of time37,40. Test data of EFs and VF are commonly analyzed with classical statistical methods. For example, correlation analyses have been previously used to investigate the relationship of VF and specific subdomains of EFs46. Other studies have investigated group differences between patients and healthy controls to e.g. examine sex differences in VF strategies19 or to explain the relationship of memory and VF in patients with Alzheimer’s disease28. Furthermore, factor analysis has also been applied to investigate common cognitive structures of VF performance24,30.

Considering previous literature investigating the relationship of EFs and VF in more detail it is obvious that each work contributes to a better understanding of this relationship but generalizing this knowledge is still difficult. Specifically, these limitations are e.g. due to the small subject size or reduced EF test batteries which does not represent overall EF performance. Generalizability is also restricted due to the applied methods. All the above-mentioned methods are applied to investigate within-sample effects to understand the theoretical hypothesis-driven neuropsychological relationship between VF performance and EFs. However, it is so far unclear to what extend VF task performance reflects the different subdomains of EFs. To address this question, more advanced statistical methods should lead to a more detailed insight into the complexity of VF performance. Machine learning models can be used to characterize complex behavior with the ultimate goal of identifying and predicting psychiatric diseases47,48. In contrast to classical statistical analyses, these prediction analyses use large sample sizes and a high number of variables as well as a cross-validation approach by training a model on part of the dataset and then validating it on unseen data. Applying machine learning methods on a wide variety of EF tests enables to capture the complex and non-linear relationship of EFs and VF performance.

To contribute to a deeper understanding of the so far inconclusive relationship between EFs and VF, the present study used a machine learning approach to investigate to what extent VF performance can be explained by subdomain-specific EF tests. We hypothesize that VF performance can be explained by a conglomeration of cognitive flexibility, working memory and inhibition test scores, which is further modulated by individual variations of fluctuating hormone levels.



The age of the 253 healthy participants was ranging from 20–55 years (mean age 35.3 ± 11.0, 99 males). Participants were monolingual German speakers and received different levels of education (finished middle school: 10, professional school/job training: 70, finished high school with a university-entrance diploma: 76, university degree: 97). Participants were recruited in North Rhine-Westphalia (Germany) via social networks and the Forschungszentrum Jülich mailing list. Testing sessions took place at the Forschungszentrum Jülich, with a duration of 150–180 minutes depending on the time needed for instructions and the speed with which the participants passed the tests. A remuneration fee of €50 was paid.

Data collection

Data was collected by four different examiners, all of whom conducted several pilot testings and were instructed by the study leader to ensure a common standard. The examiner gave standardized instructions before starting each test and help was provided by the examiner whenever the participant had any questions regarding the instructions or tests. The testing session included 13 EF tests and 3 semantic VF tasks. The EF test battery consisted of computerized versions of neuropsychological tests covering domains of inhibition, working memory and cognitive flexibility. Ten of these tests were taken from the Vienna Testsystem test battery and three were designed with PsyToolkit49. In this study, we assessed commonly used EF tests like the Stroop and TMT. We used a broad selection of EF tests to cover all subdomains of EFs and to detect most influencing tests and their variables. A complete list of the tests is shown in Table 1.

Table 1 Overview of executive function test battery.

Results of the neuropsychological tests can be seen in Table 2.

Table 2 Neuropsychological data of participants.

The semantic VF tasks were based on the Regensburger Wortflüssigkeitstest50. This test is a standardized neuropsychological assessment that has been thoroughly tested for reliability, validity and objectivity50. Due to language-specific differences in the frequency and usage of letters and categories51 this German version of VF task was used. Two of the tasks were simple semantic VF tasks in which the participant had to name animals (t1) and jobs (t2). The third semantic VF task was a switching task in which the participant switched between fruits and sports (t3) within the same task. Each of the three tasks was performed for 2 minutes. The VF tasks were presented with Presentation software (Neurobehavioural Systems) and the participant’s responses were recorded automatically. Following the testing session, the recorded speech was transcribed and words were coded manually as being either correct answers or errors. The number of correct words were counted for each task (t1, t2, t3) and the sum score of total number of correct words across all three VF tasks was used in all further analyses. To broadly represent VF performance, the sum of all VF tasks was selected to include different aspects of the task. This variety of VF performance is beneficial to build a machine learning model which is complex enough to reflect the complex patterns of VF performance.

In addition to the main test set of EFs and VF tasks, phenotypical data was collected through questionnaires to gather information regarding the physical and psychological well-being of the participants. These questionnaires included the Beck Depression Inventory (BDI-II) (Beck, Steer & Brown, 1996) which was used to collect information regarding depressive symptoms. Saliva samples were collected at the beginning and at the end of the test session. The two saliva samples of each subject were sent to an external lab which pooled both samples before carrying out analysis for cortisol, progesterone, estradiol and testosterone. Additionally, the testing session also comprised further speech tests (word-picture interference task, picture description, spontaneous speech), for which results will not be reported here, as they will be independently analyzed. This additional data will then be described in a subsequent paper. Moreover, we aim to publish a data paper which will describe all aspects of data collection, test selection and testing procedure in detail while also making this data publicly available.

Collection and analyses of the data presented here was approved by the ethics committee of the Heinrich-Heine University Düsseldorf. We confirm that all experiments were performed in accordance with relevant guidelines and regulations. Moreover, informed consent was obtained from all participants.

Data analysis

The original dataset of 253 participants was reduced to 235 due to missing data of some participants (94 males; 101 participants were aged between 20–31, 70 between 32–43, 64 between 44–55). From all EF tests 72 variables (Supplement 1) were extracted based on the features provided by the Vienna Testsystem and PsyToolkit49. VF performance was represented by the sum score of correct words across all VF tasks.

Two independent analyses were computed. In a first analysis, Spearman correlations were computed to analyze the relationship of each EF variable and VF sum scores. Here, a reduction of the 70 EF variables was used. Specifically, EF variables were selected based on the EF test manuals provided by the Vienna Testsystem in 10/13 EF tests. In cases where multiple main variables were provided by the Vienna Testsystem, the main variable was selected based on previous literature investigating EF performance. In contrast to the Vienna Testsystem, tests run within Psytoolkit are not standardized and thus do not come with associated test manuals. Thus, the selection of main variables of tests designed with Psytoolkit (3/13) were selected based on previous literature.

Considering the influence of sex and age on the performance in EF and VF tasks19,22,34 data were adjusted for these variables by linear regression and analyses were computed with the residuals.

In a second analysis, the possibility of predicting VF from EF scores was investigated by applying supervised learning via a sparse (relevance vector machine; RVM) and non-sparse (partial least squares; PLS) model using 72 EF variables (Supplement 1). Generally speaking, sparse models aim to reveal a sparse structure and detect correlations among redundant features52. Specifically, RVM is based on the Support Vector Machine (SVM) but is a Bayesian sparse technique which allows for the prediction of a specific target value from a set of different features. In contrast, PLS is similar to principal components regression and is based on covariance. Results given in the main manuscript focus on the RVM approach, while results for the PLS analysis are given in the supplement. Sex and age were regressed out from VF score and from EF data in a cross-validation consistent way.

Before running the prediction analysis, data was transformed to z-scores. A 10-fold cross-validation was then performed for which the data set was randomly split into 10 sets, 9 of which were used for training while the 10th set was held back and used to perform the prediction in previously unseen data. Ten replications of the 10-fold cross-validation were performed and thus 100 prediction models were computed. Prediction performance was assessed by computing the correlation between real and predicted values.

Beside testing statistical significance of prediction performance, we also examined which specific EF features significantly impact prediction performance. To determine which EF features (EF test variables) contribute most strongly to the prediction, we employed an approximate permutation test procedure, in which associations between features (total set of EF variables) and labels (VF sum score of each participant) were randomized. That is, the VF performance score was randomly permuted while the feature matrix was kept unchanged. The RVM analysis was repeated for each permutation and accuracies for 100 permutations were used to construct an empirical null distribution for each feature, which was used to compute the statistical significance of the contribution of each feature as the proportion of permutated labels achieving a better prediction than then original labels.


Correlations between executive function scores and verbal fluency performance

The correlation analyses identified multiple significant results which are shown in Fig. 1.

Figure 1
figure 1

Plots of significant correlations of executive function tests and total verbal fluency sum score. The performance in verbal fluency task is represented by the total number of correct words produced across all three semantic VF tasks (t1 + t2 + t3).The negative correlation in plot b-f are due to the divergent direction of the scores since these variables describe different types of errors, reaction or process times (the higher the worse the performance) while the performance in the verbal fluency is represented by the total amount of correct items (the higher the better).

The highest negative correlation coefficient can be seen between the number of missed items in WAF-G and the VF performance (r = −0.21; p = 0.0009) indicating that a better performance in divided attention is associated with a higher VF score. Likewise, inhibition ability measured by the naming interference variable of the Stroop test (r = −0.20; p = 0.001) shows a negative correlation with the VF score. This result indicates that participants who successfully inhibited proponent behavior in the Stroop task perform better in the VF task. Additionally, abstract reasoning assessed with the Raven’s Progressive Matrices test (SPM) reveals a positive correlation (r = 0.19; p = 0.003) to VF performance indicating a demand of cognitive flexibility and planning while generating words from a specific category. Similar results were found for the TMT (r = −0.14; p = 0.029) and the number of perseveration errors in the WCST (r = −0.14; p = 0.032) which particularly reflect the involvement of cognitive flexibility and working memory in the VF task. Additional to the EF battery we also found a significant negative correlation of the VF tasks and the Cortisol level of the subjects (r = −0.13; p = 0.042).

Prediction of verbal fluency performance from EF scores

The correlation of true and predicted values was r = 0.28 (p < 0.0001) (Fig. 2).

Figure 2
figure 2

Correlation of true and predicted verbal fluency sum scores applying Relevance Vector Machine algorithm.

In order to quantify the contribution of the different EFs variables to VF performance, features with significant model weights in the approximate permutation test were identified. As can be seen in Fig. 3, 8 features belonging to 4 different EF tests and 2 hormones were identified.

Figure 3
figure 3

Features displaying strongest impact on prediction analysis.

The EF feature with the highest impact on the prediction analysis is “mean reaction times” of unannounced items of the spatial attention test WAF-R. The RVM analysis also revealed another “reaction time” feature of WAF-R which represents “reaction time” in items with a long stimulus onset asynchrony. The influence of attention on VF performance was also shown in a feature of WAF-G assessing the number of missed items in a divided attention test. These results show that participants reacting faster in attention tests also perform better in the VF task, identifying overall reaction speed and correctness as a central component in VF performance. Since WAF-R is not only assessing attention but also includes inhibitory requirements, these results highlight the role of attention and inhibition during VF performance. The explicit role of inhibition can be also detected in the variable “naming interference” of the Stroop test indicating that inhibition is an essential component to successfully produce words within or between two different categories. The analysis also revealed the predictive meaningfulness of cognitive flexibility and planning, by showing that “non-perseveration errors” in WCST and “process time” in SPM contribute essentially to the prediction analysis. The WAF-R was the only test presenting more than one variable represented in the most predictive features, both of which contain reaction time information. With regards to non-EF features, the RVM analysis also identified stress hormone cortisol and sex hormone estradiol as highly predictive variables (Fig. 3).

Corresponding results of the PLS analysis revealed a correlation of true and predicted values of r = 0.35 (p < 0.0001). However, in contrast to the results of the RVM analysis, approximate permutation test did not reveal any significant p-values identifying specific EF features. Detailed results of the PLS analysis are given in the Supplementary Material (Supplement 2).


The aim of the study was to elucidate to what extent VF performance can be explained by different subdomains of EFs and which types of EF variables contribute most strongly to the prediction of VF performance. In a first step, we correlated the different EF scores with the number of correctly produced words across the three semantic VF tasks. This analysis revealed significant correlations between SPM, Stroop, TMT, WCST, WAF-G, WAF-R and the VF task performance. These EF tests tap into two EF domains, namely cognitive flexibility and inhibition. We further investigated the relationship of EF scores and VF by prediction analyses to gain insight into the contribution of the different EF test variables. We showed that EF data predict VF performance and that beside cognitive flexibility and inhibition, reaction time and attention play important roles in predicting VF performance. Additionally, hormonal influences were identified as meaningful parameters to predict VF performance, highlighting the influence of inter-individual differences in VF performance. We first discuss the results in the direct context of the different EF subdomains cognitive flexibility, inhibition and working memory. Secondly, the involvement of EFs in the VF task is discussed in a more general context addressing the role of attention as well as the meaningfulness of reaction times. Finally, the influence of varying hormonal levels illustrates the impact of inter-individual differences on VF performance.

Multiple tests within the domain of cognitive flexibility were shown to be related to VF performance. The highest correlation was found for the SPM test followed by TMT and WCST. While the correlation analysis revealed a relationship of these tests with VF performance, the prediction analysis confirms the importance of the features describing errors in WCST. Additionally, the prediction analysis highlights the component of processing speed during SPM which was not identified by correlation analysis. In congruence with these results, previous studies have linked VF with cognitive flexibility21,53,54. Paula et al.21 investigated this relationship in healthy adults, using simple and switching semantic VF tasks and three different EF tests, including the TMT. They found that this particular measure of cognitive flexibility correlated well with both simple and switching VF tasks. The influence of cognitive flexibility on VF was also examined in a study by Troyer et al.54 who discussed the importance of cognitive flexibility assuming that two different abilities are needed for VF: (1) verbal memory for the creation of clusters and production of words belonging to a specific subcategory; (2) strategic search and cognitive flexibility which enables shifting between clusters54.

It should be noted that the present results concerning the SPM might have to be treated with caution since this test also encompasses aspects of fluid intelligence55,56,57. Due to the relationship between EFs and fluid intelligence58,59, it may not surprise that fluid intelligence also impacts VF performance as has been shown in studies on schizophrenia60 and bipolar disorder patients61 as well as healthy controls60.

Altogether, considering that three out of five cognitive flexibility tests contribute to VF performance, the present results point to a crucial influence of cognitive flexibility on VF performance, especially to cluster words and switch between categories.

In addition to the domain of cognitive flexibility, inhibition tests were also identified to play a role in VF performance. Specifically, both the correlation analyses and prediction analysis revealed that the naming interference of the Stroop test is related to VF. Previous studies report ambiguous results when investigating the relationship between inhibition and speech production. For example, a positive correlation between inhibition, assessed with a stop-signal task, and the reaction time in picture naming46 was found but could not be validated in VF tasks23. Discussing these ambiguous results, the authors suggest that while stop-signal tasks measure the participant’s ability stopping a planned response (response inhibition), VF tasks tend to involve the ability of suppressing the activation of competitive target responses (selective inhibition). In the present study selective inhibition was assessed by the Stroop test. In the naming subtask of the Stroop test the participant is asked to name the color in which the word is printed. Incongruent items, in which the color of the word does not match the written word evoke a longer reaction time, indicating that prepotent responses (i.e. the meaning of the written word) have to be suppressed. This is very similar to the kind of inhibitions that participants are challenged with in the VF task when needing to suppress words which have already been produced. In accordance with previous literature24, the present suggests that selective inhibition, specifically as reflected in the naming interference of the Stroop test, is a key parameter to drive VF performance. An alternative interpretation of the naming interference of the Stroop test and the total number of words produced in the VF task relates to the association between verbal processing speed and dominance of word reading. Individuals with a high verbal processing speed can be assumed to also show a stronger dominance of word reading then those with slower verbal processing. This stronger dominance of word reading, in turn, can be expected to go along with a stronger interference effect in the STROOP task, thus explaining the correlation between naming inference in the STROOP task and VF performance.

In addition to cognitive flexibility, working memory and inhibition we also investigated the role of attention in VF performance. While divided attention (WAF-G) was linked to VF performance in the correlation analysis, multiple variables of the spatial attention test (WAF-R) and divided attention test (WAF-G) contributed to VF performance in the prediction analysis.

Previous studies also described attention as a crucial cognitive function to perform VF54. In particular, it could be shown that divided attention particularly impacts the switching component of the VF tasks. At first glance, the influence of the spatial attention test on the speech task might be surprising since there are no spatial requirements in the VF task. Nevertheless, we assume that beside the component of attention per se the involvement of inhibition which is also part of this task might also have an effect on these results. With respect to the relationship of VF and divided attention, results are consistent with previous literature highlighting the influence of divided attention especially in the VF switching task54. To sum up, attention might be a crucial aspect for performing VF task. In particular, we hypothesize that attention is a fundamental and permanent cognitive requirement during updating the current status of already produced words as well as being efficient in producing words within or between two.

Surprisingly, the present study did not find a relationship between working memory and VF performance in both the correlation and prediction analysis. However, previous studies investigating the involvement of working memory in VF tasks also report ambiguous results: For example, one study assessing a digit-span and spatial-span test did not find a significant relationship between working memory and VF28. However, other studies have reported results that indicate updating of information and working memory performance have a high impact on VF scores23,62,63. Additionally, another study found the relationship between memory performance and VF to be specific to women19. The missing link between working memory and VF in the current study might be the type of variable which was selected to represent VF performance. Here, the sum of correctly produced words was used as the main measure of VF performance. While measuring VF performance, this variable does not contain information about the types of errors that occur during the VF task. Although word repetitions (perseveration errors) did not count as correct produced words this error type is not analyzed separately. However, perseveration errors are described as a sensitive indicator of working memory performance. Additionally, the relationship between working memory and VF performance measured with the total sum of words has so far been mainly investigated in patients or older participants22,27 but rarely in healthy controls. Thus, we assume that the number of correct produced words might be less meaningful to reflect working memory in healthy controls than error-specific parameters like perseveration errors.

In general, prediction results reveal specific EF tests and variables which are closely linked to VF performance. Besides differences in the strength of the relationship between certain EFs and VF, the EF test constructs themselves might also have partially influenced analysis. In particular, the reliability of some EF tests is discussed controversially64,65. Thus, some EF tests might not represent actual EF performance well and such a poor reliability of EF test might be reflected in the relationship of EF to VF as studied here.

Comparing correlation and prediction analyses, crucial differences in the results were observed. At first glance, the EF tests identified in the prediction analyses are similar to those of the correlation analyses but include additional variables. Specifically, the prediction analyses reveal a number of additional variables that measure how fast participants completed the tests and how many errors they made. While there is limited literature about the influence of processing speed in cognitive tasks on VF performance, some studies have addressed processing speed in general in the context of cognitive functions66,67. Another study found relationships between processing speed, working memory, inhibition and VF scores24. Additionally, poorer processing time has been associated with poorer cognitive performance in older adults66. The association between processing speed and EFs has been also observed in patients with depression67. In line with these findings, another study investigated the role of processing speed in schizophrenia patients and suggest that especially in working memory tasks assessing speed might be helpful to detect patterns of schizophrenia68. In respect to VF performance, processing speed was identified as being closely related to speech production31 and is reported as a predictor for VF in ageing69. Based on previous literature and our present findings, we assume that processing speed is a general aspect involved in both EFs tests and VF tasks. Particularly, it can be assumed that due to the time limit of 2 minutes in the VF tasks participants are zealous to name as many words as possible. This general behavior might also be relevant in EF tests. Thus, we suggest that processing speed and reaction times indicate that people acting fast in cognitive tasks also perform more successfully in VF tasks than participants thinking more in detail about their answer. Additionally, we assume that the complex influence of processing speed on VF performance might be beyond what can be described as a linear relationship. This might explain why the impact of speed is detectable in the prediction computation but was not found the correlation analysis.

In addition to the relationship between EFs and VF this study also assessed the influence of hormonal fluctuations to investigate the influence of inter-individual differences. The results showed that the stress hormone cortisol and the sex hormone estradiol have a high impact on VF performance. In line with other studies44,45 our analyses indicates that there is a negative correlation between cortisol level and the performance in cognitive functions. However, previous studies have also linked an increase of cortisol to better cognitive performance40,41. To our knowledge, rather little is known about the influence of estradiol on VF performance. However, studies investigating the influence of estradiol on EFs show that higher estradiol levels particularly leads to better performance in shifting and cognitive flexibility tasks37,39. Moreover, a link between hormonal contraceptives and VF performance43 has been shown. These results demonstrate that women taking hormonal contraception and consequently having significantly lower estradiol and progesterone levels, perform worse in the VF task than the control group43. The high impact of cortisol and estradiol in the prediction analysis suggest that fluctuating hormones are essential parameters for predicting VF performance and that intra-individual differences in hormone levels need to be considered when examining the relationship of cognitive functions and speech production tasks. Thus, it shows that although EF test variables are closely related to VF performance VF is a complex construct which is also driven by hormones and attention.

Our prediction analyses yielded important insights into the relationships between EFs, VF and inter-individual differences. However, some open questions remain concerning both inter-individuality and speech related topics. Firstly, due to the fact that inter-individuality influences both EFs and VF performance, further studies would benefit from gathering additional inter-individual parameters. For example, a test for assessing intelligence might be useful to control for the influence of intelligence on each test, especially on the SPM, making it possible to better differentiate the impact of cognitive flexibility on VF performance. Secondly, intra-individual differences could be further investigated by gathering saliva samples at two different time points. In this study saliva samples of each participant were pooled. A comparison of the hormones level before and after testing might help to provide insights into individual strategies dealing with stress. Beside inter-individual influences of hormonal levels on EFs41, studies also report intra-individual variety e.g. due to different phases of the menstrual cycle38. Therefore, an analysis of hormonal levels within each participant taken at different time points could reveal an additional dimension representing intra-individual differences.

Considering speech-specific issues, a vocabulary test could contribute to better understand inter-individual differences. Previous studies showed that the vocabulary size has a positive impact on VF performance54,70. Moreover, additional parameters reflecting VF performance could help to gain deeper insights of searching strategies during VF tasks. In particular, semantic analyses provide details of clustering and switching24 and could indicate the participant’s strategies which could then be linked to EF performance.

A more general consideration is related to the predictive methods as used in this study. An independent data set assessing the same variables that were used in the study does not yet exist. Thus, it was not possible to validate our results in a totally independent dataset. Instead, we applied 10-fold cross-validation by repeatedly training the model on parts of the data while keeping a subset out as a validation sample. However, we are aware of the need to validate our results in an independent dataset to better generalize our results and suggest a replication of this study on an independent sample which could prevent study-specific biases. However, due to the broad and specific collection of the EF test battery finding a similar data set could be difficult. An additional independent dataset with similar EF tests could be used to test split-half reliability investigating the construct of EF tests. Due to the high number of participants which is needed to apply machine learning methods it was not possible to split our data in two groups and running the prediction analysis on the split data. The ambiguous results of the RVM and PLS analysis also need to be considered. While both approaches revealed a significant correlation between true and predicted values, the PLS approach did not identify any significant features (Supplement 2). This might be due to the fact that PLS is a non-sparse machine learning method, which will include all features in the prediction model. In contrast, it is the nature of sparse models like RVM to build the prediction model based on most relevant features only.

Due to the high number of participants and the large battery of EF tests this study provides a detailed view on the involvement of EFs in VF tasks and examines the influence of fluctuating hormones. It investigated to what extent EF tests can represent semantic VF performance and shows that cognitive flexibility and inhibition are the main domains involved in performance on the VF task. Additionally, attention seems to be a central component of the VF task. The most striking observation to emerge from the data analysis was the new and more detailed view of the EF tests and variables that are best at predicting VF performance. While correlation analyses provided first insights into the relationship of EFs and VF, the prediction analyses revealed the importance of speed parameters. In particular, our results suggest that beside the influence of specific EFs, more general components such as attention and speed are crucial aspects of successful VF performance. These results also highlight the advantage of the prediction analysis since it revealed concrete variables of EF tests which also represents cognitive abilities not directly linked to specific EF subdomains or representing standard variables.

A better understanding of the cognitive demands that are required for the successful performance of VF tasks can potentially lead to a more wide-spread use of VF tests in the clinical context, thus EF tests that tend to be time-consuming and inaccurate. Additionally, VF tests tend to better reflect real-life conditions than lab-based EF batteries. A detailed knowledge of meaningful test variables could later on lead to insights into which subdomains of EFs could be replaced by VF tasks and which subdomains of EFs still have to be assessed by additional EF tests. This link between EF and VF represents a first step towards a speech-based EF-test. Furthermore, it indicates that in investigating the relationship of EF and VF the complex construct of VF performance should be considered in research and clinical context.

Furthermore, taking the influence of varying hormonal levels into account our study suggests that beside inter-individual differences intra-individual fluctuations could play an important role in evaluating VF performance in clinical context.