Introduction

Working memory is a temporary storage system under active attentional control1. The capacity of working memory is highly limited. For example, on average, people can only hold 3-4 items in visual working memory2,3,4. Although having a limited capacity, working memory has been shown to play an essential role in cognitive functions, such as fluid intelligence5, reasoning6, reading comprehension7,8, second language proficiency of adult learners9, and academic attainment10.

However, although research has involved an assumption that different kinds of estimation are all reliable measurements for working memory capacity, this has not been thoroughly examined. Initially, researchers focused on verbal working memory by using paradigms such as verbal recall tests8, the classic digit N-back task11, and the number counting task12. The test-retest reliability of working memory capacity estimation based on these tasks has been examined. For example, a study used verbal recall tasks to test 4~15-year-old children and retested them after 5~10 days. The results showed that the verbal recall tasks had an average test-retest reliability of 0.56 for non-word-syllables, 0.72 for words, 0.81 for digits, and 0.61 for sentences and counting numbers13. Another study showed that the test-retest reliability of working memory capacity in a spatial N-back paradigm varies from 0.493 to 0.857, depending on the N and whether accuracy or reaction time of the task is calculated14.

Studies on visual working memory (VWM) have relied on the change detection task to quantify VWM capacity2,3. Participants usually show a limitation of visual working memory capacity at about 3–4 objects’ worth of information4, which has been linked with visual search15 and multiple-object tracking performance16. A study also revealed that individuals with higher VWM capacity are more efficient at excluding unnecessary items during task performance17.

Since researchers use an estimation of individual VWM capacity to account for individual differences in other cognitive functions, it is essential to evaluate the reliability of VWM capacity estimation based on the change detection task. However, only a few studies have examined the reliability of VWM capacity estimation. For instance, Johnson et al.18 examined the test-retest reliability of VWM capacity estimation with a separation of 1.5 years between the two tests. They found a test-retest correlation of 0.77. However, since their study only sampled 31 schizophrenia patients and 14 healthy participants, the representativeness of the result is questionable. Thus, whether VWM capacity estimation is reliable across tests remains essentially unknown.

Test-retest reliability is critical not only for measuring individual differences but for the reproducibility of psychological findings. For example, a large reproducibility project19 examined the reproducibility of 100 experiments published in social and cognitive psychology journals. The results showed that only 38 of 100 experiments were rated to have replicated the original results. Thus, the test-retest reliability of VWM capacity estimation provides a baseline level of the reproducibility of VWM capacity estimation given that the subjects and task are the same in the two tests.

Some researchers have suggested that VWM capacity may not have a satisfactory internal consistency across different set sizes in a single test20 because estimation of visual working memory capacity relies heavily on the set size (number of to-be-remembered items) of visual stimuli in the change detection task2,3. It has also been suggested that change detection tasks using a single-probe (only the possibly-changed square was tested) and using a whole-display-probe (all squares including the possibly-changed one were tested) yield different reliability20. However, studies have shown that there is a high correlation between different types of memory materials in the change detection task21 and different working memory tasks22. Furthermore, Xu et al.23 conducted one session of single-probe tests with 78 participants every day for 30 continuous days and the 31st session after a month. The correlation coefficientsbetween adjacent sessions varied from 0.53 to 0.81, and importantly they were positively correlated with their session numbers, indicating an increased test-retest reliability as sessions grew in number. However, the test-retest reliability of VWM capacity estimation in a whole-display-probe task has not yet been investigated.

In addition to set size and repeated practice, other factors might also influence the test-retest reliability of VWM measures. For example, it is known that time of day affects people’s ability to retrieve information from long-term memory24. A recent study showed that there are also circadian variations in the accuracy of visual working memory performance using a spatial working memory task. The variation of visual working memory performance showed a positive correlation with the rectal temperature by a 3-hour delay25. Moreover, the time of day effect can be modulated by participants’ chronotype. For example, the performance of cognitive tasks is better when individuals are tested at their preferred time (e.g. the morning for morning-type people, and the evening for evening-type people)26. Additionally, some other daily functions and activities, such as glucose absorption27, exposure to computer screen light28, a single session of physical exercise29 and so on, can also immediately influence cognitive performance. Therefore, it can be expected that time of day might also affect the estimation of visual working memory capacity.

In summary, some studies have examined the test-retest reliability of visual working memory capacity estimation18,23. However, the reliability of the lateralized whole-display-probe task has not yet been examined. Most electroencephalography (EEG) studies of visual working memory have used a “lateralized” variant of the whole-display-probe task, in which participants pay attention to stimuli in a hemi-field and ignore the distractors from the contra-lateral hemi-field. According to Pailian and Halberda’s work20, reliabilities can be different in different visual working memory tasks. Thus, the present study aimed to examine the test-retest reliability of visual working memory capacity estimation with a lateralized whole-display-probe change detection task. Additionally, the present study also examined whether variation in testing time is an important source of test-retest variation for visual working memory capacity.

Method

Participants

96 subjects participated in this study (18~28 years old; 48 females). Before the beginning of the experiment, they signed the informed consent form and received monetary compensation. All of the participants had normal or corrected to normal vision. This study was approved by the Institutional Review Board of the Institute of Psychology, Chinese Academy of Sciences. Moreover, all experiments were performed following relevant guidelines and regulations.

Apparatus, stimuli and procedure

The present study adopted the change detection paradigm to measure visual working memory capacity2. The stimuli were presented on a 19 inch CRT display with a resolution of 1280 × 1024 pixels at a refresh rate of 75 Hz. Participants were seated 60 cm from the display with their heads rested on a chin-rest.

The events of a trial are illustrated in Fig. 1. In each trial an arrow was presented above a central fixation point for 200 ms, pointing to either the left or right to indicate a cued hemi-field. After a 300 ms blank interval, the first array of 4, 6, 8, 10, or 12 colored squares appeared on the gray background for 100 ms. The size of each square was 0.65° * 0.65°, with a minimum distance of 2° between two squares. An equal number of colored squares were distributed in the left and right hemi-fields. Participants were instructed to only memorize the colored squares in the cued hemi-field. Then a blank screen appeared for 900 ms, which was followed by the second array of colored squares for 750 ms. The second array of squares was either identical to the first array or different from the first array in that one square in the cued hemi-field changed color. Then a blank screen remained until the participant responded by pressing corresponding keys. Since participants only had to memorize the squares in the cued hemi-field, the memory set size was 2, 3, 4, 5 or 6.

Figure 1
figure 1

The schematic illustration of events in a trial.

The first 35 participants were tested with memory set sizes of 2, 3, 4 or 5. However, the capacity estimate for two participants equaled 5, indicating that their working memory capacity might exceed the largest memory set size of 5. Thus, the largest memory set size was increased from 5 to 6. The other 61 subjects were tested with memory set sizes of 2, 3, 4, 5 or 6.

In the task, there were 100 trials for each set-size, and all trials were presented randomly. There was a 30-second break after every 100 trials. Each participant had to take the same test twice with a separation of 3 to 16 days.

Working memory capacity measurement

According to Pashler30, the capacity of visual working memory is measured by the K value, the formula of which is as below. Compared to the other commonly used method proposed by Cowan31, Pashler’s method is more appropriate for the current paradigm32. The formula is as follows in equation 1:

$$K=setsiz{e}^{\ast }(P(hit)-P(FA))\div(1-P(FA))$$
(1)

where P(Hit) = hits/(hits + misses), and P(FA) = false alarms/(false alarms + correct rejections).

In addition to the K values of each set size, we also computed the average K value (Kmean) and maximum K value (Kmax) across all set sizes as estimations for each participant’s visual working memory capacity.

To assess the test-retest reliability of K estimates, correlations between K estimates of the two tests were calculated. The test-retest reliability was considered good when r ≥ 0.71; fair 0.51 ≤ r ≤ 0.70; weak 0.31 ≤ r ≤ 0.50; little or none r ≤ 0.333.

Analysis of the time of day effect

The 96 subjects were divided equally into 2 groups based on the median time difference (48 minutes) between the two tests. The Same-time group (48 subjects, the time difference ranged from 0~44 minutes, mean = 13.23 min, sd = 10.02) and Different-time group (48 subjects, ranging from 52~688 minutes, mean = 217.44 min, sd = 162.95). The test-retest correlation coefficients of the two groups were compared by using Snedecor’s method34. This method compares correlations of two independent samples on the same pair of variables by using the following formula 2:

$$Z=\frac{{Z}_{1}-{Z}_{2}}{\frac{1}{({n}_{1}-3)+({n}_{2}-3)}}$$
(2)

Z1 and Z2 are Fisher’s Zs35 for correlation coefficients. Also, n1 and n2 are the sample sizes of the two independent samples.

Results

Descriptive statistics

The hit rates, false alarm rates, and K values under each memory set size for the two tests are listed in Table 1.

Table 1 The means and standard deviations of hit rates, false alarm rates, and K values.

Test-retest improvement

The K values were submitted to a repeated-measures ANOVA with memory set size and test-retest as two within-subject variables (see Fig. 2). Results showed a significant main effect of test-retest, F = 103.104, p < 0.001, η2p = 0.632, with the larger K in the second test than in the first test. The main effect of memory set size was also significant, F = 220.563 p < 0.001, η2p = 0.935, with larger K values as set size increased. The interaction was also significant, F = 48.566, p < 0.001, η2p = 0.647. Further analysis showed that test-retest improvements were significant at all levels of memory set size (all ps < 0.05).

Figure 2
figure 2

The average K values for each set size in the two tests.

The test-retest improvements of K values were possibly due to the enhancement of hit rate or the decline of the false-alarm rate. To address this, we undertook ANOVA analysis on both hit rate and false-alarm rate as for K values. Results showed that hit rate (F = 49.081, p < 0.001, η2p = 0.454) was improved in the second test compared with the first test. Also, the false-alarm rate in the second test (F = 9.822, p = 0.003, η2p = 0.141) was reduced compared with the first test.

Test-retest reliability

Pearson product-moment correlations between the K values of the two tests under each set size are listed in Table 2. Results showed that at all set size levels, the K values in the second test were significantly correlated with the corresponding K values in the first test.

Table 2 Pearson’s correlations between Ks for each corresponding set size in the two tests.

Also, as the memory set size increased, the test-retest reliability coefficients for each corresponding set size also showed a rising trend (Fig. 3). To examine whether this trend was significant, the modified Pearson-Filontest36 as in equations (3) and (4), was used to compare test-retest reliability coefficients across set sizes. This test is to compare correlation coefficients within the same sample but with non-overlapping pairs of variables37.

$${Z}_{PF}=\sqrt{\frac{n-3}{2}}\times \frac{{Z}_{AB}-{Z}_{XY}}{\sqrt{1-\frac{kPF}{2(1-{r}_{AB}^{2})(1-{r}_{XY}^{2})}}}$$
(3)

where kPF is computed by

$$\begin{array}{rcl}kPF & = & ({r}_{AX}-{r}_{BX}\times {r}_{AB})\times ({r}_{BY}-{r}_{BX}\times {r}_{XY})+({r}_{AY}-{r}_{AX}\times {r}_{XY})\\ & & \times \,({r}_{BX}-{r}_{AX}\times {r}_{AB})+({r}_{AX}-{r}_{AY}\times {r}_{XY})\times ({r}_{BY}-{r}_{AY}\times {r}_{AB})\\ & & +\,({r}_{AY}-{r}_{AB}\times {r}_{BY})\times ({r}_{BX}-{r}_{BY}\times {r}_{XY})\end{array}$$
(4)

A, B, X, Y represent the two non-overlapping pairs of variables. ZAB and ZXY are the Fisher-z transformed result of rAB and rXY.

Figure 3
figure 3

The test-retest correlations between Ks for each set size in the two tests. Panel a illustrates the correlations between K for set size 2 in the two tests. Panel b illustrates the correlations between K for set size 3 in the two tests. Panel c illustrates correlations between K for set size 4 in the two tests. Panel d illustrates correlations between K for set size 5 in the two tests. Panel e illustrates correlations between K for set size 6 in the two tests.

Results showed that only the test-retest reliability coefficients for set size 6 were significantly higher than those for set size 2 (Tables 3 and 4). The sample size was 61 in each pair comparison including set size 6, and 96 in those not including set size 6.

Table 3 Comparisons between test-retest reliability coefficients for each set size.
Table 4 Comparisons between test-retest reliability coefficients for each set size.

The internal consistency across memory set sizes within a single test

The within-test consistency across set sizes refers to the correlations between K values for different memory set sizes within a single test. The correlations between K values for different memory set sizes in the first test are listed in Table 5. All of the correlations were medium or high.

Table 5 Pearson’s correlations between Ks for each set size in the first test.

We also examined the within-test consistency across set sizes in the second test. The correlations between K values for different memory set sizes in the second test are listed in Table 6. All of the correlations were medium or high.

Table 6 Pearson’s correlations between Ks for each set size in the second test.

The time of day effect on test-retest reliability

The test-retest reliability coefficients for the two groups (the same time of day v.s. different time of day) are listed in Table 7. We examined whether participants who took the two tests at the same time of day showed a higher test-retest reliability than participants who undertook the two tests at different times of day. As Table 7 shows, in the K(setsize3), K(setsize5), K(mean) and K(max) conditions, the test-retest reliability coefficients for the same time of day group were significantly higher than those for the different time of day group.

Table 7 Test-retest reliabilities for the same-time group and the different-time group.

Discussion

The present study examined the test-retest reliability of VWM capacity estimation by using the change detection paradigm. The test-retest correlations between K values for corresponding set sizes in the two tests were comparable to those in the verbal recall tasks, as found by Gathercole, et al.13, varying from fair to good. The results also showed a trend that the test-retest reliability increased as the set size became larger. In addition, we found that the test-retest reliability of VWM capacity estimation was higher when the two tests occurred at the same time of day than at different times of day.

The test-retest reliability of VWM capacity estimation (K values) was reasonably good in the present study, indicating that K values can be a relatively reliable index of individual VWM capacity. Secondly, a trend was found showing that the larger the memory set size, the higher the corresponding test-retest reliability for the K values. This trend was possibly due to a ceiling effect at set sizes 2 and 3 (see Fig. 2a for set size 2 and Fig. 2b for set size 3), which were lower than many participants’ true VWM capacity4. This ceiling effect made it difficult to detect individual differences among participants. Thus, it is important for future work to use a memory set size that is at least larger than 4 when estimating individual VWM capacity.

In this study, the performance of the second test was better than that of the first one, indicating a practice effect with a 3~16 day interval (the time between 2 tests) between the two tests. This finding is inconsistent with Johnson et al.’s18 research, which found no practice effect for visual working memory tests with a 1.5-year interval. It is possible that the practice effect waned with the greater interval between the two tests. However, the critical point of tim eat which practice effects disappear has yet to be studied. Thus, it is important for future work to allow for sufficient practice before formal measurement because the K value based on a short version of the change detection task might result in underestimation of VWM capacity. This conclusion is similar to that from Xu and colleagues’ work, which found that adjacent-session correlation increased along with session number within 31 sessions23. However, more factors able to influence retest reliability should be discussed. It is not practical to test research subjects dozens of times in every experiment.

Reasonably good correlations were found between different set sizes in the same test, in contrast to the quite low correlations between different set sizes in a previous study20. This discrepancy might be due to methodological differences between the studies. First, the included set sizes were different. The current research used memory set sizes of 2, 3, 4, 5, and 6, while Pailian and Halberda20 tested memory set sizes of 2, 4 and 8. The larger step size of the memory set sizes in the Pailian and Halberda20 study might account for the lower correlations between different set sizes. Second, a larger sample was included in the current research (96 participants) compared with the 14 participants in Pailian and Halberda’s20 study.

Test-retest reliability was higher when the two tests occurred at the same time of day than at different times of day. This time of day effect was significant for the mean and maximum values of Ks across set sizes, and also for the values of K for set sizes 3 and 5. According to a meta-analysis in Schmidt and colleagues’ review, cognitive performance in a phonological working memory task25, visual working memory task25 and visual selective attention38 can be influenced by circadian rhythm26. Also cognitive performance can be affected by daily activities27,28,29. Therefore, at different times of day, participants’ performance varied in different directions, causing a reduced correlation of the test results, while tests taken at the same time of day had higher retest reliability. However, it is worth noting that the time of day effect on test-retest reliability was not significant for set sizes of 2, 4 and 6. There are several possible explanations for this variation. First, the K value for a specific set size might not be sufficiently reliable, especially for small set sizes. Second, time of day was a rough correlate of circadian rhythms and daily activities, which might also add variation to the results. Finally, the relatively smaller sample in set size 6 might explain the absence of a time of day effect. Thus, the time of day effect needs further examination.

In conclusion, with appropriate control, the estimation of VWM capacity based on the change detection task is highly reliable. The average or maximum K values across multiple set sizes are also reliable. Results from the current study also indicate that future research on measuring individual working memory capacity should use set sizes larger than 4, and should allow for sufficient practice before formal measurement.