Main

The relevance of neuropsychological testing to psychiatric diagnosis and treatment has been the focus of recent investigation (Sharma and Harvey 1999). Traditional neuropsychological test batteries have been developed and validated in neurological populations with focal and diffuse lesions (Benton 1994; Kaplan 1990). Assessment of cognitive deficits has become integrated into neurological research and practice. For example, a consensus battery established for Alzheimer's research (Consortium to Establish a Registry for Alzheimer's Disease; CERAD) has benefited the field by facilitating collaborative studies across centers (Morris et al. 1989). Similarly, the value of neuropsychological testing in assessing effects of toxin exposure has been recognized, and standards have been established (Baker et al. 1985). Neuropsychological batteries usually include measures of executive functions (abstraction and mental flexibility, attention) as well as verbal and spatial memory, language, spatial processing, and sensorimotor function (Benton et al. 1994; Golden et al. 1991; Halstead 1947; Jarvis and Barth 1994; Reitan and Davison 1974; Saykin et al. 1991, 1995). These batteries have provided a method to link domains of cognitive performance with regional brain functioning, and algorithms have been developed to test models of such linkage formally (e.g., Gur et al. 1990). Application of neuropsychological batteries in psychiatric research and practice was initially slow, but more recently it has been incorporated into efforts to understand the involvement of neural systems in the pathophysiology of such major disorders as schizophrenia (Censits et al. 1997; Gold et al. 1992; Goldberg et al. 1987; Heinrichs and Zakzanis 1998; Mirsky and Duncan 1986; Saykin et al. 1991, 1994).

Although the traditional neuropsychological batteries have been useful in establishing links between behavioral deficits and brain dysfunction, they have several limitations when considered in the context of clinical neuroscience and treatment research. Weaknesses related to use in large-scale multicenter studies are length, complexity of administration and scoring, and vulnerability of data-handling methods. Traditional batteries take several hours to administer and require a trained technician supervised by a neuropsychologist. Scoring often involves expert judgment, and procedures for guarding against drift are essential. Data entry is manual, which necessitates costly procedures to eliminate errors. An additional limitation of traditional batteries is that they are difficult to administer during functional imaging studies aimed at probing brain systems that regulate behavior (Gur et al. 1992). Subtests in these batteries are typically constructed to detect broadly defined deficits rather than to measure specific, unitary neurocognitive constructs. This feature hinders bridging, on the one hand, to basic human research in cognition and, on the other hand, to animal work. Finally, traditional tests necessarily confound speed and accuracy, two complementary features of performance that have been shown to relate differently to regional brain activation (Gur et al. 1988).

To address these limitations, we have developed a set of computerized neurobehavioral measures aimed specifically at integrating structural and functional neuroimaging studies. Our general approach to task development and validation was previously detailed (Gur et al. 1992). Here we present data on the implementation of the computerized tests in a normative sample that has also received the traditional battery. Because the computerized tests were designed to tap very specific neurocognitive systems and not to mirror the traditional battery, our goal was not to replace the traditional battery for all neurocognitive domains. However, the computerized tests retain reasonable psychometric properties, and our more limited goals were to present the psychometric properties of the computerized measures and to compare the sensitivity of the computerized and traditional measures to sex differences and age effects, which are major moderating variables in studies of brain and behavior (Kimura and Harshman 1984; Kramer et al. 1997; Saykin et al. 1995; Zec 1995). For a neurocognitive scan to be considered sensitive to normal individual differences in these measures, it will need to show that: (1) women perform better on memory tasks (McGivern et al. 1997; Ruff et al. 1989), relative to men who perform better on spatial processing and motor tasks (Caplan et al. 1997; Collins and Kimura 1997; Gur et al. 1999; Saykin et al. 1995; Silverman et al. 1996); (2) age is associated with decline in performance, which is more pronounced for frontotemporal functions (abstraction, attention, memory), and more pronounced for speed than for accuracy (Earles and Kersten 1999; Laursen 1997; Park et al. 1996; Sliwinski and Buschke 1999; Verhaeghen and Salthouse 1997).

In addition to comparing the computerized and traditional measures on sensitivity to sex differences and age effects, we sought to examine whether the entire array of computerized tests could be considered a “scan” of neurocognitive abilities that would yield comparable measures of major neurocognitive domains. The main difficulty in correlating the traditional with the computerized measures is that the former confound speed and accuracy, while the latter separate them. To make the scores comparable, we defined “efficiency” scores for the computerized battery, in which accuracy is divided by speed (performance accuracy per unit of time). Another difficulty in comparing the scores is the way in which the tasks were designed. Traditional tests are aimed at broadly defined domains of major clinical relevance, while the computerized tasks are narrowly defined to activate specific brain circuits. Although specific tests from the computerized scan have compared favorably with similar tests from the traditional battery (Kurtz et al. 2001; Ragland et al. 1995; Glahn et al. 1997, 2000), the traditional measures sample broader aspects of behavior, and we had to “blur” the grouping of the computerized tasks to include measures of Executive, Memory, Intellectual, and Sensorimotor functions.

METHODS

Subjects

Participants were 92 healthy individuals (44 men, 48 women), recruited by newspaper advertisement. They underwent medical, neurological, and psychiatric (SCID-NP; Spitzer et al. 1996) evaluations (Shtasel et al. 1991) including laboratory tests. Subjects had no history of a disorder or event that might affect brain function including hypertension (blood pressure >140/90), cardiac disease, diabetes, endocrine disorders, renal disease, chronic obstructive pulmonary disease, cerebrovascular disease, head trauma with loss of consciousness, seizure disorder, migraines, or other neurological condition. They had no family history of schizophrenia or affective illness in first degree relatives. Men were 30.9 ± 10.7 years old (range 19.2–69.1) with 15.1 ± 1.8 years of education (range 12–18). The corresponding values for women were age: 27.2 ± 8.1 years (range 18.2–62.5); education: 15.8 ± 2.3 (range 12–22). Men and women did not differ on any of these measures.

Neuropsychological Assessment

Traditional Neuropsychological Battery

A comprehensive paper and pencil neuropsychological test battery (Saykin et al. 1994, 1995; Censits et al. 1997) was administered by trained examiners. A second examiner independently rescored test data to eliminate errors and permit assessment of interrater reliability. These scores were entered into a computerized database and checked for range, internal consistency, and secular drifts. The test battery is listed in Table 1, including references for the administration and scoring procedures used.

Table 1

Test scores were standardized relative to all healthy people in our database (z-scores; mean=0, standard deviation = 1), and grouped into eight summary measures by averaging each subject's z-scores on tests assessing the same functional domain. Following Censits et al. (1997), summary measures were calculated for the following domains: abstraction (ABF), attention (ATT), verbal memory (VMEM), spatial memory (SMEM), language abilities (LAN), spatial abilities (SPA), sensory (SEN), and motor functions (MOT). The variables comprising these domains are presented in Table 1.
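The standardization and domain averaging described above can be illustrated with a minimal sketch. The test names, the raw scores, and the use of the sample standard deviation are assumptions made for illustration; the actual reference sample was all healthy people in the database, and the assignment of tests to domains is the one given in Table 1.

```python
from statistics import mean, stdev


def z_scores(raw_scores):
    """Standardize raw scores to mean 0 and SD 1 within the reference sample."""
    m = mean(raw_scores)
    sd = stdev(raw_scores)  # sample SD; an assumption, the estimator is not stated
    return [(x - m) / sd for x in raw_scores]


def domain_summary(z_by_test, tests_in_domain):
    """Average one subject's z-scores over the tests assigned to a domain."""
    return mean(z_by_test[t] for t in tests_in_domain)


# Hypothetical raw scores for four subjects on two tests of the same domain
raw = {"test_a": [10, 12, 14, 16], "test_b": [30, 20, 40, 50]}
z = {name: z_scores(values) for name, values in raw.items()}

# Domain summary score for the first subject
subject0 = {name: z[name][0] for name in z}
summary0 = domain_summary(subject0, ["test_a", "test_b"])
```

Averaging z-scores rather than raw scores keeps tests with different scales from dominating a domain score.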

Computerized Neuropsychological Scan

The computerized neuropsychological tests were designed to yield quantitative measures of performance on behavioral domains that can be linked to regional brain function. All tasks were developed on Macintosh® computers using the PowerLaboratory® platform (Chute and Westall 1997). The test battery is listed in Table 2, including references for published tasks. Because not all tests in the computerized battery have been published, a brief review of the tasks is provided. The task development and validation process was detailed in Gur et al. (1992).

Table 2

Briefly, the tasks are designed to be used in neuroimaging studies, yet provide sufficient psychometric sensitivity to enable their application in neuropsychological assessment. Tasks first underwent a process of construction and evaluation by a team of experimental and clinical neuropsychologists. This group took the tasks through conceptualization, initial item analysis, and construction of a preliminary version. The resulting version was submitted to psychometric study to assess reliability and construct validity and obtain normative data. Data were also obtained at that stage to document comparability of test versions and the effects of retesting (“practice effects”). The tests listed in Table 2 have undergone this process and can be administered using desktop or portable computers. The tests are as follows:

Abstraction Inhibition and Working Memory Task (AIM)

The AIM (Glahn et al. 2000) is designed as a measure of abstraction and concept formation with and without additional working memory loads. It presents subjects with five shapes: two in the upper right and two in the upper left corner of a computer screen, with a fifth target object appearing in the center of the screen, below the other stimuli (Figure 1a). The participant's task is to pair the target object with the objects on either the left or right. On half the trials, an additional working memory maintenance requirement is superimposed on this basic module by adding a delay between the presentation of the target and other objects. Total number correct and median reaction time for correct responses were selected as performance measures.
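The two performance measures used for this and most of the following tasks (total number correct and median reaction time for correct responses) can be computed as in the sketch below. The trial representation and the example values are hypothetical and are not taken from the task software.

```python
from statistics import median


def score_task(trials):
    """Return (total correct, median RT of correct responses).

    trials: list of (is_correct, reaction_time_ms) pairs, a hypothetical format.
    """
    correct_rts = [rt for ok, rt in trials if ok]
    total_correct = len(correct_rts)
    # The median RT is undefined when there are no correct responses
    median_rt = median(correct_rts) if correct_rts else None
    return total_correct, median_rt


# Hypothetical responses: three correct trials and one error
example_trials = [(True, 820.0), (False, 1500.0), (True, 640.0), (True, 910.0)]
accuracy, speed = score_task(example_trials)
```

Restricting the median to correct responses keeps fast guesses and slow errors from distorting the speed measure.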

Figure 1

Examples of stimuli used in the neurocognitive scan

Penn Inhibition Test (PIT)

The PIT was based on the competing programs component of the Executive Control Battery (Goldberg et al. 1989). The task is divided into visual and auditory subtests. The visual subtest begins by presenting the subjects either an individual blue dot or two sequential blue dots, with the instruction to make a single key press for two dots and a double key press for a single dot. Individuals are required to perform to a criterion (10 sequential correct responses) on each stimulus type before advancing to the next stage. During the second stage, subjects are presented with 20 trials in which stimuli are randomly alternated. This is followed by a final stage in which each stimulus type is presented for nine trials followed by a tenth trial, during which the alternate stimulus is presented. The auditory subtest follows the same format, but instead of visual dots, presents auditory clicks. The individual is instructed to make a single key press when they hear two sequential clicks and a double key press when they hear a single click. Total number correct and median reaction time for correct responses were selected as performance measures.
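The competing response rule and the training criterion can be expressed as a short sketch. The function names and data format are illustrative assumptions; the original task was implemented on the PowerLaboratory platform, not in Python.

```python
def expected_presses(n_stimuli):
    """Competing-programs rule: one stimulus calls for a double key press,
    two sequential stimuli call for a single key press."""
    if n_stimuli == 1:
        return 2
    if n_stimuli == 2:
        return 1
    raise ValueError("stimulus must consist of one or two dots (or clicks)")


def reached_criterion(responses_correct, run_length=10):
    """True once `run_length` sequential correct responses have occurred."""
    run = 0
    for ok in responses_correct:
        run = run + 1 if ok else 0
        if run >= run_length:
            return True
    return False
```

A response is scored correct when the number of key presses equals `expected_presses` for the stimulus shown or heard.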

Raven's Progressive Matrices

This is a computerized version of the standard paper and pencil task (Raven 1960). It is a multiple-choice task that requires subjects to conceptualize spatial, design, and numerical relationships that range in difficulty from very easy to increasingly complex. The computerized version was constructed by scanning and digitizing the stimulus cards from the original task. Instructions and scoring follow standard published procedures. Total number correct and median reaction time for correct responses were selected as performance measures.

Stroop

This is a computerized version of the standard paper and pencil task (Stroop 1935), designed to test an individual's ability to shift his or her perceptual set to conform to changing task requirements. The task requires subjects to name the color of the ink in which the words are written quickly, ignoring the content of the word. On some trials, the stimuli are incongruent (e.g., the word “blue” is printed in yellow ink), and the subject must inhibit the prepotent response of reading the word, rather than the color in which it is printed, to respond correctly. Because of problems with voice recognition software, the computerized version of the task does not accept oral responses and, instead, requires subjects to make a button press on a computer game pad that is configured with red, blue, yellow, and green keys. Therefore, on incongruent trials on the computerized version, the subject must press the colored button that matches the color of the word and inhibit pressing the colored button that matches the content of the word. The task randomly presents 30 congruent and 30 incongruent trials. Total number correct and median reaction time for correct responses were selected as performance measures.
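The trial structure of the button-press version can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the dictionary format, the function names, and the seeding are hypothetical, and the only details taken from the description are the four colors, the 30 congruent and 30 incongruent trials, and the random order.

```python
import random

COLORS = ["red", "blue", "yellow", "green"]  # the four game-pad keys


def make_stroop_trials(n_congruent=30, n_incongruent=30, seed=None):
    """Build a randomly ordered list of congruent and incongruent trials."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_congruent):
        color = rng.choice(COLORS)
        trials.append({"word": color, "ink": color, "congruent": True})
    for _ in range(n_incongruent):
        word = rng.choice(COLORS)
        # The ink color must differ from the color named by the word
        ink = rng.choice([c for c in COLORS if c != word])
        trials.append({"word": word, "ink": ink, "congruent": False})
    rng.shuffle(trials)
    return trials


def is_correct(trial, button_pressed):
    """A response is correct when the button matches the ink color."""
    return button_pressed == trial["ink"]
```

On an incongruent trial such as the word “blue” in yellow ink, `is_correct` accepts only the yellow button.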

Penn Continuous Performance Test (PCPT; Kurtz et al. in press)

During this task, the participant is asked to respond to a set of vertical and horizontal lines (a seven-segment display) whenever they form a digit (Figure 1b). Because each judgment is made on the basis of the present stimulus, working memory demands are minimized. Total number of true positive responses and median reaction time for true positive responses were selected as performance measures.

Penn Word Memory Test (PWMT; Gur et al. 1993)

The PWMT is a forced-choice recognition task in which participants are shown 20 target words and asked to try to remember them. The encoding trial is followed by immediate and 20 min delayed recognition trials, during which subjects are presented with the target words mixed with 20 distractors. New distractors are presented at each delay (for a total of 40) and are equated for frequency, length, concreteness, and imageability using Paivio's norms (Paivio et al. 1968). Total number of true-positive responses and median reaction time for true-positive responses were selected as performance measures.

Penn Face Memory Test (PFMT; Gur et al. 1993)

The PFMT parallels the PWMT and consists of 20 target faces and 40 foils (20 for each test trial). Stimuli are black and white photographs of faces balanced for gender and age (Figure 1c). All faces are of neutral emotional expression, as determined by 12 raters. Procedures for administration and scoring of the task are identical to those for the PWMT. Total number of true-positive responses and median reaction time for true-positive responses were selected as performance measures.

Visual Object Learning Test (VOLT; Glahn et al. 1997)

The VOLT was designed as a spatial analog of the California Verbal Learning Test (Delis et al. 1983). It uses 20 Euclidean shapes as learning stimuli (Figure 1d) that are presented over four learning trials, followed by short and long delay test recall. New distractor shapes are used in every test trial. Total number and median reaction time of true-positive responses across learning trials, short, and long delay were selected as performance measures.

Penn Verbal Reasoning Test (PVRT; Gur et al. 1987)

The PVRT consists of age-appropriate verbal analogy problems. Individuals are presented with 30 analogies in a multiple-choice format. Total number correct and median reaction time for correct responses were selected as performance measures.

Computerized Judgment of Line Orientation (CJOLO)

This is a computerized adaptation of the original paper and pencil task (Benton et al. 1975). Participants are shown two lines at an angle and are asked to indicate the corresponding lines on a simultaneously presented array. Task difficulty is defined by the length of the stimulus lines. Instructions and scoring follow standard published procedures. Total number correct and median reaction time for correct responses were selected as performance measures.

Pursuit Rotor Task (PRT)

This is a standard rotor-pursuit paradigm incorporated into the PowerLaboratory® platform (Chute and Westall 1997). The task consists of five 60 s blocks during which the subject is required to trace a dot moving around the circumference of a circle at a rate of 100 mm/s. Blocks are separated by 30 s, and subjects use a light pen on a digitizing tablet to make the response. The difference in time on target (in ms) during trial 5 minus time on target during trial 1 is calculated as a learning score and selected as a performance measure.
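The learning score is a simple difference between the last and first blocks. The sketch below assumes, hypothetically, that time on target is available as one value in milliseconds per block.

```python
def learning_score(time_on_target_ms):
    """Time on target in block 5 minus time on target in block 1 (ms)."""
    if len(time_on_target_ms) != 5:
        raise ValueError("expected time on target for exactly five blocks")
    return time_on_target_ms[4] - time_on_target_ms[0]


# Hypothetical times on target (ms) for the five 60 s blocks
example_blocks = [20000, 25000, 30000, 32000, 35000]
example_score = learning_score(example_blocks)
```

A positive score indicates that time on target increased from the first to the last block.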

These individual scores were grouped to scan the following functional domains: abstraction and flexibility (ABF), attention (ATT), verbal memory (VMEM), face memory (FMEM), spatial memory (SMEM), language (LAN), spatial abilities (SPA), and sensorimotor (SM). The variables comprising these functions are presented in Table 2. Note that most assignments of tests to domains are straightforward and unambiguous. The exception is the assignment of the Stroop as an attentional test, when it could be argued that it measures mental flexibility and belongs to ABF. Changing the assignment of this measure did not affect the overall results, and analyses comparing traditional with computerized measures combined ATT and ABF into a single “executive” domain.

Procedures

After establishing that participants were healthy, the purpose of the study was explained and informed consent was obtained. The traditional battery and the computerized scan were administered in a counterbalanced order within a week (mean ± SD, 2.4 ± 1.7 days). All participants received both procedures. Tests were administered in a standard testing room, which was well lit and quiet. The traditional battery was administered by trained neuropsychology Fellows, whereas the computerized scan was administered by Fellows or research assistants.

For both the traditional battery and the computerized scan, procedures were included to ensure that participants understood the instructions and could provide valid responses. For the traditional battery, this was done by interaction with the examiner. For the computerized scan, subjects first activated a program that enabled them to demonstrate facility in using the pointing device, and a brief training procedure preceding each test evaluated comprehension of test instructions and response requirements.

Data Analysis

The tests in both batteries were grouped into their functional domains using z-scores as described above. To test the sensitivity of each battery to sex differences in the neurocognitive profiles, a “female typical” gradient (Gur et al. 1999) was calculated as the average memory score (where women are expected to perform better than men) minus the spatial score (where men are expected to outperform women). For the traditional battery, the memory score included verbal and spatial measures, whereas for the computerized scan we added the face memory score. The hypothesis of sensitivity to sex differences was tested by determining whether the female typical gradient differed from zero within each sex, being higher than 0 in women and lower than 0 in men (paired t-tests, a stringent test of the hypothesis). We also compared this gradient between males and females to evaluate the hypothesis that the score is higher in women than in men (between-group t-test, a less stringent test).
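The gradient and the two tests can be sketched as follows. A paired t-test of memory against spatial scores is equivalent to a one-sample t-test of their difference against zero, which is the form used here. The pooled-variance form of the between-group test is an assumption, although it is consistent with the reported df = 90 for 44 men and 48 women. Function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def female_typical_gradient(memory_z, spatial_z):
    """Average of a subject's memory z-scores minus the spatial z-score."""
    return mean(memory_z) - spatial_z


def one_sample_t(values, mu=0.0):
    """t statistic for the mean of `values` against mu, df = n - 1."""
    n = len(values)
    return (mean(values) - mu) / (stdev(values) / sqrt(n))


def independent_t(a, b):
    """Pooled-variance t statistic for two groups, df = len(a) + len(b) - 2."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
```

Under this convention, `one_sample_t` applied to the women's gradients should be positive and to the men's gradients negative, and `independent_t(women, men)` should be positive.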

Sensitivity to the effects of aging was evaluated by calculating the Pearson product moment correlations between age and performance on the functional domains. It was hypothesized that age effects are more pronounced for frontotemporal functions: abstraction, attention, and memory. The computerized scan also permitted testing the hypothesis that age associated decline is more pronounced for speed than for accuracy.
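For reference, the Pearson product moment correlation and its associated t statistic can be computed as in this minimal sketch. Variable and function names are illustrative, and no study data are reproduced.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic for testing r against zero, with df = n - 2."""
    return r * sqrt((n - 2) / (1 - r * r))
```

A negative `pearson_r(age, domain_score)` corresponds to the hypothesized decline in performance with age.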

Direct comparison of the two approaches is hampered by differing goals and the confounding of speed and accuracy in the traditional battery. However, such comparison is desirable to establish test order effects and to examine their correlation. To overcome the greater targeted specificity of the computerized battery and consequent lack of overlap in specific neurocognitive domains, we combined attention and abstraction/flexibility on both batteries into an “executive functioning” measure, and word, face, and spatial memory on the computerized battery into a “memory” measure that was compared with the average of verbal and spatial memory on the traditional battery. Sensory was averaged with motor domains on the traditional battery profile to compare to the sensorimotor domain on the computerized profile. To avoid inappropriately correlating measures that separate accuracy from speed with those that do not, we generated an “efficiency” index for the computerized battery. Efficiency was defined as the score for accuracy divided by the score for speed (performance accuracy per unit of time). Order effects were then examined by analysis of variance (ANOVA) on each domain measure, with order (dummy coded 0 = traditional first, 1 = computerized first) as a grouping factor and approach (traditional, computerized) as a repeated-measures (within-group) factor. The correlation matrix between the traditional and computerized measures was calculated to examine comparability of scores.
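The efficiency index and the combination of domains can be illustrated with a minimal sketch. It assumes that accuracy is expressed as number correct and speed as median reaction time converted to seconds; the text specifies only that accuracy is divided by speed, so the units and the function names here are assumptions.

```python
from statistics import mean


def efficiency(total_correct, median_rt_ms):
    """Accuracy per unit of time: number correct per second of median RT."""
    if median_rt_ms <= 0:
        raise ValueError("median reaction time must be positive")
    return total_correct / (median_rt_ms / 1000.0)


def composite(domain_scores, members):
    """Average several domain scores into one broader measure."""
    return mean(domain_scores[m] for m in members)


# Hypothetical domain z-scores for one subject on the computerized scan
scores = {"ABF": 0.2, "ATT": 0.4, "VMEM": -0.1, "FMEM": 0.3, "SMEM": 0.1}
executive = composite(scores, ["ABF", "ATT"])
memory = composite(scores, ["VMEM", "FMEM", "SMEM"])
```

With these definitions, a subject who is both more accurate and faster receives a higher efficiency score than one who is only more accurate.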

RESULTS

Table 1 provides the means of men and women on the tests comprising the traditional battery, and Table 2 provides these means for the computerized scan. As can be seen in Table 2, the tests comprising the computerized scan had moderate to very high reliability. For comparability across tests and domains, subsequent analyses were performed on the z-transformed data.

Sex Differences

The “female typical” gradient showed the hypothesized direction of being higher for women than for men in both the traditional battery and the computerized scan. For the traditional battery, the gradient was significantly negative for men: −0.37 ± 0.66, t (paired) = 3.73, p < .001, but was not significantly positive for women: 0.11 ± 0.79, t (paired) < 1, ns. However, the difference between men and women was significant, between-group t = 3.16, df = 90, p < .001. For the computerized scan, the gradient was significantly negative for men: −0.39 ± 1.05, t (paired) = 1.99, p < .025, and significantly positive for women: 0.39 ± 0.85, t (paired) = 2.66, p < .01. The between-group difference was also significant, t = 3.24, df = 90, p < .001.

The profiles of men and women (Figure 2) on the traditional battery (top bar) and computerized scan (bottom bars) indicated considerable uniformity of variance for both, with no significant differences between variances by Bartlett's test. The two batteries yielded similar profiles of sex differences, with the exception that the traditional battery did not show a sex difference in attention, whereas the computerized scan showed better performance for women. Within the memory measures, the traditional battery showed a more pronounced superiority in females for verbal memory, whereas the measure of face memory, added to the computerized scan, also showed better performance in women. The accuracy and speed measures indicated that sex differences can be seen in specific aspects of performance. Although women were both more accurate and faster for the attention and face memory measures, men were more accurate but not faster for the spatial-processing measures.

Figure 2

The neurocognitive profile of men and women (means ± SEM) on the traditional battery (top bar) and the computerized scan accuracy (middle) and speed (bottom). Neurocognitive domains as defined in Methods

Age Effects

For the traditional battery, ABF and VMEM showed significant decline with age, r = −0.31 and −0.24, respectively, df = 90, p < .01, one-tailed. An examination of the accuracy and speed measures provided by the computerized scan showed few correlations between age and accuracy, namely for ABF and SMEM, r = −0.49 and −0.31, respectively. By contrast, correlations between age and speed were evident in most domains, including ABF −0.46, ATT −0.25, FMEM −0.35, SMEM −0.23, LAN −0.24, and SPA −0.26, all p < .01, one-tailed.

Order Effects and Correlations Between Traditional and Computerized Measures

No main effects or interactions of order of administration approached statistical significance in the ANOVA for any of the functional domain scores. The correlations between the domain scores from the traditional battery and comparable “efficiency” scores from the computerized battery are presented in Table 3. As can be seen, the correlations between comparable domains (diagonal) are significant and higher than the correlations with other domains. Note that the correlations are moderate and that for the “sensorimotor” domain the correlation is low.

Table 3

DISCUSSION

The computerized scan approach seems to be feasible with healthy people, who responded favorably, were able to follow instructions, and performed at over 70% accuracy levels. Participants who had no previous experience with computers could be trained using the training module. Furthermore, older adults had no difficulties with the display and procedures.

The computerized scan was at least as sensitive as the traditional battery to sex differences in cognitive performance. An index reflecting the hypothesized “female typical” gradient of superiority in memory relative to spatial processing showed a similar magnitude of sex differences for the computerized scan and the traditional battery. Both methods were able to provide the less stringent support for the hypothesis by showing a significant difference between men and women in the expected direction of women having more positive values than men. However, although the traditional battery provided the more stringent support of a significant within-sex deviation from zero only for men, the computerized scan met this criterion for both men and women.

A more detailed evaluation of the major neurocognitive domains indicated that, although verbal memory showed the largest difference favoring females on the traditional battery, the difference for word memory was marginal for the computerized scan. This probably reflects the inclusion in the traditional battery of the CVLT, which examines verbal learning. On the other hand, the face memory task from the computerized scan showed a robust difference favoring females. The computerized scan also revealed a sex difference in attention, with women showing better performance on this version of the CPT. Although poorer performance of men on attentional tasks seems consistent with the higher preponderance of attention deficit disorders in boys (Rhee et al. 1999), the issue of whether there are sex differences in attention, or whether they are stimulus-dependent, is still debatable (Dittmar et al. 1993; McGivern et al. 1997). Perhaps the addition of working memory demands in other versions of the CPT masks the better ability of women to attend vigilantly.

The advantage of the computerized scan in providing separate measures of speed and accuracy was manifested in the ability to better characterize sex differences in performance. Although women are both more accurate and faster than men for the face memory task, the better performance of women in attention is more pronounced for accuracy than for speed. In men, the better performance on spatial tasks was evident only for accuracy, with speed being equal to that of women. These complementary aspects of performance may also be differentially linked to neural activation, as can be assessed in functional imaging studies (Gur et al. 1988, 1997).

Both methods were about equally sensitive to the effects of healthy aging. Furthermore, as would be expected in this healthy population, the associations between cognitive performance and chronological age, although significant, were consistently small for both traditional and computerized measures. In both, aging showed an effect on measures related to frontotemporal functioning. Here again, however, the computerized scan had the advantage of enabling better specification of the facet of performance more vulnerable to aging. Consistently, the correlations with age were more pronounced for speed than for accuracy. This is compatible with other studies evaluating age effects on performance (Zec 1995). Thus, decline in performance that includes accuracy can be interpreted as reflecting processes distinguishable from speed-related decline associated with healthy aging.

Global measures of functional domains showed moderate correlations between the traditional and computerized measures, with the exception of the sensorimotor domain, where the measures were uncorrelated. Although the moderate correlations among comparable domains are encouraging, they are not sufficiently high to suggest that the approaches are interchangeable. Thus, correlations of 0.52 between the computerized and the traditional measures of executive functions and 0.53 between the respective measures of memory are insufficient to ensure an investigator who has been using the traditional battery that equivalent measures will be obtained with the computerized approach. However, these global measures tap areas that do not entirely overlap. The computerized tasks were designed to be used in functional imaging studies and were, therefore, targeted to very specific neurocognitive domains. The traditional battery, by contrast, consists of tests that were psychometrically designed to tap broader domains, emphasizing sensitivity to deficits associated with brain disorders. It is noteworthy that when individual computerized tasks, included in the present battery, have been directly compared with traditional measures, they yielded high correlations and reasonable construct validity (Glahn et al. 2000; Kurtz et al. 2001). Additional computerization of tasks not currently covered by the scan is needed to generate a set of measures more nearly equivalent to the traditional parameters.

The cognitive scan in its present form has several limitations. Most importantly, it does not incorporate the technology needed for assessing verbal learning as is traditionally done with the CVLT. We have evaluated necessary voice recognition technology and have not been able to configure the appropriate platform for such testing. The scan is also limited to measuring cognitive domains and does not include measures of emotion processing. However, computerized measures of facial affect discrimination have been developed (e.g., Kohler et al. 2000) and could be incorporated in future studies. Although other tests could also be added, the purpose of this scan is to provide a time-efficient, reliable, and error-free estimate of the neurocognitive profile.

The computerized scan approach seems to be ready for clinical and research applications. Although it still lacks measures requiring voice recognition, it is time efficient, easy to administer, and provides data that assess both accuracy and speed. The computerized format also facilitates data acquisition, transfer, and analysis, features that may be essential for large-scale collaborative studies. Finally, the availability of the worldwide web makes it convenient to upgrade and update versions for distribution, assuring uniform data collection. Our finding that the scan is sensitive to sex differences and age effects in healthy people suggests that it would be sensitive to the less subtle effects of neuropsychiatric illness. The ability to administer the computerized scan to young and older adults is encouraging, and we are currently evaluating the feasibility of its extension to children.