Reliability in adolescent fMRI within two years – a comparison of three tasks

Longitudinal developmental fMRI studies have only recently begun to focus on within-subject reliability using the intraclass correlation coefficient (ICC). It remains largely unclear which degree of reliability can be achieved in developmental studies and whether this depends on the type of task used. Therefore, we aimed to systematically investigate the reliability of three well-classified tasks: an emotional attention, a cognitive control, and an intertemporal choice paradigm. We hypothesized that reliability would be higher in the cognitive task than in the emotional or reward-related task. In total, 104 healthy mid-adolescents were scanned at age 14 and again at age 16 (mean interval M = 1.8 years) using the same paradigms, scanner, and scanning protocols. Overall, we found both variability and stability (i.e. poor to excellent ICCs) depending largely on the region of interest (ROI) and task. Contrary to our hypothesis, whole brain reliability was fair for the cognitive control task but good for the emotional attention and intertemporal choice tasks. Subcortical ROIs (ventral striatum, amygdala) resulted in lower ICCs than visual ROIs. Current results add to the yet sparse overall ICC literature in both developing samples and adults. This study shows that analyses of stability, i.e. reliability, are helpful benchmarks for longitudinal studies and their implications for adolescent development.

To date, the field of longitudinal developmental fMRI studies is growing 1 . However, it remains largely unclear which degree of quantitative reliability can be achieved in developmental studies.
The preferable quantitative reliability measure in fMRI studies is the intraclass correlation coefficient (ICC 2 ) with the following formula:

$$\mathrm{ICC} = \frac{MS_{\mathrm{between}} - MS_{\mathrm{error}}}{MS_{\mathrm{between}} + (k - 1)\, MS_{\mathrm{error}}}$$

The total sum of squares in this model is split into between-subjects ($MS_{\mathrm{between}}$) and error ($MS_{\mathrm{error}}$) mean sums of squares, and k represents the number of observations 3 . The ICC, ranging from 0 to 1, indicates how much of the total variance in two measurements is due to variance between participants. An ICC of 1 would imply that participants' brain activation does not change over time (no within-subject variance). ICCs are classified according to Cicchetti 4 as poor (<0.40), fair (0.41-0.60), good (0.61-0.75), and excellent (>0.75) 5 . So far, almost exclusively adult neuroimaging studies have measured reliability, and they found large variance across studies with an average ICC of 0.5 2 . According to Cicchetti 4 , this ICC can be classified as 'fair'. These methodological studies measured only small samples of 10 to 20 adults over a short time span of a few days to a few weeks 2 .
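As an illustration of the formula and the classification scheme above, the following minimal Python sketch computes an ICC from a subjects-by-sessions array and assigns a reliability label. It assumes the two-way consistency form, ICC(3,1), in which session effects are removed from the error term; the function names and the handling of values that fall exactly between the published category boundaries (e.g. 0.40 to 0.41) are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np

def icc_3_1(data):
    """ICC(3,1) for an array of shape (n_subjects, k_sessions)."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand_mean = data.mean()
    subject_means = data.mean(axis=1)
    session_means = data.mean(axis=0)

    # Partition the total sum of squares into subjects, sessions and error.
    ss_between = k * np.sum((subject_means - grand_mean) ** 2)
    ss_sessions = n * np.sum((session_means - grand_mean) ** 2)
    ss_total = np.sum((data - grand_mean) ** 2)
    ss_error = ss_total - ss_between - ss_sessions

    ms_between = ss_between / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_between - ms_error) / (ms_between + (k - 1) * ms_error)

def classify_icc(icc):
    """Label an ICC following the Cicchetti ranges quoted in the text.

    Boundary handling between the published ranges is an assumption.
    """
    if icc < 0.40:
        return "poor"
    if icc <= 0.60:
        return "fair"
    if icc <= 0.75:
        return "good"
    return "excellent"

# Hypothetical data: three participants, two sessions, constant shift of +1.
example = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
example_icc = icc_3_1(example)
```

Because the consistency form ignores a uniform shift between sessions, the hypothetical example yields an ICC of 1 although every participant's value changed.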
However, it remains largely unanswered whether these test-retest reliabilities can be generalized to typical developmental longitudinal samples, which usually span larger time intervals between measurements. There have been only two previous developmental studies that reported ICCs 6,7 . Van den Bulk et al. 7 investigated n = 20 adolescents aged 12 to 19 years and obtained fair reliability for the prefrontal cortex (PFC) and poor reliability for the amygdala using an emotional faces task. Koolschijn et al. 6 used a cognitive rule-switch task and showed fair to
FMRI reliability. Whole brain ICCs. The whole brain ICC of the reward paradigm was highest across paradigms, 0.74 (see Fig. 1), and together with the emotional attention paradigm, 0.62 (see Fig. 2), it was in the "good" range. The ICC of the cognitive control paradigm was lower and only in the fair range, 0.44 (see Fig. 3). An ANOVA showed that the whole brain reliability differed significantly between the paradigms (F = 102.67, p < 0.001, partial η² = 0.499) with post-hoc analyses revealing that whole brain reliability of the reward paradigm was higher than that of emotional attention, which was higher than that of cognitive control (with all paradigms differing significantly from one another, all p < 0.001).
ICCs of different ROIs. For the task-based ROIs in the emotional attention paradigm, ICCs were poor (amygdala, IFG, medial prefrontal cortex, mPFC) to excellent (fusiform gyrus, FG) ranging from 0.19 to 0.84 and poor for the development-based ROIs (ACC, IFG) ranging from 0.17 to 0.29 (see Fig. 2). A Wilcoxon signed-rank test revealed that the ICC of the task-based IFG ROI was higher than that of the development-based IFG ROI (p = 0.002 for the left IFG and p = 0.001 for the right IFG). For the cognitive control paradigm, ICCs ranged from 0.32 to 0.56. ICCs were thus poor to fair for the dlPFC and dACC and fair for the PPC (see Fig. 3). The intertemporal choice paradigm yielded poor to fair ICCs for the VS and ACC ranging from 0.32 to 0.52 and excellent ICCs for the superior parietal lobe (Par-Sup) and the FG ranging from 0.81 to 0.89 (see Fig. 1). The control region in the occipital cortex (superior occipital lobe, Occ-Sup) yielded good to excellent reliability across paradigms.

Discussion
The current study aimed at investigating reliability in a large sample of mid-adolescents in three important domains of information processing using an emotional, a cognitive, and a reward-related task. We also considered different ROIs while holding other parameters that might influence reliability constant. Results showed that behavioral reliability was fair for all three paradigms. For fMRI reliability, the cognitive task yielded only fair whole brain reliability whereas the emotional and the reward-related task showed good whole brain reliability. ICCs of ROIs depended largely on the specific regions and the task and ranged from poor to excellent. Overall, ICCs were comparable to previous adult studies 2 .
In terms of behavioral reliability from age 14 to 16, we found fair to good ICCs. For the emotional and the cognitive task, the participants became faster, while the reward-related task indicated less impulsive behavior, which is in line with previous studies 27-29 . Low behavioral reliability can be expected for tasks with developmental changes.
For our first aim, to explore whether functional imaging reliability depends on the task domain, the whole brain ICC was chosen. This ICC has been suggested to be the strictest approach for reliability 2 , because it assumes on a whole-brain basis that the level of activity in all voxels should remain constant irrespective of suprathreshold activation. All three paradigms obtained a fair to good whole brain reliability. The reward paradigm had the highest whole brain ICC of 0.74 which can be classified as good to excellent. The emotional task had a whole brain ICC of 0.62 that was significantly lower but still in the good range. The ICC of the cognitive task differed significantly and was only in the fair range (ICC = 0.44). Thus, our first hypothesis, that the cognitive task would show higher reliability than the emotional or reward-related task (see also ref. 1), was not supported. On the contrary, the reward-related task yielded the highest reliability, followed by the emotional and the cognitive task. Our findings do not support the conclusions of Plichta et al. 8 , probably because they measured adults and investigated the amygdala only for their emotional task while we also investigated whole brain ICCs. Speculatively, for the cognitive paradigm the low behavioral reliability might be related to the low fMRI reliability. However, when exploring correlations of behavioral and fMRI ICCs we did not find such a relationship (see S4 in the supplements). Furthermore, the lower whole brain ICCs of the cognitive control paradigm could stem from lower ICCs in the occipital regions (0.61 and 0.61 as opposed to 0.79, 0.73, 0.84 and 0.81 for the other two paradigms) or from higher ICCs in the emotional attention paradigm for lower processing regions such as the IFG. The conclusion regarding lower reliability in the cognitive control paradigm must thus be treated cautiously and investigated further in future studies.
Regarding our second aim, the single analyzed ROIs, the control region in the occipital cortex yielded good to excellent reliability across paradigms. The high reliability for the occipital cortex in the emotional paradigm is in line with another adolescent study 7 . The fusiform gyrus, a rather low-level visual area, also yielded excellent reliability in both the emotional and reward task, in line with previous emotional adult studies 30,31 . In contrast, other regions that are relevant for cognitive or emotional-motivational processes such as subcortical (amygdala, VS) and cortical regions (PFC) showed low reliability. Taken together, the current study suggests that across three tasks in the same sample reliabilities might be higher in regions of basic visual processing compared to cognitive or emotional-motivational brain regions. This might be due to higher variability in higher-level cognitive processes than in basic visual processing 32 . Another explanation might be that developmentally, visual regions have already matured, while subcortical and cortical higher-level regions continue to develop in adolescence 33,34 .
In the following the regions that are relevant for cognitive or emotional-motivational processes are discussed for each paradigm separately.
For the emotional attention paradigm we found poor amygdala ICCs. Only one previous study investigated adolescent amygdala reliability, in an age-heterogeneous sample of n = 20 adolescents aged 12 to 19 years 7 , and found poor reliability within a short interval of 3 months. Our results show that poor amygdala reliability is also evident in a large sample of mid-adolescents within a longer time interval of 2 years.
From a developmental perspective, current results can be integrated with previous findings of a potential peak in amygdala activation in mid-adolescence compared to childhood and adulthood (for a review, see refs 1 and 33). While some previous cross-sectional studies have supported this amygdala peak 35,36 , longitudinal studies rather indicated "relative stability" in amygdala activation across mid-adolescence 9,16 . The current sample is a sub-sample of our previous longitudinal study that did not find amygdala activation change from age 14 to 16 16 . Therefore, current results suggest that this "relative stability" and lack of peak in mid-adolescence might occur at the same time as intra-individual variability, i.e. low reliability in amygdala activation (in accordance with the conclusions of a recent review) 1 .
It is also possible that the amygdala signal itself might be unstable, independent of development 1 . This is supported by adult studies that also found poor to fair amygdala ICCs in emotional tasks 8,25,26,30,31 .
Regarding frontal regions important for emotional processing 16 , the first region IFG showed poor reliability in line with a previous emotional adult study 31 . An emotional adolescent study found that IFG activation at baseline correlated with activation 2 years later indicating some degree of reliability 37 . The second region, mPFC, showed poor reliability similar to the adolescent study of van den Bulk et al. 7 . In our previous longitudinal study 16 , part of the IFG and the ACC demonstrated a developmental effect, i.e. higher activation at age 16 than 14. Expectedly, this developmental region showed a lower reliability than the (larger) IFG ROI that was functionally defined at age 14. The ACC showed a poor reliability similar to an adult study 31 .
The cognitive control paradigm showed poor to fair ICCs, partly in contrast to the only other adolescent study 6 that found good ICCs for the PPC and dACC while the dlPFC result was in a similar fair range. It should be noted, however, that the ACC of Koolschijn et al. 6 was located more anteriorly. An adult study also found good to excellent ICCs 38 . However, there are not many studies that have calculated ICCs in cognitive control tasks. Cognitive control can be divided into three related factors: inhibition, shifting, and updating 39 . The current interference and switch task assesses both inhibition and shifting. No previous study examined ICCs using such a task. Taking updating tasks into account, current results are in line with ICC ranges of adult studies (Plichta et al. 8 using an n-back task, Brandt et al. 40 using a memory encoding task, and Bennett and Miller 41 using an episodic and two-back memory task). We speculate that ICCs in our task may be low, as it assesses two cognitive control functions simultaneously. Unfortunately, due to our task design it is not possible to separate both components of cognitive control (i.e. task switching and overcoming incongruence) because each trial contains information on incongruence as well as task switching. Future studies should systematically compare ICCs of more basic cognitive control tasks.
To our knowledge this is the first study that tested reliability of a reward-related paradigm in an adolescent sample. The intertemporal choice paradigm showed fair to good ICCs in the superior parietal lobe and the ACC, which is in line with previous adult studies (probabilistic reversal task 42 ; classification learning task 43 ). For the VS, our results were in the poor to fair range, which is in line with Chase and colleagues 44 using a card guessing task re-scanned within one week. In contrast, Plichta et al. 8 found excellent ICCs in the VS for a reward task within two weeks. Our findings of low VS reliability are in line with the conclusions by Crone & Elzinga 1 that there might be large variability in subcortical brain regions (amygdala, VS) in adolescence.
The reliability of fMRI data has implications for longitudinal studies of reward processing, which are pivotal to detect developmental change in brain-behavior relations. For example, Braams et al. 45 assessed response to rewards in participants aged 8 to 25 longitudinally within 2 years and found an inverted U-shaped activation of the VS with a peak in activation during adolescence. This peak was also found behaviorally in a balloon analog risk taking task. A further longitudinal study was able to extend knowledge about dynamics of reward anticipation on the brain and behavioral level in adolescents 11 . Results showed that changes in VS activation over 2 years were related to changes in the behavioral approach system fun seeking score 46 during the same time period. A third longitudinal study found increasing dorsal striatal activation from mid-adolescence to late-adolescence/early adulthood in response to anticipation of gain and loss 12 . Taken together, reliability of reward-related activation seems to depend on time between measurements and brain regions. While ICCs of cortical areas were mostly good to excellent, the results regarding the subcortical area VS are not conclusive. Additionally, ICCs have to be interpreted with respect to expected development-related changes regarding activation patterns. Thus, additional studies are needed to systematically investigate this relationship.
Overall, current results warrant discussion with regard to the following considerations and limitations. The ICC depends on the between-subject variance. Thus, current results might be related to the type of the current sample that is rather homogeneous (fine-grained age range, similar sociodemography, intelligence, and pubertal status). Future studies could test reliability using more heterogeneous samples.
Similar to other adolescent reliability studies 6 , this study was not designed a priori as a methodological study that investigates reliability but was part of an overall research project focusing on adolescent brain development in several domains. The large sample size spanning about 200 participants (before exclusion due to movement, technical or behavioral outliers, see S1 in the supplement) required a time span of about 2 years. Because of this time span and the developmental sample, we cannot disentangle reliability that is affected by development from the reliability that would have occurred without development (e.g. in an adult population).
Scientific Reports | 7: 2287 | DOI:10.1038/s41598-017-02334-7
Assuming that changes in brain processes will be more likely to occur in contrasts which are expected to be affected by development (i.e. specific contrasts, like decision for small immediate vs. larger later in the intertemporal choice task), we used more general contrasts to investigate the reliability of the imaging data in our large sample. Although reliability and developmental changes are not two sides of the same coin, both are harder to distinguish the more developmentally sensitive the contrast is. Therefore, our rationale was that, if the reliability of the more general contrasts were moderate to high, the imaging data per se might be reliable; in the current study even over a timespan of two years.
As this area is still controversial, we chose baseline contrasts after careful consideration, since their constancy allowed us to compare single conditions of different paradigms more clearly as opposed to two contrasted conditions per paradigm. Especially in the developmental literature, the importance of differentiating between baseline and higher level contrasts has been emphasized 1, 47 to infer more precisely which contrast led to developmental effects: in case of developmental changes in a higher level contrast, it is not possible to conclude what has changed: condition A, condition B, or both 1,47 . Furthermore, it has been suggested that baseline contrasts yield better reliability than higher level contrasts 8 . However, current results have to be considered carefully and with potentially lower ICCs for higher level contrasts in mind.
Nevertheless, the study is unique due to its large sample and the three tasks that were tested for reliability. Future studies could systematically assess reliability in a (smaller) adolescent sample within a short time span and at the same time systematically control for potential changes in several domains (development, cognitive strategy, motivation etc.) and compare tasks that show developmental change in adolescence and those which do not. The reliabilities could further be compared to an additional adult population.
This study contained a qualitative comparison between tasks and was not designed a priori to systematically compare reliabilities of parallelized tasks. There were several aspects that could not be controlled for in the current analyses. First, the number of specific trials for the chosen contrast differed between tasks. While the task with the highest number of trials was the most reliable one, the emotional attention task had fewer trials than the cognitive control task but a higher reliability, which does not fit the conclusion that the number of trials correlates with task reliability. Second, behavioral differences that might stem from changes in performance, cognitive strategy or task focus 48-50 could not be controlled for. Third, the implicit baseline that was included in all regressors of interest differed between tasks (length of fixation cross and cognitive process during baseline). Fourth, due to each paradigm's specific effect size functional ROIs were created specifically for each paradigm: the statistical thresholds for the second-level analyses that built the basis of the definition of the functional ROIs differed between paradigms as well as the approach to rely on the peak voxels (emotional attention, cognitive control) or the anatomical overlap (intertemporal choice). Future studies should hold these features between tasks constant or control for them to be able to systematically compare task domains without potential confounders. ROIs were defined on the group level instead of the individual level similar to other studies 20,51-55 . Future studies could also add ROIs based on the individual level and calculate reliability.
Taken together, ICCs in each paradigm were largely dependent on the respective ROIs with subcortical ROIs (VS, amygdala) resulting in lower ICCs than visual ROIs. The emotional and reward paradigm had higher whole brain ICCs than the cognitive paradigm. Current results add to the yet sparse overall ICC literature in both developing samples and adults. In the different task domains, ICCs were similar to those in adult studies. To test whether results are specific for adolescents or can be generalized to adults the current paradigms could be tested in adults. Analyses of stability, i.e. reliability, are helpful benchmarks for longitudinal studies and their implications for adolescent development.

Material and Methods
Participants. The institutional review board of the medical faculty of the TU Dresden approved the study and the study was realized in accordance with it and with the Declaration of Helsinki. Participants were recruited from local schools and received monetary compensation for their participation. Written informed consent was obtained from both the participants and one of their legal guardians. The current dataset stems from the overall project "The adolescent brain" 22 that investigated 250 adolescents at age 14 and again at age 16. For technical and practical issues not all of these participants completed all three tasks at both time points.
Sub-populations of this sample were previously reported regarding cross-sectional analyses of age 14 (emotional attention task, n = 164, Pilhatsch et al. 15 , intertemporal choice task, n = 235, Ripke et al. 22 ; n = 206, Ripke et al. 56 , cognitive control task, n = 184, Mennigen et al. 17 , Rodehacke et al. 18 ) or longitudinal change from age 14 to 16 (emotional attention task, n = 144, Vetter et al. 16 , intertemporal choice task, n = 80, Ripke et al. 23 ). We here report on the overlapping sample of 104 healthy participants who performed all three tasks at age 14 and 16 successfully. This sample was analyzed for reliability for the first time.
For information on exclusion criteria for each task see Supplement S1. Participants had normal or corrected to normal vision and no record or current diagnosis of neurological, psychiatric, or serious medical disorders. Current psychiatric disorders were identified with the Development and Well-Being Assessment (DAWBA 57 ). General cognitive ability of the sample was in the average to above average range (IQ across both time points: M = 115; SD = 10; range = 89-139) and did not change between measurements (t = 1.03; p = 0.31). 76.7% of the participants attended the higher grammar school (German "Gymnasium") and 23.3% the lower grammar school (German "Mittelschule"). Parental education ranged from no school education (7) to doctoral degree (1) with an average education of M = 3.38 (SD = 1.45), representing a university diploma. For further details about the sample see Table 2. A urine test assured no use of illicit drugs (e.g. cannabis, heroin, cocaine) on the day of assessment.

Paradigms.
For an overview of the main characteristics of the three paradigms see Table 3. In the emotional attention task, participants had to decide whether a pair of visual target stimuli was identical or not while another pair was presented as a distractor. Participants were not asked to attend to a particular emotional category but cued spatially by an arrow pointing in the direction of the two stimuli. Each trial consisted of a pair of pictures from one of three emotional categories (positive, neutral, negative) and a pair of non-emotional pictures. The emotional pictures were taken from the International Affective Picture System (IAPS 58 ); and the non-emotional pictures were created by shredding the chosen IAPS pictures with GIMP (www.gimp.org). For further details see Vetter et al. 16 and Pilhatsch et al. 15 and Supplement S2.
The first screen of the cognitive control task was an arrow consisting of two triangles pointing in one (left, right, up or down) direction and a red dot located either at the tip or the tail of the arrow. Participants were instructed to move a joystick in the direction indicated by the arrow or the dot. The shape of the background served as a task cue: If the background was rectangular, participants had to move the joystick in the direction of the arrow and ignore the position of the dot; conversely, if the background was circular, participants had to respond to the position of the dot while ignoring the arrow direction. Stimuli could be congruent, i.e. dot and arrow were pointing in the same direction, or incongruent, i.e. the dot and the arrow were pointing in opposite directions. For further details see Mennigen et al. 17 , Rodehacke et al. 18 .
In the intertemporal choice task participants had to choose between a larger later reward, which changed from trial to trial, and a fixed immediate reward, which was instructed beforehand but not shown during scanning. In the current paper, the contrast of interest was the phase of the presentation of the potential later reward, i.e. the intertemporal decision phase, which refers to the process of comparing both alternatives in a given trial (fixed immediate or later reward). The task started with a behavioral training session to estimate the individual impulsivity parameter k, which was used to adapt the scanning paradigm to the subjects' impulsivity. For more details see Ripke et al. 22 and Ripke et al. 56 .
Task presentation and order. The paradigms were presented with an LCD-based display system which was mounted on the head-coil (NordicNeuroLab AS, Bergen, Norway). Behavioral data were collected with a joystick (Resonance Technology Inc., Northridge, CA, USA) for the cognitive control task and by ResponseGrips (©NordicNeuroLab) with a button on a grip in each hand for the emotional attention and intertemporal choice task. Task presentation and recording of the behavioral responses were performed using Presentation® software (version 11.1, Neurobehavioral Systems, Inc., Albany, CA). Each task was preceded by a practice session. Since the tasks were assessed within an overall project including a large behavioral and fMRI battery, the order of tasks varied slightly between time points. At age 14, the order of paradigms was emotional attention, cognitive control and intertemporal choice on three different days within two weeks. At age 16 first the cognitive control and then the intertemporal choice task were assessed on the same day followed by the assessment of the emotional attention task within two weeks.
Analysis of fMRI data. FMRI data analyses were performed using SPM5 (Wellcome Trust Center of Neuroimaging, London, UK) and were the same for both time points per paradigm.
Preprocessing. For preprocessing, which was identical for all three tasks, functional images were first slice-time corrected by using the middle slice as reference and realigned to the first image (by a six-parameter rigid-body spatial transformation). Afterwards they were spatially normalized into Montreal Neurological Institute (MNI) space and spatially smoothed with an 8 mm full-width half maximum Gaussian kernel.
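For orientation, an 8 mm full-width half maximum (FWHM) corresponds to a Gaussian standard deviation of FWHM / (2 sqrt(2 ln 2)), i.e. roughly 3.4 mm. The following Python sketch illustrates this conversion and builds a normalized one-dimensional kernel. The 3 mm voxel size, the truncation at four standard deviations, and the function names are assumptions made for illustration only; they are not taken from the acquisition protocol or from SPM5.

```python
import math
import numpy as np

def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian FWHM to its standard deviation (same units)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def gaussian_kernel_1d(fwhm_mm, voxel_mm, truncate=4.0):
    """Normalized 1-D Gaussian kernel sampled on the voxel grid."""
    sigma_vox = fwhm_to_sigma(fwhm_mm) / voxel_mm
    half_width = int(math.ceil(truncate * sigma_vox))
    x = np.arange(-half_width, half_width + 1)
    kernel = np.exp(-0.5 * (x / sigma_vox) ** 2)
    return kernel / kernel.sum()

sigma_mm = fwhm_to_sigma(8.0)                    # about 3.4 mm
kernel = gaussian_kernel_1d(8.0, voxel_mm=3.0)   # hypothetical 3 mm voxels
```

A three-dimensional isotropic kernel is separable, so applying such a one-dimensional kernel along each image axis in turn would give the same result as a full three-dimensional convolution.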
Statistical analysis. For all paradigms first-level contrasts were computed with a fixed effects analysis for each participant based on the general linear model by modeling the different conditions as regressors of interest within each voxel for the whole brain. For each paradigm, the six subject-specific movement regressors, which were derived from the rigid-body realignment, were included as covariates of no interest. A high-pass filter with a cut-off of 128 s was applied to remove low-frequency physiological noise 59 for each paradigm. In addition, a first-order autoregressive model, AR(1), was employed to account for residual temporal autocorrelation 59 for each paradigm. Contrasts of interest (see Table 3) were computed for each paradigm within each subject. The first-level contrast images from the weighted beta-images were used for second-level whole brain random-effects analyses to allow for population inference. For a detailed description of the first- and second-level analyses for each paradigm see S3 in the supplement.
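The logic of a first-level model with one regressor of interest and six movement covariates can be sketched for a single voxel as follows. This is a simplified ordinary least squares illustration on synthetic data, not the SPM5 estimation used in the study: it omits convolution with a hemodynamic response function, the 128 s high-pass filter, and the AR(1) correction, and all names and simulated values are hypothetical.

```python
import numpy as np

def fit_glm(y, task, motion):
    """OLS estimates for one voxel: [task, 6 motion regressors, intercept]."""
    n_scans = y.shape[0]
    design = np.column_stack([task, motion, np.ones(n_scans)])
    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Synthetic single-voxel time series (all values are made up).
rng = np.random.default_rng(0)
n_scans = 200
task = np.tile(np.r_[np.zeros(10), np.ones(10)], 10)   # simple boxcar
motion = 0.1 * rng.standard_normal((n_scans, 6))       # nuisance covariates
true_effect = 2.0
y = (true_effect * task
     + motion @ rng.standard_normal(6)
     + 100.0
     + 0.1 * rng.standard_normal(n_scans))

beta = fit_glm(y, task, motion)

# Contrast "task > implicit baseline": weight 1 on the task regressor only.
contrast = np.zeros(beta.size)
contrast[0] = 1.0
con_value = float(contrast @ beta)
```

The contrast value obtained per voxel and participant is the kind of quantity that enters the reliability analysis at both time points.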

Definition of ROIs.
For an overview of used ROIs see Fig. 4. ROIs were defined based on a priori hypotheses regarding activation in the respective tasks and based on functional masks resulting from the whole-brain analyses of each task at the first time point, i.e. age 14 16,17,22 . 10 mm spheres were placed around the peak coordinates (see Table S3 in the Supplementary Materials) of the whole brain analyses at age 14 and thus final ROIs created. Additionally, bilateral superior occipital ROIs using the WFU-PickAtlas with the Automated Anatomical Labeling Atlas (AAL) were created that served as control regions for all three tasks. Specific ROI approaches for each paradigm are described in the following.
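The construction of a spherical ROI around a peak coordinate can be sketched in Python as follows. The sketch assumes that "10 mm" refers to the sphere radius (the text does not state whether radius or diameter is meant), and it uses a hypothetical 3 mm isotropic voxel grid and a hypothetical peak at the origin of MNI space; the function name and the affine matrix are illustrative only.

```python
import numpy as np

def sphere_mask(shape, affine, center_mm, radius_mm):
    """Boolean mask of voxels whose centres lie within radius_mm of center_mm."""
    i, j, k = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    voxels = np.stack([i, j, k, np.ones_like(i)], axis=-1)
    # Map voxel indices to millimetre coordinates with the 4x4 affine.
    coords_mm = voxels @ affine.T
    dist = np.linalg.norm(coords_mm[..., :3] - np.asarray(center_mm), axis=-1)
    return dist <= radius_mm

# Hypothetical 21 x 21 x 21 grid with 3 mm voxels; voxel (10, 10, 10) is 0 mm.
affine = np.array([[3.0, 0.0, 0.0, -30.0],
                   [0.0, 3.0, 0.0, -30.0],
                   [0.0, 0.0, 3.0, -30.0],
                   [0.0, 0.0, 0.0, 1.0]])
mask = sphere_mask((21, 21, 21), affine,
                   center_mm=(0.0, 0.0, 0.0), radius_mm=10.0)
```

Contrast values of all voxels inside such a mask could then be extracted per participant and time point for the ROI-wise reliability analysis.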
Emotional attention paradigm. For this paradigm, we focused on attending negative versus attending neutral stimuli for functional ROI extraction for two reasons. First, the attending negative in contrast to the attending neutral condition resulted in slower reaction times, which indicates an attentional capture effect 16 . Second, separate ROIs for emotional attention could be created by subtracting the neutral contrast (but not by subtracting the implicit baseline since almost the whole brain was activated). The amygdala was chosen as an additional ROI because it was also activated for negative target stimuli in the paradigm; the whole amygdala was defined anatomically as a larger cluster using the WFU-PickAtlas with the Talairach Daemon (TD) Brodmann atlas (following refs 15,16 ). Furthermore, for this paradigm, two ROIs with developmental effects were analyzed that emerged from higher activation during presentation of emotional target and distractor stimuli for age 16 versus 14 in the right and left inferior frontal gyrus (IFG) and the ACC 16 , see Table S3 in the Supplementary Materials.
Cognitive control paradigm. ROIs were created based on a conjunction analysis 17 . Switch- and incongruence-related activity overlapped in bilateral dACC, dlPFC and PPC. We chose trials with co-occurrence of incongruence and switch (switch incongruent trials > implicit baseline) for two reasons. First, these trials led to a steep increase in reaction time and error rate, therefore reflecting a high level of cognitive control 17 . Second, task switch and incongruence trials robustly and independently activated the core regions of the cognitive control network 17 .
Intertemporal choice paradigm. For this paradigm, ROIs of the fusiform gyrus, the superior parietal lobe as well as the ACC were created by using the overlap of functional activation of the intertemporal decision phase 22,56 and anatomical regions using the WFU-PickAtlas with the AAL atlas. The overlap with anatomical regions was necessary to create distinct ROIs because the activation spanned one very large cluster across the whole brain. We additionally chose the VS as a ROI since it is highly relevant for reward paradigms. The anatomical ROIs of the VS were created with the WFU-PickAtlas using the AAL atlas.
Analyses of reliability. Behavioral reliability. Behavioral ICCs (3,1) were calculated using SPSS v21 (IBM Corp., Armonk, USA). For the emotional attention and the cognitive control paradigm, reaction times of the specific conditions and overall reaction times across conditions and for the intertemporal choice paradigm, log-transformed discount parameters were analyzed for reliability.
FMRI reliability. FMRI ICCs were calculated with the ICC toolbox of Caceres et al. 60 . We used the intra-voxel reliability "ICC v " obtained by using the contrast value of each voxel within each ROI of each individual subject. The population estimate was obtained by bootstrapping with 1,000 re-samples of participants, of which medians and standard errors are reported. Additionally, whole brain ICCs were calculated, since this is the strictest criterion and potentially the most valuable reliability measure as it yields a global measurement of test-retest agreement 2 . ICCs were classified according to Cicchetti 4