Evaluation of neurodegenerative disease progression may be assisted by quantification of the volume of structures in the human brain using magnetic resonance imaging (MRI). Automated segmentation software has improved the feasibility of this approach, but often the reliability of measurements is uncertain. We have established a unique dataset to assess the repeatability of brain segmentation and analysis methods. We acquired 120 T1-weighted volumes from 3 subjects (40 volumes/subject) in 20 sessions spanning 31 days, using the protocol recommended by the Alzheimer's Disease Neuroimaging Initiative (ADNI). Each subject was scanned twice within each session, with repositioning between the two scans, allowing determination of test-retest reliability both within a single session (intra-session) and from day to day (inter-session). To demonstrate the application of the dataset, all 3D volumes were processed using FreeSurfer v5.1. The coefficient of variation of volumetric measurements was between 1.6% (caudate) and 6.1% (thalamus). Inter-session variability exceeded intra-session variability for lateral ventricle volume (P<0.0001), indicating that ventricle volume in the subjects varied between days.
Background & Summary
Quantitative measurement of the volume of human brain structures using magnetic resonance imaging (MRI) may have application in the monitoring of neurodegenerative disease progression or the assessment of general structural abnormalities1,
The combination of freely available data and automatic segmentation software has the potential to make volumetric brain MRI an indispensable tool for both research and clinical medicine. It is now possible to compare quantitative brain volume measurements from a patient with control data from databases of normal subjects within only a few minutes10. Since brain volume measurements are a known biomarker for neurodegenerative diseases, such as Alzheimer’s disease2,11, this general approach may have great clinical value as a screening tool. Performing volumetric measurements at multiple time points in the same patient may also be a powerful means of measuring changes in the brain. Recent longitudinal results have shown that atrophy in medial temporal lobe (MTL) structures is predictive of subsequent memory decline in elderly subjects12. Additionally, performing longitudinal MRI studies improves predictive power in assessing the prognosis of individuals with mild cognitive impairment (MCI)13, and predicting the development of Alzheimer’s disease in non-demented subjects14 as well as MCI patients15.
We have begun to investigate the feasibility of employing the above approach at our institution. However, clinical implementation requires thorough validation, and one existing cause for concern regarding this methodology is that the repeatability (i.e., test-retest reliability) of individual measurements is still largely unknown. This prevents robust comparison of a single patient measurement against those in a database of normal subjects. It also confounds the analysis of longitudinal measurements from an individual, since it is crucial to know whether any apparent trend is simply due to measurement variation. That is, without knowing the fluctuations of our measurements in a single subject, we cannot be certain what disease effects we can detect. A number of confounding factors can be controlled, such as the scanner, imaging protocol, and software version used for data processing. MRI data are inherently noisy, however, and effects such as patient positioning and field of view selection are not constant between scans. There is also evidence of possible day-to-day variations in the brain structures of interest, for example due to hydration levels16,17 or the menstrual cycle18.
The goal of this study was to generate a publicly available test dataset to assist in the validation of the repeatability of current and future segmentation methods. We believe that this will be of great utility to other researchers who may not have the imaging facilities or resources to collect such a dataset. To demonstrate the utility of this dataset, we use it to obtain basic test-retest reliability information about the volume data corresponding to brain regions of clinical interest. This includes an analysis to test the hypothesis that test-retest reliability (repeatability) of measurements can be measured using multiple scans in a single session (intra-session) and that changes from day to day (inter-session) do not affect repeatability.
An experiment was designed to allow separate calculation of intra- and inter-session test-retest reliability (Figure 1). Human data collection was performed with the approval of the Stanford University Institutional Review Board (IRB) and all subjects gave their written informed consent. A total of 120 T1-weighted volumes were acquired from 3 subjects (40 scans/subject). Each subject was scanned twice on twenty different days within a 31-day period. Subjects were repositioned between the two scans in each session, and reregistered on the scanner console, so that all scans were treated as separate measurements (with a resulting break of ~5 min between scans). This ensured that confounding effects such as subject positioning were consistent between and within sessions.
We used a GE MR750 3T scanner (software version DV22.0_V02_1122.a, XRMB gradient set) and the ADNI-recommended T1-weighted imaging protocol for this system (accelerated sagittal 3D IR-SPGR, 27 cm FOV, 256 x 256 matrix, 1.2 mm slice thickness, standard 8-channel phased array head coil, TR: 7.3 ms, TE: 3 ms, TI: 400 ms, flip angle: 11 deg., acquisition time: 5 min 37 s). We chose to use the ADNI protocol9 because it exhibits excellent contrast between gray and white matter and has become standardized, so it is relevant to the research community.
Other data recorded throughout the study included subject weight, time of day, and phantom data for quality assurance and scanner stability. Both the ADNI phantom19 and the ACR (American College of Radiology) phantom were used for this purpose.
Following the guidelines of our IRB, written informed consent was obtained from each subject allowing their data to be shared online. The test-retest dataset is available through FigShare (Data Citation 1: FigShare http://dx.doi.org/10.6084/m9.figshare.929651). It consists of a single zip file containing the following:
A folder containing the brain imaging data, comprising 120 data files, separated by subject number into three folders. Each file contains an anonymized MRI brain volume in NIfTI-1 format. Note that these files have not been processed by FreeSurfer in any way and are not corrected for B1 inhomogeneity or otherwise modified. Conversion from DICOM format was performed using a publicly available and widely used converter, ‘dcm2nii’, which is available online at http://www.mccauslandcenter.sc.edu/mricro/mricron/dcm2nii.html, together with extensive documentation. More information about the NIfTI file format is available from the Neuroimaging Informatics Technology Initiative webpage at http://nifti.nimh.nih.gov/.
A text file (‘anonymized_dicom_header.txt’) containing the anonymized DICOM header from Subject 2. Researchers can use this file to obtain any extra information about the scan protocol that is not adequately described in this manuscript.
Three files in Microsoft Excel format, indicating the day and time each image volume was acquired from each subject, the weight of each subject at the time of each scan, and the age and gender of each subject.
DICOM and NIfTI files corresponding to 3D scans of the ADNI phantom and ACR phantom.
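The image files listed above can be inspected without an imaging library. The following is a minimal sketch, assuming only the published NIfTI-1 header layout (a 348-byte header with `dim` at byte offset 40 and `pixdim` at byte offset 76). The function name and the example path are hypothetical and are not part of the dataset description.

```python
import gzip
import struct
from pathlib import Path


def read_nifti1_header(path):
    """Return (matrix dimensions, voxel sizes) read from a NIfTI-1 header."""
    path = Path(path)
    # Converters often write gzip-compressed volumes (.nii.gz); handle both cases.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as f:
        header = f.read(348)
    if len(header) < 348:
        raise ValueError("file is too short to hold a NIfTI-1 header")

    # The first field (sizeof_hdr) must equal 348; use it to detect byte order.
    for order in ("<", ">"):
        if struct.unpack(order + "i", header[0:4])[0] == 348:
            break
    else:
        raise ValueError("not a NIfTI-1 file")

    dim = struct.unpack(order + "8h", header[40:56])      # dim[0] = number of dimensions
    pixdim = struct.unpack(order + "8f", header[76:108])  # voxel sizes, typically in mm
    ndim = dim[0]
    return dim[1:ndim + 1], pixdim[1:ndim + 1]


# Hypothetical usage; the actual folder and file names come from the zip archive:
# dims, voxel_sizes = read_nifti1_header("path/to/volume.nii.gz")
```

Checking the matrix dimensions and voxel sizes of each file against the acquisition protocol is a quick way to confirm that a download and unpacking step went as expected.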
All 3D volumes were visually inspected at the time of acquisition to check for severe motion corruption or any other potentially confounding artifacts. No significant or unexpected artifacts were observed, and so no scans were repeated. Regular QA tests on the scanner did not indicate any performance issues during the 31-day period of the study. To further validate the technical quality of the dataset, and to demonstrate its utility, we processed and analyzed the data using the methods described in this section.
All 3D volumes were processed using FreeSurfer6,
We applied the following statistical analysis to the FreeSurfer output to assess variability for measurements obtained together on the same day (intra-session) and measurements from day to day (inter-session). Paired acquisitions allowed the intra-session variability to be computed using the expression for standard deviation from paired data21. The following analysis was applied to each brain structure of interest and for each subject. Denoting the standard deviation of intra-session measurements by $\sigma_s$, then

$$\sigma_s = \sqrt{\frac{1}{2m}\sum_{i=1}^{m}\left(x_{i,1} - x_{i,2}\right)^{2}}$$

where $x_{i,1}$ and $x_{i,2}$ are the ith paired measurements over the m pairs (or sessions), evaluated separately for each subject and assuming that the variance is the same for all samples. Intra-session variability reflects manual repositioning differences, noise and segmentation errors, but not biological variations occurring from day to day. The total standard deviation, $\sigma_t$, is computed across all 40 measurements from each subject.
In each case, the standard deviation is then converted to coefficient of variation (CV) expressed as a percentage, namely

$$\mathrm{CV}_s = 100 \times \frac{\sigma_s}{\bar{x}}, \qquad \mathrm{CV}_t = 100 \times \frac{\sigma_t}{\bar{x}}$$

where CVs is the intra-session coefficient of variation, CVt is the total coefficient of variation, and $\bar{x}$ is the arithmetic mean of all 40 volume values for the given structure and subject. We use the coefficient of variation since the normalization by the mean volume is somewhat more intuitive than absolute standard deviations. This is particularly the case when comparing the segmentation repeatability of different structures, which may have very different volumes. Finally, CVs and CVt data from each subject were pooled by taking the root-mean-square average, as proposed by Glüer et al.22, since taking the arithmetic mean directly is not valid.
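The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the example volumes are made up, and the use of the sample standard deviation (n - 1 denominator) for the total standard deviation is an assumption, since the text does not specify it.

```python
import math
import statistics


def intra_session_sd(pairs):
    """Standard deviation from paired data: sqrt(sum((a - b)^2) / (2 m))."""
    m = len(pairs)
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * m))


def coefficients_of_variation(pairs):
    """Return (CV_s, CV_t) in percent for one subject and one structure."""
    values = [v for pair in pairs for v in pair]
    mean = statistics.fmean(values)
    sigma_s = intra_session_sd(pairs)
    # Assumption: sample SD (n - 1 denominator) over all measurements.
    sigma_t = statistics.stdev(values)
    return 100 * sigma_s / mean, 100 * sigma_t / mean


def pool_rms(cvs):
    """Pool per-subject CVs by their root-mean-square average."""
    return math.sqrt(sum(cv ** 2 for cv in cvs) / len(cvs))


# Made-up volumes (in ml) for three sessions of one subject, two scans each.
example_pairs = [(7.61, 7.58), (7.70, 7.66), (7.55, 7.59)]
cv_s, cv_t = coefficients_of_variation(example_pairs)
print(f"CV_s = {cv_s:.2f}%, CV_t = {cv_t:.2f}%")
```

Pooling with `pool_rms` averages variances rather than standard deviations, which is why it is preferred over the arithmetic mean of the per-subject CV values.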
A Monte Carlo permutation test23 was used to evaluate observed differences between CVt and CVs for each structure under the null hypothesis of ‘no difference’, i.e., that CVs=CVt. In brief, the scan order of all 40 volumes for each subject is randomly permuted N times. The above analysis is repeated for each permutation and the value obtained for |CVs−CVt| is recorded. The two-sided P-value is then calculated as the proportion of the N permutations where the absolute difference |CVs−CVt| was equal to or greater than the value obtained with the measured data. In this work, we used N=100,000.
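A minimal sketch of such a permutation test for a single subject and structure follows. It is an assumption-laden illustration, not the authors' implementation: the function names are hypothetical, the example volumes are made up, consecutive values are assumed to be the two scans of one session, and the pooling across subjects described earlier is omitted for brevity.

```python
import math
import random
import statistics


def intra_session_sd(values):
    """Paired-data SD, treating consecutive values as the two scans of a session."""
    pairs = zip(values[0::2], values[1::2])
    m = len(values) // 2
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * m))


def permutation_p_value(values, n_perm=100_000, seed=0):
    """Two-sided P-value for |CV_s - CV_t| under random reordering of the scans."""
    rng = random.Random(seed)
    mean = statistics.fmean(values)
    # The mean and total SD do not depend on scan order, so compute them once.
    cv_t = 100 * statistics.stdev(values) / mean

    def gap(ordered):
        return abs(100 * intra_session_sd(ordered) / mean - cv_t)

    observed = gap(values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += gap(shuffled) >= observed
    return hits / n_perm


# Made-up volumes with a clear day-to-day effect; a reduced n_perm keeps this quick.
volumes = [100.0, 100.1, 110.0, 110.1, 90.0, 90.1, 120.0, 120.1, 80.0, 80.1, 105.0, 105.1]
print(permutation_p_value(volumes, n_perm=2_000))
```

Shuffling destroys the session pairing while leaving the set of values unchanged, so only the intra-session term varies between permutations.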
Data analysis results
Figure 2 shows typical segmentation results obtained using FreeSurfer. The volumetric data segmented in this example were acquired from Subject 2 on Day 1 of the study. Despite generally good segmentation accuracy, minor differences are visible between the two scans from one session. This is to be expected, since the input data differ slightly due to noise, subject positioning, and other confounding factors.
Quantitative differences between segmentation volumes are shown in Figure 3 for several brain structures from Subject 2. For the thalamus, caudate, pallidum and hippocampus, the differences between each scan visually appear to be random. However, in the case of the lateral ventricles it is evident that inter-session variance exceeds intra-session variance, since values obtained on the same day (neighboring bars in the plot) are typically closer in magnitude to each other than values obtained on different days.
Results from the numerical analysis show total coefficient of variation values between 1.6% (caudate) and 6.1% (thalamus) for the structures analyzed (Table 1). The mean structure volume appears to have little influence on the coefficient of variation value. The lateral ventricles are the only structures showing a statistically significant difference between intra- and inter-session variability. This difference is also substantial: accounting for variance between sessions more than doubles the total coefficient of variation for lateral ventricle volume from 1.6% to 3.4%.
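As a rough reading of these two figures, and assuming that within-session and between-day variances are independent and additive (an approximation introduced here for illustration, not an analysis reported in the study), the between-day contribution to lateral ventricle variability, written here as CV_b, can be estimated as:

```latex
\mathrm{CV}_b \approx \sqrt{\mathrm{CV}_t^{2} - \mathrm{CV}_s^{2}}
             = \sqrt{3.4^{2} - 1.6^{2}}
             = \sqrt{11.56 - 2.56}
             = 3.0\%
```

Under this assumption, day-to-day changes would contribute roughly twice as much variability as the intra-session measurement error for this structure.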
Based on the results shown in Table 1, one might expect that true ventricle size changes are occurring between days. To investigate this further, registered volumes were compared directly. Figure 4 shows three orthogonal slices from Subject 2, acquired at the time of the maximum (Day 7, Measurement 2, shown in top row) and minimum (Day 18, Measurement 1, shown in bottom row) lateral ventricle volume. The reduced ventricle size is visible, particularly when the images are viewed dynamically (see video available through FigShare (Data Citation 1: FigShare http://dx.doi.org/10.6084/m9.figshare.929651)).
To explore the apparent change in ventricle size, we investigated the relationship between the lateral ventricle volumes and (a) subject weight at the time of each scan, (b) image signal-to-noise ratio (SNR), and (c) time of day of scan (Figure 5). No correlation with weight or image SNR (Supplementary Figure 1) was observed. For time of day, a statistically significant positive correlation (r=0.66, P=0.0016) was observed for Subject 3, but not for Subjects 1 or 2.
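A correlation of this kind can be checked with a short sketch such as the one below. It assumes that the reported r is a Pearson correlation coefficient, which the text does not state explicitly. The function names and the example values are hypothetical.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Made-up example: scan time (hours) against ventricle volume (ml).
scan_hours = [9.0, 10.5, 13.0, 15.5, 17.0]
volumes_ml = [14.1, 14.0, 14.4, 14.3, 14.6]
r = pearson_r(scan_hours, volumes_ml)
print(f"r = {r:.2f}, t = {t_statistic(r, len(scan_hours)):.2f}")
```

Converting the t statistic to a P-value requires the Student t distribution, which is not in the Python standard library; a statistics package would be needed for that step.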
The analysis and results presented in this work demonstrate how the test-retest dataset can be used to determine the reliability of brain volume measurements. The coefficient of variation (CV) values shown in Table 1 indicate the repeatability of the volumetric measurements obtained using our FreeSurfer-based pipeline. The CV values shown (from 1.6% to 6.1%, depending on the structure) could be used for effect size calculations for volumes of individual structures in a study using the same processing pipeline. However, we did not attempt to perform a comprehensive evaluation of FreeSurfer itself, or to compare its output to that from other software packages or manual segmentation, as the reliability of FreeSurfer has already been evaluated on numerous occasions24,
The main unique aspect of our test-retest dataset is the large number of scans performed on each subject over a short period (one month), with two back-to-back scans performed within each session and the subject repositioned between the two scans. This can be used to separate intra- from inter-session variation, using the methods demonstrated in this work. In particular, the data indicate that variance between days exceeds variance within a session for lateral ventricle volume and that this is due to actual volume changes of the ventricles between days, rather than measurement errors.
The observed ventricle volume changes may be due to subject hydration effects, since dehydration can cause shrinkage of brain tissue and enlargement of ventricle volume16,17. But unlike these previous studies, our subjects were not required to deliberately become dehydrated or to consume excess fluid, and we did not observe a reduction in body mass correlating with ventricular volume. We refrained from using a hydration protocol, as it is not standard practice to subject patients to such a measure for routine imaging. Despite this, the maximum observed ventricle volume changes in our study were of similar magnitude to those found in other studies where subjects underwent a ‘thermal-exercise dehydration protocol’17,27. It is therefore interesting to note that these effects also occur in normal subjects, and it is possible that other confounding factors are affecting water balance in the brain. The time of day did not appear to be a strong predictor of this effect (Figure 5). Regardless of the cause, this effect may need to be considered when looking for signs of ventricular enlargement using MR-based methods such as VIENA28. It is unclear whether the effect would be more pronounced in older subjects where ventricles are larger. Larger ventricles are potentially easier to segment, leading to less error in the volume estimates, and the same absolute volume change from day to day would correspond to a smaller percentage change.
Although the lateral ventricles were the only structures where we were able to reject the null hypothesis of identical intra- and inter-session variance, a ‘conservation of volume’ assumption within the skull would suggest that volume changes should occur in other brain structures to compensate for the ventricle volume gain or loss. Perhaps future advances in segmentation methods will allow these changes to be detected in this dataset. Nonetheless, the data indicate that changes from day to day are likely to be very small for structures other than the ventricles.
Regular measurements of the QA phantoms and other standard scanner QA measurements gave no evidence of scanner-related variation over the 31-day period. We have also not observed any changes in the brain volume data that would indicate this was a limiting factor in the study. A more important limitation is that the subjects (healthy, age 26–31) are not representative of the general population or any specific patient group. With only three subjects, it is also possible that the inter-scan variability does not hold for a larger population. Nonetheless, the data are sufficient to perform basic test-retest reliability analysis, which was the goal of the work.
Our experimental protocol differed significantly from that used in previous studies examining test-retest reliability29,30, since our subjects were imaged many times, rather than on only two occasions. While we specifically aimed at comparing intra- and inter-session variation, previous work has largely focused on the influence of confounding factors such as pulse sequence, imaging geometry, parallel imaging, scanner drifts and choice of segmentation software. We have collected a unique dataset, complementary to existing studies, which we hope will become a valuable tool to assess repeatability of segmentation methods.
How to cite this article: Maclaren, J. et al. Reliability of brain volume measurements: A test-retest dataset. Sci. Data 1:140037 doi: 10.1038/sdata.2014.37 (2014).
Maclaren, J., Han, Z., Vos, S. B., Fischbein, N. & Bammer, R. FigShare http://dx.doi.org/10.6084/m9.figshare.929651 (2014).
We thank Christoph Seeger, Alexander Brost and Norbert Schuff for valuable discussions and Andrew Gentles for his advice concerning the statistical methods used in this work. We would also like to acknowledge the generosity of the authors of FreeSurfer (Anders Dale, Bruce Fischl and the entire development team) for making their software freely available to the public.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ Metadata associated with this Data Descriptor is available at http://www.nature.com/sdata/ and is released under the CC0 waiver to maximize reuse.