An open science resource for establishing reliability and reproducibility in functional connectomics

Efforts to identify meaningful functional imaging-based biomarkers are limited by the ability to reliably characterize inter-individual differences in human brain function. Although a growing number of connectomics-based measures are reported to have moderate to high test-retest reliability, the variability in data acquisition, experimental designs, and analytic methods precludes the ability to generalize results. The Consortium for Reliability and Reproducibility (CoRR) is working to address this challenge and establish test-retest reliability as a minimum standard for methods development in functional connectomics. Specifically, CoRR has aggregated 1,629 typical individuals’ resting-state fMRI (rfMRI) data (5,093 rfMRI scans) from 18 international sites, and is openly sharing them via the International Neuroimaging Data-sharing Initiative (INDI). To allow researchers to generate various estimates of reliability and reproducibility, a variety of data acquisition procedures and experimental designs are included. Similarly, to enable users to assess the impact of commonly encountered artifacts (for example, motion) on characterizations of inter-individual variation, datasets of varying quality are included.


Background & Summary
Functional connectomics is a rapidly expanding area of human brain mapping [1][2][3][4] . Focused on the study of functional interactions among nodes in brain networks, functional connectomics is emerging as a mainstream tool to delineate variations in brain architecture among both individuals and populations [5][6][7][8] . Findings that established network features and well-known patterns of brain activity elicited via task performance are recapitulated in spontaneous brain activity patterns captured by resting-state fMRI (rfMRI) [3][4][5][6][9][10][11][12] have been critical to the widespread acceptance of functional connectomics applications.
A growing literature has highlighted the possibility that functional network properties may explain individual differences in behavior and cognition 4,7,8 , the potential utility of which is supported by studies that suggest reliability for commonly used rfMRI measures 13 . Unfortunately, the field lacks a data platform by which researchers can rigorously explore the reliability of the many indices that continue to emerge. Such a platform is crucial for the refinement and evaluation of novel methods, as well as those that have gained widespread usage without sufficient consideration of reliability. Equally important is the notion that quantifying the reliability and reproducibility of the myriad connectomics-based measures can inform expectations regarding the potential of such approaches for biomarker identification [13][14][15][16] .
To address these challenges, the Consortium for Reliability and Reproducibility (CoRR) has aggregated previously collected test-retest imaging datasets from more than 36 laboratories around the world and shared them via the 1000 Functional Connectomes Project (FCP) 5,17 and its International Neuroimaging Data-sharing Initiative (INDI) 18 . Although primarily focused on rfMRI, this initiative has worked to promote the sharing of diffusion imaging data as well. It is our hope that among its many possible uses, the CoRR repository will facilitate the: (1) Establishment of test-retest reliability and reproducibility for commonly used MR-based connectome metrics, (2) Determination of the range of variation in the reliability and reproducibility of these metrics across imaging sites and retest study designs, (3) Creation of a standard/benchmark test-retest dataset for the evaluation of novel metrics.
Here, we provide an overview of all the datasets currently aggregated by CoRR, and describe the standardized metadata and technical validation associated with these datasets, thereby facilitating immediate access to these data by the wider scientific community. Additional datasets, and richer descriptions of some of the studies producing these datasets, will be published separately (for example, A high resolution 7-Tesla rfMRI test-retest dataset with cognitive and physiological measures 19 ). A list of all papers describing these individual studies will be maintained and periodically updated at the CoRR website (http://fcon_1000.projects.nitrc.org/indi/CoRR/html/data_citation.html).

Experimental design
At the time of submission, CoRR had received 40 distinct test-retest datasets that were independently collected by 36 imaging groups at 18 institutions. All CoRR contributions were based on studies approved by a local ethics committee; each contributor's respective ethics committee approved submission of deidentified data. Data were fully deidentified by removing all 18 HIPAA (Health Insurance Portability and Accountability Act)-protected health information identifiers, and face information from structural images, prior to contribution. All data distributed were visually inspected before release. While all samples include at least one baseline scan and one retest scan, the specific designs and target populations employed across samples vary given the aggregation strategy used to build the resource. Since many individual (uniformly collected) datasets have reasonably large sample sizes allowing stable test-retest estimates, this variability across datasets provides an opportunity to generalize reliability estimates across scanning platforms, acquisition approaches, and target populations. The range of designs included is captured by the following classifications:
o Scan repeated on same day
o Behavioral condition may or may not vary across scans depending on sample
o Scan is repeated for 3 or more sessions in a short time-frame that is believed to be developmentally stable
o Scans repeated one or more times on same day, as well as across one or more sessions
Table 1 presents an overview of the specific samples included in CoRR (Data Citations 1-31). The vast majority included a single retest scan (48% within-session, 52% between-session). Three samples employed serial scanning designs, and one sample had a longitudinal developmental component.
Most samples included presumed neurotypical adults; exceptions include the pediatric samples from Institute of Psychology at Chinese Academy of Sciences (IPCAS 2/7), University of Pittsburgh School of Medicine (UPSM) and New York University (NYU) and the lifespan samples from Nathan Kline Institute (NKI 1).

Data Records

Data privacy
Prior to contribution, each investigator confirmed that the data in their contribution were collected with the approval of their local ethics committee or institutional review board, and that sharing via CoRR was in accord with their policies. In accord with prior FCP/INDI policies, face information was removed from anatomical images (FullAnonymize.sh V1.0b; http://www.nitrc.org/frs/shownotes.php?release_id=1902) and Neuroimaging Informatics Technology Initiative (NIFTI) headers were replaced prior to open sharing to minimize the risk of re-identification.

Distribution for use
CoRR data sets can be accessed through either the COllaborative Informatics and Neuroimaging Suite (COINS) Data Exchange (http://coins.mrn.org/dx) 20 , or the Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC; http://fcon_1000.projects.nitrc.org/indi/CoRR/html/index.html). CoRR datasets at the NITRC site are stored in .tar files sorted by site, each containing the necessary imaging data and phenotypic information. The COINS Data Exchange offers an enhanced graphical query tool, which enables users to target and download files in accord with specific search criteria. For each sharing venue, a user login must be established prior to downloading files. There are several groups of samples which were not included in the data analysis as they were in the data contribution/upload, preparation or correction stage at the time of analysis: Intrinsic Brain Activity, Test-Retest Dataset (IBATRT), Dartmouth College (DC 1), IPCAS 4, Hangzhou Normal University (HNU 2), Fudan University (FU 1), FU 2, Chengdu Huaxi Hospital (CHH 1), Max Planck Institute (MPG 1) 19 , Brain Genomics Superstruct Project (GSP) and New Jersey Institute of Technology (NJIT 1) (see more details on these sites at the CoRR website). Table 1 provides a static representation of the samples included in CoRR at the time of submission.

Imaging data
Consistent with its popularity in the imaging community and prior usage in FCP/INDI efforts, the NIFTI file format was selected for storage of CoRR imaging datasets, independent of modalities such as rfMRI, structural MRI (sMRI) and dMRI. Tables 2-4 (available online only) provide descriptions of the MRI sequences used for the various modalities for each of the imaging data file types.

Phenotypic information
All phenotypic data are stored in comma separated value (.csv) files. Basic information such as age and gender has been collected for each site to facilitate aggregation with minimal demographic variables. Table 5 (available online only) depicts the data legend provided to CoRR contributors.

Technical Validation
Consistent with the established FCP/INDI policy, all data contributed to CoRR were made available to users regardless of data quality. Justifications for this decision include the lack of consensus within the functional imaging community on criteria for quality assurance, and the utility of 'lower quality' datasets for facilitating the development of artifact correction techniques. For CoRR, the inclusion of datasets with significant artifacts related to factors such as motion is particularly valuable, as it enables the determination of the impact of such real-world confounds on reliability and reproducibility 21,22 . However, the absence of screening for data quality in the data release does not mean that the inclusion of poor quality datasets in imaging analyses is routine practice for the contributing sites. Figure 1 provides a summary map describing the anatomical coverage for rfMRI scans included in the CoRR dataset.
To facilitate quality assessment of the contributed samples and selection of datasets for analyses by individual users 23 , we made use of the Preprocessed Connectome Project quality assurance protocol (http://preprocessed-connectomes-project.github.io), which includes a broad range of quantitative metrics commonly used in the imaging literature for assessing data quality. They are itemized below:
24 27 . A measure of the mean signal in the 'ghost' image (signal present outside the brain due to acquisition in the phase encoding direction) relative to mean signal within the brain.
o Artifact Detection (only sMRI) 28 . The proportion of voxels with intensity corrupted by artifacts normalized by the number of voxels in the background.
o Contrast-to-Noise Ratio (CNR) (only sMRI) 24 . Calculated as the mean of the gray matter values minus the mean of the white matter values, divided by the standard deviation of the air values.
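As a minimal sketch of the CNR definition given above, the following Python function takes three arrays of voxel intensities (gray matter, white matter, and air). The function name, the way the tissue samples are obtained, and the use of the population standard deviation are assumptions made for illustration; this is not the quality assurance protocol's own code. The subtraction follows the order stated in the text, so the sign of the result depends on which tissue is brighter in a given contrast.

```python
import numpy as np

def contrast_to_noise_ratio(gm_values, wm_values, air_values):
    """CNR as described in the text: (mean GM - mean WM) / SD of air.

    All three inputs are 1-D collections of voxel intensities that have
    already been extracted with tissue/background masks (hypothetical
    inputs; mask generation is outside this sketch).
    """
    gm = np.asarray(gm_values, dtype=float)
    wm = np.asarray(wm_values, dtype=float)
    air = np.asarray(air_values, dtype=float)
    noise_sd = air.std()  # population SD (ddof=0); an assumption
    if noise_sd == 0:
        raise ValueError("air values have zero variance; CNR is undefined")
    return float((gm.mean() - wm.mean()) / noise_sd)

# Toy intensities, purely illustrative.
example_cnr = contrast_to_noise_ratio([10, 12], [20, 22], [0, 2])
```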

• Temporal Metrics (rfMRI)
o Head Motion
▪ Mean framewise displacement (FD) 29 . A measure of subject head motion, which compares the motion between the current and previous volumes. This is calculated by summing the absolute value of displacement changes in the x, y and z directions and rotational changes about those three axes. The rotational changes are given distance values based on the changes across the surface of a 50 mm radius sphere.
▪ Percent of volumes with FD greater than 0.2 mm
▪ Standardized DVARS. The spatial standard deviation of the temporal derivative of the data (D referring to temporal derivative of time series, VARS referring to root-mean-square variance over voxels) 29 , normalized by the temporal standard deviation and temporal autocorrelation (http://blogs.warwick.ac.uk/nichols/entry/standardizing_dvars).
o General
▪ Outlier Detection. The mean fraction of outliers found in each volume using the 3dTout command in the software package for Analysis of Functional NeuroImages (AFNI: http://afni.nimh.nih.gov/afni).
▪ Median Distance Index. The mean distance (1 − Spearman's rho) between each time-point's volume and the median volume using AFNI's 3dTqual command.
▪ Global Correlation (GCOR) 30 . The average of the entire brain correlation matrix, which is computed as the brain-wide average time series correlation over all possible combinations of voxels.
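The framewise displacement description above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the column order of the motion parameters (three translations in mm followed by three rotations in radians) and the convention of assigning an FD of zero to the first volume are assumptions that vary between software packages.

```python
import numpy as np

def framewise_displacement(motion, radius_mm=50.0):
    """Framewise displacement per volume.

    motion: array of shape (n_volumes, 6). Assumed column order:
    x, y, z translations in mm, then three rotations in radians.
    Rotations are converted to arc length on a sphere of radius_mm.
    """
    motion = np.asarray(motion, dtype=float)
    deltas = np.abs(np.diff(motion, axis=0))  # change from previous volume
    deltas[:, 3:] *= radius_mm                # radians -> mm on the sphere
    # The first volume has no predecessor; set its FD to 0 (assumption).
    return np.concatenate([[0.0], deltas.sum(axis=1)])

def percent_volumes_above(fd, threshold_mm=0.2):
    """Percent of volumes whose FD exceeds threshold_mm."""
    return 100.0 * float(np.mean(np.asarray(fd) > threshold_mm))

# Toy motion parameters for three volumes, purely illustrative.
toy_motion = np.zeros((3, 6))
toy_motion[1:, 0] = 0.1   # 0.1 mm shift in x from the second volume on
toy_motion[2, 3] = 0.01   # 0.01 rad rotation at the third volume
toy_fd = framewise_displacement(toy_motion)
toy_mean_fd = toy_fd.mean()
toy_percent = percent_volumes_above(toy_fd)
```

If the rotation estimates are in degrees, as some packages report them, they would need to be converted to radians before calling the function.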
Imaging data preprocessing was carried out with the Configurable Pipeline for the Analysis of Connectomes (C-PAC: http://www.nitrc.org/projects/cpac). Results for the sMRI images (spatial metrics) are depicted in Supplementary Figure 1, for the rfMRI scans in Supplementary Figure 2 (general spatial and temporal metrics) and Supplementary Figure 3 (head motion). For both sMRI and rfMRI, the battery of quality metrics revealed notable variations in image properties across sites. It is our hope that users will explore the impact of such variations in quality on the reliability of data derivatives, as well as potential relationships with acquisition parameters. Recent work examining the impact of head motion on reliability suggests the merits of such lines of questioning. Specifically, Yan and colleagues found that motion itself has moderate test-retest reliability, and appears to contribute to reliability when low, though it compromises reliability when high [31][32][33] . Although a comprehensive examination of this issue is beyond the scope of the present work, we did verify that motion does have moderate test-retest reliability in the CoRR datasets (see Figure 2) as previously suggested. Interestingly, this relationship appeared to be driven by the lower motion datasets (mean FD < 0.2 mm). Future work will undoubtedly benefit from further exploration of this phenomenon and its impact on findings.
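The text does not state which estimator underlies the reliability values for motion. As a rough illustration of how test-retest reliability of a per-participant summary such as mean FD could be quantified, the sketch below computes a one-way random-effects intraclass correlation, ICC(1,1). The choice of estimator, the function name, and the example values are all assumptions made for illustration.

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    scores: array of shape (n_participants, k_sessions), for example
    mean FD for each participant at test and at retest.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    participant_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # Between-participant and within-participant mean squares.
    ms_between = k * np.sum((participant_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - participant_means[:, None]) ** 2) / (n * (k - 1))
    return float((ms_between - ms_within) / (ms_between + (k - 1) * ms_within))

# Hypothetical mean FD values (mm): rows are participants, columns are sessions.
toy_mean_fd = np.array([[0.10, 0.12],
                        [0.20, 0.18],
                        [0.35, 0.30]])
toy_icc = icc_oneway(toy_mean_fd)
```

Other ICC variants (for example, two-way models that separate a session effect) may be more appropriate for a given design; the one-way form is used here only because it is the simplest to state.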
Beyond the above quality control metrics, a minimal set of rfMRI derivatives was calculated for the datasets included in CoRR to further facilitate comparison of images across sites:
o Fractional Amplitude of Low Frequency Fluctuations (fALFF) 34,35 . The total power in the low frequency range (0.01-0.1 Hz) of an fMRI image, normalized by the total power across all frequencies measured in that same image.
o Voxel-Mirrored Homotopic Connectivity (VMHC) 36,37 . The functional connectivity between a pair of geometrically symmetric, inter-hemispheric voxels.
o Regional Homogeneity (ReHo) [38][39][40] . The synchronicity of a voxel's time series and that of its nearest neighbors based on Kendall's coefficient of concordance to measure the local brain functional homogeneity.
o Intrinsic Functional Connectivity (iFC) of Posterior Cingulate Cortex (PCC) 41 . Using the mean time series from a spherical region of interest (diameter = 8 mm) centered in PCC (x = −8, y = −56, z = 26) 42 , functional connectivity with PCC is calculated for each voxel in the brain using Pearson's correlation (results are Fisher r-to-z transformed).
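To make the fALFF definition above concrete, the following Python sketch computes it for a single voxel time series. It is an illustration under stated assumptions, not the C-PAC implementation: the function name is hypothetical, the spectrum is taken from a plain FFT of the mean-removed series, and amplitude (the square root of power) is summed, following the "amplitude" in the measure's name.

```python
import numpy as np

def falff(time_series, tr, band=(0.01, 0.1)):
    """Fractional ALFF for one time series.

    time_series: 1-D signal sampled every `tr` seconds.
    band: low-frequency range in Hz (0.01-0.1 Hz as in the text).
    Returns summed amplitude in the band divided by summed amplitude
    over all measured non-zero frequencies (up to Nyquist = 1 / (2 * tr)).
    """
    ts = np.asarray(time_series, dtype=float)
    ts = ts - ts.mean()                       # remove the constant offset
    amplitude = np.abs(np.fft.rfft(ts))[1:]   # drop the zero-frequency bin
    freqs = np.fft.rfftfreq(ts.size, d=tr)[1:]
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = amplitude.sum()
    return float(amplitude[in_band].sum() / total) if total > 0 else 0.0

# Toy example: the same white-noise series interpreted at two sampling rates.
rng = np.random.default_rng(0)
noise = rng.standard_normal(300)
falff_long_tr = falff(noise, tr=2.0)     # Nyquist 0.25 Hz
falff_short_tr = falff(noise, tr=0.645)  # Nyquist about 0.78 Hz
```

Because the denominator spans every frequency up to the Nyquist limit, a shorter TR widens the denominator while the numerator band stays fixed, so the short-TR value for broadband noise comes out lower.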
To enable rapid comparison of derivatives, we: (1) calculated the 50th, 75th, and 90th percentile scores for each participant, and then (2) calculated site means and standard deviations for each of these scores (see Table 6 (available online only)). We opted to not use increasingly popular standardization approaches (for example, mean-regression, mean centering +/− variance normalization) in the calculation of derivative values, as the test-retest framework provides users a unique opportunity to consider the reliability of site-related differences. As can be seen in Supplementary Figure 4, for all the derivatives, the mean value or coefficient of variation obtained for a site was highly reliable. In the case of fALFF, site-specific differences can be directly related to the temporal sampling rate (that is, TR; see Figure 3), as lower TR datasets include a broader range of frequencies in the denominator, thereby reducing the resulting fALFF scores (differences in aliasing are likely to be present as well). This note of caution about fALFF raises the general issue that rfMRI estimates can be highly sensitive to acquisition parameters 7,13 . Specific factors contributing to differences in the other derivatives are less obvious (it is important to note that the correlation-based derivatives have some degree of standardization inherent to them). Interestingly, the coefficient of variation across participants also proved to be highly reliable for the various derivatives; while this may point to site-related differences in the ability to detect differences across participants, it may also be some reflection of the specific populations obtained at a site (or the sample size). Overall, these site-related differences highlight the potential value of post-hoc statistical standardization approaches, which can also be used to handle unaccounted-for sources of variation within a site 43 .
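The two-step summary described at the start of this paragraph (per-participant percentiles, then site-level means and standard deviations) can be sketched as below. The function names, the assumption that each participant's derivative map has already been reduced to a 1-D array of in-brain voxel values, and the use of the sample standard deviation are illustrative choices, not details taken from the text.

```python
import numpy as np

PERCENTILES = (50, 75, 90)

def participant_percentiles(voxel_values, percentiles=PERCENTILES):
    """Step 1: percentile scores of one participant's derivative map."""
    return np.percentile(np.asarray(voxel_values, dtype=float), percentiles)

def site_summary(participant_maps, percentiles=PERCENTILES):
    """Step 2: mean and SD across a site's participants for each percentile.

    participant_maps: iterable of 1-D arrays, one per participant.
    Returns (means, sds), each with one entry per requested percentile.
    """
    scores = np.array([participant_percentiles(m, percentiles)
                       for m in participant_maps])
    return scores.mean(axis=0), scores.std(axis=0, ddof=1)  # sample SD (assumption)

# Two hypothetical participants with toy voxel values.
toy_maps = [np.arange(101), np.arange(101) + 10]
toy_means, toy_sds = site_summary(toy_maps)
```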
Finally, in Figure 4, we demonstrate the ability of the CoRR datasets to: (1) replicate prior work showing regional differences in inter-individual variation for the various derivatives that occur at 'transition zones' or boundaries between functional areas (even after mean-centering and variance normalization), and (2) show them to be highly reproducible across imaging sessions in the same sample. It is our hope that this demonstration will spark future work examining inter-individual variation in these boundaries and their functional relevance. These surface renderings and visualizations were carried out with the Connectome Computation System (CCS) documented at http://lfcd.psych.ac.cn/ccs.html and will be released to the public via github soon (https://github.com/zuoxinian/CCS).
To facilitate replication of our work, for each of Figures 1-3 and Supplementary Figures 1-4, we include a variable in the COINS phenotypic data that indicates whether or not each dataset was included in the analyses depicted. We also included this information in the phenotypic files on NITRC.

Usage Notes
While formal test-retest reliability or reproducibility analyses are beyond the scope of the present data description, we illustrate the broad range of potential questions that can be answered for rfMRI, dMRI and sMRI using the resource. These include the impact of:
• Acquisition parameters 7 [51][52][53]
Of note, at present, the vast majority of studies do not collect physiological data, and this is reflected in the CoRR initiative. With that said, recent advances in model-free correction (for example, ICA-FIX 54,55 , CORSICA 56 , PESTICA 57 , PHYCAA 58,59 ) can be of particular value in the absence of physiological data. Additional questions may include:
• How reliable are image quality metrics?
• How do reliability and reproducibility impact prediction accuracy?
• How do imaging modalities (for example, rfMRI, dMRI, sMRI) differ with respect to reproducibility and reliability? And within modality, are some derivatives more reliable than others?
• Can reliability and reproducibility be used to optimize imaging analyses? How can such optimizations avoid being driven by artifacts such as motion?
• How much information regarding inter-individual variation is shared and distinct among imaging metrics?
• Which features best differentiate one individual from another?
One example analytic framework that can be used with the CoRR test-retest datasets is Non-Parametric Activation and Influence Reproducibility reSampling (NPAIRS 60 ). By combining prediction accuracy and reproducibility, this computational framework can be used to assess the relative merits of differing image modalities, image metrics, or processing pipelines, as well as the impact of artifacts [61][62][63] .
Open access connectivity analysis packages that may be useful (list adapted from http://RFMRI.org):
• Brain Connectivity Toolbox (BCT; MATLAB) 64

Figure 4. Test-retest plots of individual variation-related functional boundaries. Detection of functional boundaries was achieved via examination of voxel-wise coefficients of variation (CV) for fALFF, PCC, ReHo and VMHC maps. For the purpose of visualization, coefficients of variation were rank-ordered, whereby the relative degree of variation across participants at a given voxel, rather than the actual value, was plotted to better contrast brain regions. Ranking coefficients of variation (R-CV) efficiently identified regions of greatest inter-individual variability, thus delineating putative functional boundaries.