Background & Summary

In the past decade, efforts in large-scale neuroimaging data collection have redirected research attention towards effective practices for data sharing, reuse, standardization, and secondary data analyses. Some of the most notable examples include projects such as the Human Connectome Project1,2,3, UK Biobank4,5, ADNI6, INDI7, ABCD8, CamCAN9 and OpenfMRI (recently rebranded as OpenNeuro)10. These projects have served as bellwethers for data sharing in a growing culture focused on advancing methods for open, big-data, reproducible science11,12,13,14,15. Similar efforts for large-scale shared collections can, in principle, promote the establishment of best practices for measurement standards and neuroinformatics methods, thereby contributing to a new generation of Big Data neuroscience research16,17.

The same data can be put to different uses across scientific communities. Data sharing can increase data value by promoting reuse for purposes beyond those of the original project, a process we call data upcycling18. As part of this upcycling process, data derivatives (secondary products generated by the various data analysis processes) can become useful data for scientific communities outside the community of origin. Succinctly stated, different scientific communities may have different interests in reusing brain data. For example, a white matter segmentation can be used by computer scientists for methods development19,20,21,22,23,24,25,26 or by neuroscientists to understand the brain27,28,29,30,31,32. Indeed, data can be reused for several applications not foreseen in the original study33, for example, to develop theoretical frameworks34 and new algorithms22,35,36, to advance data visualization practices37,38,39, and even for statistical validation of results40,41,42. The process of upcycling can help extract additional value from openly available datasets, thereby returning continuing dividends on the initial resource investments.

We propose a unique approach to brain data upcycling by presenting the Open Diffusion Data Derivatives (O3D), a repository composed of both data derivatives and their associated processing pipelines, bundled together and referenced by a single digital object identifier (DOI43). The O3D data were derived from anatomical (T1-weighted) and diffusion-weighted magnetic resonance imaging (dMRI) data using tractography methods. The O3D data were obtained from previously published high-quality, high-resolution dMRI data1,40,44,45 and processing pipelines22,40,45,46,47. The dataset comprises (1) the minimally preprocessed dMRI data files (12 brains from three different datasets with different signal-to-noise ratios and resolutions) and (2) a large set of diverse data derivatives comprising 360 tractograms, 7,200 segmented major tracts, and 720 connection matrices. The total size of the O3D repository is approximately 1.79 terabytes of data derivatives.

Diffusion-weighted magnetic resonance imaging (dMRI) and tractography allow measurement of structural connectomes, white matter macro-anatomy, and microscopic tissue properties in the living human brain. These techniques have revolutionized our understanding of how brain networks and the brain’s white matter impact human behavior, in health and disease29,30,48,49,50,51,52,53,54,55,56,57,58. Neurotractography techniques provide fundamental insights about the human brain, and yet much work remains to be done to map the human connectome40,59,60,61,62,63. Because of the complexity of these methods, the success of the modern scientific enterprise in mapping the human connectome almost certainly depends on transdisciplinary contributions from multiple communities, from Psychology and Neuroscience to Mathematics, Statistics, Computer Science, and Engineering. For this reason, open scientific discovery and collaborative sharing of methods, software, and data are of paramount importance16,64,65.

In addition to data, the processing pipelines used to generate the O3D data are made available on the platform as a series of “open services,” hereafter referred to simply as Apps. We define open services as self-contained processing applications embedded and reusable in a cloud platform environment. The platform allows running these Apps to process data available within the platform itself66,67,68. The concept of an open service is akin to that of the Brain Imaging Data Structure Applications69, as also introduced previously by others70. The Apps used below follow a generalized and lightweight specification so as to allow usage with diverse combinations of software from multiple libraries, such as FSL71, FreeSurfer72, DIPY73, Nipype74, LiFE22,40, AFQ75, MRtrix76, and AFNI77. These Apps can be containerized and made reproducible using technologies such as Docker78 and Singularity79. Alternatively, Apps can also run without containerization on software environments compatible with NeuroDebian80. The platform currently utilizes a mixture of public (jetstream-cloud.org81,82; opensciencegrid.org68), commercial, and institutional computing resources. The platform is a registered DataCite center, a member of the catalogue (see83), and a registered project on the NeuroImaging Tools and Resources Collaboratory (RRID: SCR_016513).

Publication records on the platform, such as O3D, are preserved for at least ten years after their latest use and comply with the adopted metadata specification to promote maximum discoverability and adherence to the FAIR principles84. The complete list of Apps used to generate the O3D data is preserved as part of the repository43. These Apps are provided as preserved files, allowing access to the code version used to generate the specific O3D data, and can be reused for future research. The publication and preservation strategy is resilient to the version changes likely to occur over time for each App or dataset. A full description of the platform and Apps is beyond the scope of this data descriptor; more information can be found on the platform website.

Using the O3D project, investigators can either process new data using the same pipelines used to generate the core O3D repository or download data derivatives processed at different stages along the series of steps taken to generate tractography, white matter tracts, and connectivity matrices. Additionally, new data can in principle be uploaded using the web portal and used to generate new results. Data can also be downloaded, formatted as BIDS (Brain Imaging Data Structure)85,86, using simple web or command line interfaces. Finally, open-source code and containers implementing the processing pipelines can be found in the repositories listed in Table 2.

The O3D repository is unique in that it focuses on publishing repeated-measures data derivatives for tractography, white matter tracts, and structural connectome matrices, all associated with open services publishing reproducible data processing pipelines and workflows. The O3D dataset provides a means for computational test-retest quantification41,87,88,89 and reproducibility. To generate the data derivatives, three tractography algorithms were each run ten times on the same data source (individual brain). Due to the stochasticity of such algorithms, the results of each run are slightly different. This number of repeats has previously been shown to allow measuring the variability and reliability of connectome mapping methods21,22,40. The tractography results were evaluated using state-of-the-art methods22,40 and compared against classical neuroanatomy atlases used to segment the major human white matter tracts75,90. Finally, a series of connection matrices (i.e., brain networks) were generated using standard cortical parcellation methods91. Three example scenarios demonstrate transdisciplinary applications and show how investigators from different communities can utilize the O3D core set. First, investigators developing network science algorithms35,63,92,93,94 might have an interest in demonstrating the applicability or efficacy of their methods on brain network data, but lack the skills to process the raw diffusion data into connectivity matrices. The data derivatives provide an easily accessible point of entry by making available unthresholded brain connection matrices built using data from multiple individuals and different tracking methods. Second, investigators studying white matter neuroanatomy, or developing software for automated segmentation of white matter tracts, can use the data derivatives as complex test objects to compare the results of new algorithms with the state-of-the-art reference set represented by O3D25,95,96,97.
Finally, the data derivatives can be an essential education and training resource. They may be used by students and trainees in the neural and clinical sciences to learn about neuroanatomy or to develop practical analytic skills. All O3D data are compatible with most major neuroimaging software packages and can be conveniently loaded, processed, and visualized40,71,72,73,75,76.

The present descriptor introduces the O3D repository and the publication mechanisms necessary to describe it. The O3D reference repository will allow investigators from multiple scientific communities to explore brain data, perform visualization experiments, and replicate the data derivatives without having to first learn a full processing pipeline. This lowers the barrier to entry into computational neuroimaging, with the potential to advance algorithmic development, increase the involvement of underrepresented scholars, and facilitate training and validation16,98. The repeated-measures data derivatives distributed as part of O3D will appeal to a diverse range of research interests because of the extensive know-how necessary to generate them. Consequently, they can be used by communities of basic, clinical, translational, and computational scientists, including neuroscientists, students, and trainees early in their careers16,98,99.


Data sources

Three diffusion-weighted magnetic resonance imaging (dMRI) datasets from publicly available sources45,100 were used to generate all the derivatives in the initial repository layout.

Stanford dataset (STN)

We used data collected in four subjects at the Stanford Center for Cognitive and Neurobiological Imaging with a 3T General Electric Discovery 750 MRI scanner (General Electric Healthcare), using a 32-channel head coil (Nova Medical). dMRI data had whole-brain coverage and were acquired with a dual-spin echo diffusion-weighted sequence, using 96 diffusion-weighting directions and a b-value of 2,000 s/mm2 (TE = 96.8 ms). Spatial resolution was 1.5 mm isotropic. Each dMRI volume is the average of two measurements (NEX = 2). Ten non-diffusion-weighted images (b = 0) were acquired at the beginning of each scan40,45,47.

Human connectome project datasets (HCP3T and HCP7T)

We used data collected in eight subjects from the Human Connectome Project, using Siemens 3T and 7T MRI scanners. Only measurements from the 2,000 s/mm2 shell were extracted from these data and used to generate the data derivatives in our repositories. Data from the 3T and 7T scanners differ in angular and spatial resolution (HCP3T: 90 gradient directions, 1.25 mm isotropic resolution; HCP7T: 60 gradient directions, 1.05 mm isotropic resolution) and have been described previously along with the methods used for data preprocessing44,100,101,102.

Data preprocessing

We developed a series of steps to process the anatomical and dMRI data files in a standardized manner for publication as part of the O3D repository. All original data were aligned to the plane defined by the anterior and posterior commissures, and the 2,000 s/mm2 shell was selected and used for the subsequent analyses. All MRI data were oriented in neurological coordinates (Left-Anterior-Superior) and the bvecs files were reoriented accordingly. The Apps implementing these operations can be found at103,104,105 (see also Tables 1 and 2). No additional denoising, eddy current, or head movement correction was applied beyond that performed by the data originators.
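The shell-selection step described above can be sketched in a few lines. This is a minimal illustration, not the O3D App itself (see103,104,105): the b-values, tolerance, and variable names are hypothetical, and a real pipeline would read the bvals/bvecs from the FSL-style text files distributed with the data.

```python
import numpy as np

# Hypothetical FSL-style gradient table: one b-value and one unit
# direction per dMRI volume (shapes and values are illustrative).
bvals = np.array([0, 0, 1000, 2000, 2005, 1995, 3000])
bvecs = np.random.randn(3, bvals.size)
bvecs /= np.linalg.norm(bvecs, axis=0, keepdims=True)

# Keep b=0 volumes plus the b=2,000 s/mm^2 shell, with a tolerance
# because scanner-reported b-values fluctuate around the nominal shell.
tol = 50
keep = (bvals < tol) | (np.abs(bvals - 2000) < tol)

bvals_shell = bvals[keep]
bvecs_shell = bvecs[:, keep]
```

The same boolean index would then be used to subset the 4D dMRI volume along its last axis, keeping image, b-values, and directions consistent.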

Table 1 List of the current Apps implementing the processing steps used to generate O3D, reusable on the platform as open services.
Table 2 List of software repositories with the code version of the scripts implementing the O3D Apps.

Voxel signal reconstruction and tractography

White matter fascicle tracking was performed using MRtrix 0.2.1276. White- and gray-matter tissues were segmented with FreeSurfer72 using the T1-weighted MRI images associated with each individual brain, and then resampled at the resolution of the dMRI data. Only voxels identified primarily as white-matter tissue were used to constrain tracking. We used three different tracking methods: (A) tensor-based deterministic tracking106,107, (B) Constrained Spherical Deconvolution (CSD)-based deterministic tracking76,108, and (C) CSD-based probabilistic tracking108,109. Maximum harmonic orders of Lmax = 10 (STN, HCP3T) and Lmax = 8 (HCP7T) were used110,111. Other parameter settings used to perform tracking were: step size, 0.2 mm; maximum length, 200 mm; minimum length, 10 mm. The fiber orientation distribution function (fODF) amplitude cutoff was set to 0.1, and for the minimum radius of curvature we adopted the default values fixed by MRtrix for each kind of tracking: 2 mm (DTI deterministic), 0 mm (CSD deterministic), and 1 mm (CSD probabilistic). We generated repeated measures of tractography derivatives by computing 10 candidate whole-brain fascicle groups for each individual brain, of 500,000 fascicles each. Apps implementing these methods can be found at112,113,114.
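The 10–200 mm length bounds above are enforced by MRtrix during tracking; as a sketch of what such a constraint means, the length of a streamline can be computed from its sequence of points. The synthetic straight streamline below, traced at the 0.2 mm step size reported above, is a hypothetical example.

```python
import numpy as np

def streamline_length(points):
    """Length (mm) of a streamline given its ordered 3D points."""
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

# Synthetic straight streamline: 100 steps of 0.2 mm -> 20 mm long.
step = 0.2
pts = np.column_stack([np.arange(101) * step,
                       np.zeros(101),
                       np.zeros(101)])

length = streamline_length(pts)

# Keep only streamlines within the 10-200 mm tracking bounds.
keep = 10.0 <= length <= 200.0
```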

Tractography evaluation

We used the Linear Fascicle Evaluation method (LiFE)40 to optimize whole-brain tractograms, implemented using the recently proposed ENCODE model22. The LiFE method identifies fascicles that successfully contribute to predicting the measured dMRI signal. It has been shown that only a percentage of the total number of fascicles generated by a single tractography method is supported by the properties of a given dataset40,47. Because of this, we removed all fascicles making no significant contribution to explaining the diffusion measurements. The percentage of streamlines retained in these optimized fascicle groups ranged between 10–20% (STN), 15–35% (HCP3T), and 20–40% (HCP7T). Apps implementing the method can be found at115.
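The fascicle-removal step can be illustrated with a short sketch: LiFE assigns a non-negative weight to every candidate fascicle, and fascicles with zero weight (no contribution to predicting the signal) are discarded. The weights below are randomly simulated stand-ins, not actual LiFE output.

```python
import numpy as np

# Simulated LiFE-style weights: roughly 25% of 500,000 candidate
# fascicles receive a positive weight, the rest are zeroed out.
rng = np.random.default_rng(0)
weights = rng.random(500_000) * (rng.random(500_000) < 0.25)

# Remove fascicles that make no contribution to the signal prediction.
retained = np.flatnonzero(weights > 0)
fraction = retained.size / weights.size  # fraction of streamlines kept
```

In the published derivatives, this retained fraction is what varies between roughly 10% and 40% across datasets and tracking methods.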

White matter tracts segmentation

Twenty major human white matter tracts were segmented using the Automating Fiber-tract Quantification (AFQ) method75. An additional step refined the segmented tracts by removing fiber outliers. The following tracts were segmented: left and right Anterior Thalamic Radiation (ATRl and ATRr), left and right corticospinal tract (CSTl and CSTr), left and right Cingulum - Cingulate gyrus (CCgl and CCgr), left and right Cingulum - Hippocampus portion (CHil and CHir), left and right Inferior Fronto-Occipital Fasciculus (IFOFl and IFOFr), left and right Inferior Longitudinal Fasciculus (ILFl and ILFr), left and right Superior Longitudinal Fasciculus (SLFl and SLFr), left and right Uncinate Fasciculus (UFl and UFr), left and right Superior Longitudinal Fasciculus - Temporal part (often referred to as the “arcuate fasciculus”, SLFTl and SLFTr), Forceps Major (FMJ), and Forceps Minor (FMI). Each tract was stored in the TrackVis file format. Apps implementing the method can be found at116,117,118.

Connection matrix construction

We used tractograms evaluated by the LiFE method to build connectivity matrices. Connectivity matrices were built for each fascicle group using the 68 cortical regions of the Desikan-Killiany atlas, segmented in each individual using the T1w MRI images and FreeSurfer72,91. Fascicle terminations were mapped onto each of the 68 regions. All fibers connecting pairs of brain regions were identified and collected. Adjacency matrices were built using two measures: (A) count119, computed as the number of fascicles connecting each unique pair of regions, and (B) density, computed as twice the number of fascicles between two regions divided by the sum of the number of voxels in the two atlas regions88,94,119,120. Apps implementing the method can be found at121.
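The two edge measures defined above can be sketched as follows; the fascicle terminations and region voxel counts are hypothetical toy inputs, and the real App (see121) operates on full tractograms and FreeSurfer parcellations.

```python
import numpy as np

n_regions = 68  # Desikan-Killiany cortical regions

# Toy inputs: region pair touched by each retained fascicle, and the
# voxel count of every atlas region (uniform here for simplicity).
endpoints = np.array([[0, 1], [0, 1], [1, 5], [5, 0]])
region_voxels = np.full(n_regions, 1000)

# (A) Count: number of fascicles connecting each unique pair of regions.
count = np.zeros((n_regions, n_regions))
for a, b in endpoints:
    count[a, b] += 1
    count[b, a] += 1  # undirected network

# (B) Density: twice the fascicle count divided by the sum of the voxel
# counts of the two regions, per the definition in the text.
vox_sum = region_voxels[:, None] + region_voxels[None, :]
density = 2 * count / vox_sum
```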

Open services for reproducible neuroscience

We provide the full set of scripts used to process the O3D repository, both as open services (Apps) that can be run on the platform (Table 1) and as the code and scripts used to implement each App (Table 2). Whereas the code can be downloaded to run the scripts locally, the Apps are embedded in the platform and can be reused to process data directly, avoiding the need to install software. Apps can be improved over time by users or developers, and for this reason their implementation can change. The platform therefore keeps track of App versions. We note that whereas the DOIs for the Apps reported in Table 1 direct users to the most recent version of each App available on the platform, the URLs in Table 2 direct users to the specific version of the code used for the preprocessing that generated the published O3D dataset. To fully support the reproducibility of the O3D publication, we preserve for each release both the data and a snapshot of the code for each App. The O3D Apps preserved with the original code version used to generate the repository are reported in43.

Data Records

Preserved O3D data and Apps can be downloaded at the web URL reported in43. Upon download, data are automatically organized as DataTypes as well as according to the specification defined by the Brain Imaging Data Structure (BIDS)85. We note that, currently, BIDS does not officially provide a complete specification for diffusion-weighted magnetic resonance imaging and tractography derivatives.

Following the provisional BIDS specification for data derivatives, we have organized the files within folders, where each folder name refers to the name of the App used to generate the files. The folder naming convention is based on tokens including (A) the name of the organization (e.g., brainlife) and (B) the name of the repository of the App (e.g., app-life). All files generated by an App are aggregated in subfolders, one for each subject. Following the BIDS convention: (1) each file name includes a descriptor (_desc-) referring to a unique identifier, (2) additional information on the DataType is reported in the filename by tags (_tag-), (3) repeated measures are denoted by the keyword run (_run-), (4) the last token of the file name indicates the BIDS datatype (e.g., _dwi), (5) the suffix denotes the file format (e.g., .nii.gz), and (6) metadata are recorded as a JSON file122.
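The token-based convention can be illustrated by assembling a file name in code; the token values below are hypothetical placeholders rather than actual O3D file names.

```python
# Hypothetical BIDS-style tokens following the convention described above.
tokens = {
    "sub": "0001",      # subject identifier
    "desc": "output",   # unique descriptor of the derivative
    "tag": "sddprob",   # DataType tag (here, a tractography method)
    "run": "01",        # repeated-measure index
}
suffix = "dwi"          # BIDS datatype token
ext = ".nii.gz"         # file format suffix

name = "_".join(f"{k}-{v}" for k, v in tokens.items()) + f"_{suffix}{ext}"
# name == "sub-0001_desc-output_tag-sddprob_run-01_dwi.nii.gz"
```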

Source data

The anatomical source files uploaded to the platform are stored as follows:




The diffusion MRI source files uploaded from Stanford to the platform are stored as follows:






The diffusion MRI source files uploaded from HCP to the platform are stored as follows:

{}/dwi/





Data preprocessing

The diffusion data after normalization (alignment and orientation) are stored as follows:

{}/dwi/






The diffusion signal reconstruction models generated the following volumetric images as NIfTI files: fractional anisotropy (_FA.nii.gz), the diffusion tensor model (model-DTI), and the constrained spherical deconvolution model (model-CSD). A brain mask and a white matter mask are also distributed at the dMRI data resolution (type-Brain, type-Whitematter). To increase the impact and compatibility of the O3D data files, two copies of each tractogram are distributed, one in MRtrix format (.tck) and the other in TrackVis format (.trk). One file is output per repeated-measure tractogram and tractography method (tag-dtstream, tag-sdstream, tag-sddprob).

{}/dwi/


















Tractography evaluation

We used the Linear Fascicle Evaluation (LiFE) model to evaluate the quality of fit to the dMRI signal. The output of the LiFE tractography evaluation process is stored as an encode model structure22. The encode DataType is stored as a MATLAB structure (_life.mat). Detailed documentation of the encode model structure is available online. One encode model structure is distributed per repeated-measure tractogram, for a total of ten runs for each tractography algorithm.

{}/dwi/







White matter classification

Twenty major human white matter tracts were classified for each tractogram and are distributed using the TRK file format. A JSON file for each tractogram records, for each tract, the enumeration ID, the tract label, and the number of fibers.

{}/dwi/
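A consumer of the per-tract JSON might parse it as sketched below. The exact key names are assumptions for illustration; only the recorded fields (enumeration ID, tract label, and fiber count) are stated in the text.

```python
import json

# Hypothetical two-tract excerpt of the per-tractogram JSON record.
record = json.loads("""
[
  {"id": 1, "label": "Left Anterior Thalamic Radiation", "count": 3214},
  {"id": 2, "label": "Right Anterior Thalamic Radiation", "count": 2987}
]
""")

# Map enumeration IDs to tract labels and total the fiber counts.
labels = {entry["id"]: entry["label"] for entry in record}
total_fibers = sum(entry["count"] for entry in record)
```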













Connectome matrices

Connection matrices were built using the aforementioned tractograms and the Desikan-Killiany atlas from FreeSurfer91. A connection matrix was computed for each repeated-measure tractogram, processed using the LiFE method, and for each tractography method {dtstream, sdstream, sddprob}. Two measures of connectivity were computed: fiber count and fiber density (fiber count divided by the volume of the two termination areas119,123). Connection matrices are stored as pairs of .csv and .json files. A NIfTI file records the cortical parcellation used to define the ROIs of the networks.

{}/dwi/
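Because the matrices are plain .csv files, they can be read with standard tools. The sketch below loads a hypothetical 3x3 excerpt and checks the symmetry expected of an undirected structural connectome (the real matrices are 68x68).

```python
import csv
import io

# Hypothetical 3x3 excerpt of a stored connection-matrix .csv file.
csv_text = "0,12,3\n12,0,7\n3,7,0\n"

with io.StringIO(csv_text) as f:  # a real file path would be used here
    matrix = [[float(v) for v in row] for row in csv.reader(f)]

# An undirected connectome matrix should be symmetric with a zero diagonal.
n = len(matrix)
assert all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))
assert all(matrix[i][i] == 0 for i in range(n))
```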





Technical Validation

In this section, we provide both a qualitative and a quantitative evaluation of the data derivatives made available at43. We report the SNR of each dataset used, demonstrate the quality of alignment between the dMRI and anatomical files, and show the voxel-level reconstruction of the diffusion signal, several properties of the tractography models, and the major white matter tracts segmented.

Data preprocessing

Data preprocessing was performed using a combination of previously published pipelines22,40,45,46,47 (see Methods for additional details). Diffusion-weighted MRI data were aligned to the T1-weighted anatomical images (Fig. 1a, left-hand columns; see Methods for additional details). The T1w images were used to segment the brain into different tissue types and brain regions72. The total white matter volume was identified using the previously generated white matter tissue segmentation, and all subsequent analyses were performed within the white matter volume. Figure 1a shows how the white matter volume (mask) defined on the anatomical image (middle) aligns with the non-diffusion-weighted (B0) image of the diffusion MRI data (left-hand panel) in three example subjects, one per dataset.

Fig. 1
figure 1

Data quality and preprocessing. (a) Axial view of dMRI (left, non-diffusion-weighted volume, B0), aligned anatomical image (center), and white matter mask obtained from the anatomy (white), overlaid on the B0 to show the quality of the white matter volume delineation. One example subject is reproduced from the Stanford (top), Human Connectome 3T (middle), and Human Connectome 7T (bottom) data. (b) Mean and ±1 s.d. across diffusion-weighted measurements of the signal-to-noise ratio (SNR) for each subject and dataset in the O3D distribution, as implemented at126.

To compare dMRI data quality across datasets, we computed the signal-to-noise ratio (SNR) by comparing the mean attenuated dMRI signal to the background noise for both diffusion-weighted and B0 measurements (Fig. 1b), as described by124,125. The App implementing this SNR method can be found at126.
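A simplified version of such an SNR computation is sketched below on synthetic data. The published App (see126) estimates noise from the measurements themselves; the noise model here is a deliberate simplification for illustration.

```python
import numpy as np

# Synthetic 4D dMRI stand-in: (x, y, z, volumes), mean 100, noise sd 5.
rng = np.random.default_rng(1)
data = rng.normal(loc=100.0, scale=5.0, size=(8, 8, 8, 20))
brain_mask = np.ones((8, 8, 8), dtype=bool)  # toy mask: whole volume

# SNR per volume: mean in-mask signal over the noise standard deviation
# (estimated here from the signal itself, purely for illustration).
signal = data[brain_mask]  # shape: (n_voxels, n_volumes)
snr = signal.mean(axis=0) / signal.std(axis=0)

snr_mean, snr_sd = snr.mean(), snr.std()  # summary values as in Fig. 1b
```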

White matter microstructure reconstruction within the voxel

The dMRI signal within each voxel was reconstructed using the two dominant models, namely the diffusion tensor (DTI127) and constrained spherical deconvolution (CSD110,111). Specifically, when applying CSD, we utilized an Lmax parameter of 10 for STN and 8 for HCP. These models provide different opportunities, as well as limitations, in characterizing the dMRI signal and brain fibers. Figure 2 shows the quality of the estimated deconvolution kernel (a) and the fit of the CSD model in three representative axial brain slices, one per dataset (b). The kernel estimation is important for effective fiber distribution estimation and long-range tracking128. Both dMRI reconstructions (DTI and CSD) were manually curated by visual inspection to ensure quality in the O3D dataset.

Fig. 2
figure 2

Estimated fiber orientation distribution functions (fODF). (a) Examples of the estimated single-fiber response function used to compute the fODF individually in each subject. The similarity and flat shape of the response functions ensure model-fit quality110,111. (b) Axial brain views from three example subjects, one per dataset, depicting the estimated fODFs in a series of voxels covering the corpus callosum and the centrum semiovale. The coverage and orientation of the response functions are consistent with major anatomical understanding.


Tractography

Tractography was reconstructed using two established methods: deterministic and probabilistic tractography76,106,107,108,129,130,131. We used deterministic tractography in combination with either the DTI or the CSD model. Probabilistic tractography was used only in combination with the CSD model. It has been established that these different methods generate white matter fascicles with different anatomical properties29,40,47,54,132,133,134. The O3D dataset provides three tractography reconstructions for each individual brain. Tractography outputs were stored using common file formats (.tck and .trk) to allow investigators to compare, reuse, and improve upon current tracking methods.

Figure 3a provides a qualitative depiction of the whole-brain tractography reconstruction in a subject from each dataset. Figure 3b reports a quantitative comparison of the fascicle length distributions for the whole-brain tractograms of the three example subjects in Fig. 3a.

Fig. 3
figure 3

Visualization of whole-brain tractograms and fascicle length distribution. (a) The full-brain tractography for each of the three datasets, as generated using the DTIdeterministic, CSDdeterministic, and CSDprobabilistic models. (b) The whole-brain connectome streamline count for each of the three tractography models applied to the STN, HCP3T, and HCP7T datasets.

Human major white matter tracts

We report a qualitative visualization of the eleven major white matter tracts segmented from each connectome. These correspond to nine major tracts in each of the left and right hemispheres and two cross-hemispheric tracts. The tracts were segmented using a standardized methodology and atlases75,90,135. Files are saved in the .tck and .trk file formats. Previous work has shown that the application of different tractography models results in anatomical tracts with different morphologies, volumes, and streamline counts22,29,40,47,54. Figure 4a depicts these tracts as segmented for each subject, using each diffusion model, with colors corresponding to specific tracts. Figure 4b plots the number of streamlines, from the source whole-brain tractogram, identified as constituting each of these major tracts.

Fig. 4
figure 4

Anatomy of tracts and number of fascicles per tract. (a) The morphologies of several major tracts, overlaid with one another, as segmented from whole brain connectomes. Tractography generated for each dataset using DTIdeterministic, CSDdeterministic and CSDprobabilistic models. Colors correspond to individual tracts. (b) The streamline counts associated with several major tracts. Marker color corresponds to tractography model. Error bars generated from standard deviation across ten replications.

Network neuroscience

The aforementioned whole-brain tractograms represent a model of how the white matter of the brain connects cortical regions to one another. Together with a cortical parcellation, this rich body of connectivity information can be summarized into a network matrix, with brain regions or regions of interest representing network nodes, and measures related to connection weight or density corresponding to network edges. Graphical summaries like those presented in Fig. 5 provide a common way to visualize these connectivity patterns. This graph or network representation of connectomes enables a large array of analytic and modeling tools to probe connectivity motifs, modularity, centrality, vulnerability, and other network or graph-theoretic measures63,136,137,138. The O3D dataset features structural connectivity data, arranged as matrices, along with the numeric key indicating the cortical parcel names for each network node. Connectivity matrices were computed using two edge metrics: streamline count and streamline density88,119,123.
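As a minimal illustration of such graph measures, node degree and strength can be read directly off a connectivity matrix. The 4-node matrix below is a toy example; full analyses would typically rely on dedicated network analysis software.

```python
import numpy as np

# Toy symmetric density matrix for a 4-node network (hypothetical values).
W = np.array([
    [0.0, 0.2, 0.0, 0.1],
    [0.2, 0.0, 0.3, 0.0],
    [0.0, 0.3, 0.0, 0.4],
    [0.1, 0.0, 0.4, 0.0],
])

degree = (W > 0).sum(axis=0)  # number of connections per node
strength = W.sum(axis=0)      # summed edge weights per node (strength)
```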

Fig. 5
figure 5

Brain network matrices. Nine representative matrices of connectivity between anatomical regions defined in the Desikan-Killiany atlas91. Matrices report fiber density, computed as twice the number of streamlines touching a pair of regions divided by the combined size of the two regions (in number of brain voxels). Density is normalized across matrices; brighter colors indicate higher density. Networks depicted were generated for three representative subjects, one per dataset, using DTIdeterministic, CSDdeterministic, and CSDprobabilistic tractography.

Usage Notes

The O3D dataset is publicly available at the link provided in43. Data files can be downloaded organized according to the BIDS85 standard. Different data derivatives are distributed in formats such as NIfTI, TCK, TRK, or plain text. Access to the published data is currently supported via (i) a web interface and (ii) a command line interface (CLI).

The CLI can be installed on most Unix/Linux systems using the following command: npm install brainlife -g. The CLI can be used to query and download partial or full datasets. The following example shows the CLI commands to download all T1w datasets from a subject in the publication data Release 2:

bl pub query # this will return the publication IDs

bl bids download --pub 5c0ff604391ed50032b634d1 --subject 0001 --datatype neuro/anat/t1w

The following command downloads the data in the entire project (from Release 2) into BIDS format:

bl bids download --pub 5c0ff604391ed50032b634d1

Additional information about the CLI commands can be found in the online documentation. In addition, the project page provides read-only data supporting browsing, visualization, download, and additional processing. O3D uses data originating from projects with different licenses and user terms. The four datasets (subjects 1–4) originating from the Stanford University project are distributed under a CC-BY license. Access to the eight datasets (subjects 5–12) originating from the Human Connectome Project requires that users agree to the HCP Data Use Terms.