The open diffusion data derivatives, brain data upcycling via integrated publishing of derivatives and reproducible open cloud services

We describe the Open Diffusion Data Derivatives (O3D) repository: an integrated collection of preserved brain data derivatives and processing pipelines, published together using a single digital object identifier. The data derivatives were generated using modern diffusion-weighted magnetic resonance imaging (dMRI) data with diverse properties of resolution and signal-to-noise ratio. In addition to the data, we publish all processing pipelines (also referred to as open cloud services). The pipelines utilize modern methods for neuroimaging data processing (diffusion-signal modelling, fiber tracking, tractography evaluation, white matter segmentation, and structural connectome construction). The O3D open services allow cognitive and clinical neuroscientists to run the connectome mapping algorithms on new, user-uploaded data. Open source code implementing all O3D services is also provided to allow computational and computer scientists to reuse and extend the processing methods. Publishing both data derivatives and integrated processing pipelines promotes practices for scientific reproducibility and data upcycling by providing open access to the research assets for utilization by multiple scientific communities.

www.nature.com/scientificdata
taken to generate tractography, white matter tracts, and connectivity matrices. Additionally, new data can in principle be uploaded using the brainlife.io web portal and used to generate new results. Data can also be downloaded through a simple web or command line interface, formatted as BIDS (Brain Imaging Data Structure) 85,86. Finally, open source code and containers implementing the processing pipelines can be found at github.com/brainlife and hub.docker.com/u/brainlife.
The O3D repository is unique in that it focuses on publishing repeated-measures data derivatives for tractography, white matter tracts, and structural connectome matrices, all associated with open services publishing reproducible data processing pipelines and workflows. The O3D dataset provides a means for computational test-retest quantification 41,87-89 and reproducibility. To generate the data derivatives, three tractography algorithms were each run ten times on the same data source (individual brain). Because of the stochasticity of such algorithms, the results of each run are slightly different. This number of repeats has previously been shown to allow measurement of the variability and reliability of connectome mapping methods 21,22,40. The tractography results were evaluated using state-of-the-art methods 22,40 and compared against classical neuroanatomy atlases used to segment the major human white matter tracts 75,90. Finally, a series of connection matrices (i.e., brain networks) were generated using standard cortical parcellation methods 91.
Three example scenarios demonstrate transdisciplinary applications and show how investigators from different communities can utilize the O3D core set.
First, investigators developing network science algorithms 35,63,92-94 might have an interest in demonstrating the applicability or efficacy of their methods on brain network data, but lack the skills to process the raw diffusion data into connectivity matrices. The data derivatives provide an easily accessible point of entry by making available unthresholded brain connection matrices built using data from multiple individuals and different tracking methods. Second, investigators studying white matter neuroanatomy, or developing software for automated segmentation of white matter tracts, can use the data derivatives as complex test objects to compare the results of new algorithms with the state-of-the-art reference set represented by O3D 25,95-97. Finally, the data derivatives can be an essential education and training resource. They may be used by students and trainees in the neural and clinical sciences to learn about neuroanatomy or to develop practical analytic skills. All O3D data are compatible with most major neuroimaging software packages and can be conveniently loaded, processed and visualized 40,71-73,75,76.
The present descriptor introduces the O3D repository and some of the brainlife.io publication mechanisms, as necessary to describe the repository. The O3D reference repository will allow investigators from multiple scientific communities to explore brain data, perform visualization experiments, and replicate the data derivatives without having to first learn a full processing pipeline. This lowers the barrier to entry to computational neuroimaging, with the potential to advance algorithmic development, increase the involvement of underrepresented scholars, and facilitate training and validation 16,98. The repeated-measures data derivatives we plan to distribute as part of O3D will appeal to a diverse range of research interests because of the extensive know-how necessary to generate them. Consequently, they can be used by communities of basic, clinical, translational and computational scientists, including neuroscientists, students and trainees early in their careers 16,98,99.

Stanford dataset (STN).
We used data collected in four subjects at the Stanford Center for Cognitive and Neurobiological Imaging with a 3T General Electric Discovery 750 MRI scanner (General Electric Healthcare), using a 32-channel head coil (Nova Medical). dMRI data had whole-brain coverage and were acquired with a dual-spin echo diffusion-weighted sequence, using 96 diffusion-weighting directions and a diffusion weighting (b-value) of 2,000 s/mm² (TE = 96.8 ms). Data spatial resolution was set at 1.5 mm isotropic. Each dMRI image is the average of two measurements (NEX = 2). Ten non-diffusion-weighted images (b = 0) were acquired at the beginning of each scan 40,45,47.
Human connectome project datasets (HCP3T and HCP7T).
We used data collected in 8 subjects from the Human Connectome Project, using Siemens 3T and 7T MRI scanners. Only measurements from the 2,000 s/mm² shell were extracted from these data and used to generate the data derivatives in our repositories. Data from the 3T and 7T scanners have different properties of resolution (HCP3T: 90 gradient directions, 1.25 mm isotropic resolution; HCP7T: 60 gradient directions, 1.05 mm isotropic resolution) and have been described before, along with the methods used for data preprocessing 44,100-102.
Data preprocessing.
We developed a series of steps to process the anatomical and dMRI data files in a standardized manner for publication as part of the O3D repository. All original data were aligned to the plane defined by the anterior and posterior commissures, and the 2,000 s/mm² shell was selected and utilized for the subsequent analyses. All MRI data were oriented in Neurological coordinates (Left-Anterior-Superior) and the bvecs files were reoriented accordingly. The brainlife.io Apps implementing these operations can be found at 103-105 (see also Tables 1 and 2). No additional denoising, eddy current or head movement correction was applied beyond that performed by the data originators.
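The shell selection step described above can be illustrated with a minimal sketch. It assumes the b-values have already been read from a bvals file into a list; the function name `select_shell`, the tolerance of 100 s/mm², the b = 0 threshold of 50 s/mm², and the decision to keep the non-diffusion-weighted volumes are all assumptions made for illustration, not details taken from the O3D Apps.

```python
def select_shell(bvals, target=2000.0, tolerance=100.0, b0_max=50.0):
    """Return indices of volumes to keep: b=0 volumes plus the target shell.

    target, tolerance and b0_max are illustrative defaults (assumptions),
    expressed in s/mm^2.
    """
    keep = []
    for index, bval in enumerate(bvals):
        is_b0 = bval <= b0_max                      # non-diffusion-weighted volume
        in_shell = abs(bval - target) <= tolerance  # nominal b-values vary slightly
        if is_b0 or in_shell:
            keep.append(index)
    return keep


# Hypothetical multi-shell acquisition (values are made up for this example).
example_bvals = [5, 1000, 1995, 2005, 3010, 0, 2000]
print(select_shell(example_bvals))
```

The returned indices would then be used to subset the dMRI volumes and the matching bvecs columns, so that the three files stay aligned.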
Voxel signal reconstruction and tractography.
White matter fascicle tracking was performed using MRtrix 0.2.12 76. White- and gray-matter tissues were segmented with FreeSurfer 72 using the T1-weighted MRI images associated with each individual brain, and then resampled at the resolution of the dMRI data. Only voxels identified primarily as white-matter tissue were used to constrain tracking. We used three different tracking methods: (A) tensor-based deterministic tracking 106,107, (B) Constrained Spherical Deconvolution (CSD)-based deterministic tracking 76,108, and (C) CSD-based probabilistic tracking 108,109. Maximum harmonic orders Lmax = 10 (STN, HCP3T) and Lmax = 8 (HCP7T) were used 110,111. Other parameter settings used to perform tracking were: step size, 0.2 mm; maximum length, 200 mm; minimum length, 10 mm. The fiber orientation distribution function (fODF) amplitude cutoff was set to 0.1, and for the minimum radius of curvature we adopted the default values fixed by MRtrix for each kind of tracking: 2 mm (DTI deterministic), 0 mm (CSD deterministic), 1 mm (CSD probabilistic). We generated repeated measures of tractography derivatives by computing 10 candidate whole-brain fascicle groups for each individual brain, using 500,000 fascicles each. Apps implementing the methods can be found at 112-114.
Tractography evaluation.
We used the Linear Fascicle Evaluation (LiFE) method 40, implemented using the recently proposed ENCODE model 22, to optimize whole-brain tractograms. The LiFE method identifies fascicles that successfully contribute to prediction of the measured dMRI signal. It has been shown that only a percentage of the total number of fascicles generated through a single tractography method is supported by the properties of a given dataset 40,47. Because of this, we removed all fascicles making no significant contribution to explaining the diffusion measurements.
The percentage of streamlines retained in these optimized fascicle groups ranged between 10-20% (STN), 15-35% (HCP3T) and 20-40% (HCP7T). Apps implementing the method can be found at 115.
White matter tracts segmentation.
Twenty major human white matter tracts were segmented using the Automating Fiber-tract Quantification (AFQ) method 75. An additional step refined the segmented tracts by removing the fiber outliers. The following tracts were segmented: left and right Anterior Thalamic Radiation
Connection matrix construction.
We used tractograms evaluated by the LiFE method to build connectivity matrices. Connectivity matrices were built for each fascicle group using the 68 cortical regions from the Desikan-Killiany atlas, segmented in each individual using T1w MRI images and FreeSurfer 72,91. Fascicle terminations were mapped onto each of the 68 regions. All fibers connecting pairs of brain regions were identified and collected. Adjacency matrices were built using two measures: (A) count 119, the number of fascicles connecting each unique pair of regions; and (B) density, computed for each unique pair as twice the number of fascicles between the two regions divided by the sum of the number of voxels in the two atlas regions 88,94,119,120. Apps implementing the method can be found at 121.
[Table 1: DOIs of each O3D App as a service on brainlife.io (column: App goal).]
Open service for reproducible neuroscience: brainlife.io/apps.
We provide the full set of scripts used to process the O3D repository, both as open services (also referred to as Apps) that can be run on the brainlife.io platform (Table 1) and as the code and scripts implementing each App, available on github.com/brainlife (Table 2). Whereas the code can be downloaded to run the scripts locally, the Apps are embedded in the brainlife.io platform and can be reused to process data directly, avoiding the need to install software.
Brainlife.io Apps can be improved over time by users or developers, and for this reason their implementation can change. As such, brainlife.io uses github.com to keep track of App versions. We note that whereas the DOIs for the Apps reported in Table 1 direct users to the most recent version of each App available on the platform, the URLs in Table 2 direct users to the specific version of the code used to generate the published O3D dataset. To fully support the reproducibility of the O3D publication, we preserve for each release both the data and a snapshot of the code for each App. The O3D Apps, preserved with the original code version used to generate the repository, are reported in 43.
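The two edge measures defined under Connection matrix construction (count, and density as twice the fascicle count divided by the summed voxel counts of the two regions) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code of the O3D App: the function name `connectivity_matrices`, the input layout (one pair of region indices per fascicle), and the choice to ignore fascicles whose two endpoints fall in the same region are all hypothetical.

```python
def connectivity_matrices(endpoints, region_voxels):
    """Build symmetric count and density adjacency matrices.

    endpoints     -- list of (region_a, region_b) index pairs, one per fascicle
    region_voxels -- list with the number of voxels in each atlas region
    """
    n = len(region_voxels)
    count = [[0] * n for _ in range(n)]
    for a, b in endpoints:
        if a == b:
            continue  # assumption: within-region fascicles are not counted
        count[a][b] += 1
        count[b][a] += 1

    density = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # density = 2 * count / (voxels in region i + voxels in region j)
                density[i][j] = 2.0 * count[i][j] / (region_voxels[i] + region_voxels[j])
    return count, density


# Toy example with three regions (all values are made up).
toy_endpoints = [(0, 1), (1, 0), (0, 2), (1, 1)]
toy_voxels = [100, 300, 50]
toy_count, toy_density = connectivity_matrices(toy_endpoints, toy_voxels)
print(toy_count[0][1], toy_density[0][1])
```

With the 68 cortical regions used in O3D, the same function would return two 68 x 68 matrices per fascicle group. Normalizing by region size keeps large parcels from dominating the network simply because they offer more voxels for fascicles to terminate in.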

Data Records
Preserved O3D data and Apps can be downloaded at the web URL reported in 43. Upon download, data will automatically be organized as brainlife.io DataTypes (brainlife.io/docs/user/datatypes and brainlife.io/datatypes) as well as according to the specification defined by the Brain Imaging Data Structure (BIDS) 85. We note that, currently, BIDS does not officially provide a complete specification for diffusion-weighted magnetic resonance imaging and tractography derivatives.
According to the provisional BIDS specification for data derivatives (https://goo.gl/aFJ6vS), we have organized the files within folders, where each folder name refers to the name of the brainlife.io App used to generate the files. The naming convention adopted for the folders is based on three tokens: (A) the name of the github.com organization (e.g., brainlife); (B) the name of the repository of the App (e.g., app-life). All files generated by an App are aggregated in subfolders, one for each subject. Following the BIDS convention: (1) each file name includes a descriptor (_desc-) referring to a unique brainlife.io identifier, (2) additional information on the brainlife.io DataType is reported in the file name by tags (_tag-), (3) the repeated measures are denoted by the keyword run (_run-), (4) the last token of the file name indicates the BIDS datatype (e.g., _dwi-), (5) the suffix denotes the file format (e.g., .nii.gz), and (6)
Tractography.
The diffusivity signal reconstruction models generated the following volumetric images as NIfTI files: fractional anisotropy (_FA.nii.gz), the diffusion tensor model (model-DTI) and the constrained spherical deconvolution model (model-CSD). A brain mask and a white matter mask are also distributed at the dMRI data resolution (type-Brain, type-Whitematter). To increase the impact and compatibility of the O3D data files, two copies of each tractogram are distributed, one in MRtrix format (tck) and the other in TrackVis format (trk). One file is output per repeated-measure tractogram and tractography method (tag-dtstream, tag-sdstream, tag-sddprob).
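The file-naming convention described in this section (key-value tokens such as _desc-, _tag- and _run-, a final datatype token, and a format suffix) lends itself to a small parser. The sketch below is only an illustration: the function name `parse_derivative_name` is hypothetical, and the example file name, including its sub- token and identifier value, is invented rather than copied from the repository.

```python
def parse_derivative_name(filename):
    """Split a BIDS-style derivative file name into its parts.

    Returns a dict with the key-value entities (each key maps to a list,
    because a key such as 'tag' may appear more than once), the final
    datatype token, and the file extension.
    """
    stem, _, extension = filename.partition(".")
    tokens = stem.split("_")
    entities = {}
    for token in tokens[:-1]:
        key, _, value = token.partition("-")
        entities.setdefault(key, []).append(value)
    return {"entities": entities, "datatype": tokens[-1], "extension": extension}


# Hypothetical file name, constructed only to exercise the parser.
example = "sub-0001_desc-5a0e_tag-dtstream_run-03_dwi.nii.gz"
parsed = parse_derivative_name(example)
print(parsed["entities"]["run"], parsed["datatype"], parsed["extension"])
```

A parser of this kind makes it straightforward to group the ten repeated measures of a subject by their run value, or to select all files produced with one tractography method by filtering on the tag value.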

Technical Validation
In this section we provide both a qualitative and a quantitative evaluation of the data derivatives made available at 43. We show the data SNR in each dataset used, demonstrate the quality of alignment between dMRI and anatomy files, and show the diffusion signal in the voxel reconstruction, as well as several properties of the tractography models and of the major white matter tracts segmented.
Data preprocessing.
Data preprocessing was performed using a combination of previously published pipelines 22,40,45-47 (see Methods for additional details). Diffusion-weighted MRI data were aligned to the T1-weighted anatomical images (Fig. 1a left-hand columns, see Methods for additional details). The T1w images were used to segment the brain into different tissue types and brain regions 72. The total white matter volume was identified using the previously generated white matter tissue segmentation, and all subsequent analyses were performed within the white matter volume. Figure 1a shows how the white matter volume (mask) defined on the anatomical image (middle) aligns with the non-diffusion-weighted signal (B0) image of the diffusion MRI data (left-hand panel) in three example subjects, one per dataset.
To compare dMRI data quality across datasets, we computed the signal-to-noise ratio (SNR), comparing the mean attenuated dMRI signal to the background noise for both diffusion-weighted and B0 measurements (Fig. 1b), as described by 124,125. The brainlife.io App implementing this SNR method can be found at 126.
White matter microstructure reconstruction within the voxel.
The dMRI signal within each voxel was reconstructed using the two dominant models, namely the diffusion tensor (DTI 127) and constrained spherical deconvolution (CSD 110,111). Specifically, when applying CSD, we utilized an Lmax parameter of 10 for STN and 8 for HCP. These models provide different opportunities as well as limitations to characterize the dMRI signal and brain fibers. Figure 2 shows the quality of the estimated deconvolution kernel (a) and the fit of the CSD model in three representative axial brain slices, one per dataset (b). The kernel estimation is important for effective fiber distribution estimation and long-range tracking 128. Both dMRI reconstructions (DTI and CSD) have been manually curated by visual inspection to assure quality in the O3D dataset.
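As a rough illustration of the SNR comparison mentioned earlier in this section, the sketch below divides the mean signal in a set of tissue voxels by the standard deviation of the signal in background voxels. This is one common, simplified definition and is stated here as an assumption; the method of refs 124,125 and the corresponding brainlife.io App may select voxels differently or apply noise corrections. The function name `simple_snr` and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def simple_snr(signal_values, background_values):
    """Mean tissue signal divided by the standard deviation of background noise.

    Assumption: a plain ratio with no correction for the noise distribution.
    """
    noise = pstdev(background_values)
    if noise == 0:
        raise ValueError("background has zero variance; SNR is undefined")
    return mean(signal_values) / noise


# Made-up intensities: four tissue voxels and eight background voxels.
tissue = [100, 110, 90, 100]
background = [2, 4, 4, 4, 5, 5, 7, 9]
print(simple_snr(tissue, background))
```

Computing the ratio separately for the B0 volumes and for the diffusion-weighted volumes gives the two kinds of SNR values that the comparison across datasets refers to.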
Tractography.
Tractography was reconstructed using two established methods: deterministic and probabilistic tractography 76,106-108,129-131. We used deterministic tractography in combination with either the DTI or the CSD model. Probabilistic tractography was only used in combination with the CSD model. It has been established that application of these different methods results in the generation of white matter fascicles with different anatomical properties 29,40,47,54,132-134. The O3D dataset provides three tractography reconstructions for each individual brain. Tractography outputs were stored using common file formats (.tck and .trk) to allow investigators to compare, reuse and improve upon current tracking methods. Figure 3a provides a qualitative depiction of the whole-brain tractography reconstruction in a subject from each dataset. Figure 3b reports a quantitative comparison of the fascicle length distribution for whole-brain tractograms in the three example subjects in Fig. 3a.
Human major white matter tracts.
We report a qualitative visualization of the eleven major white matter tracts which were segmented from each connectome. These correspond to nine major tracts in the left and right hemispheres and two cross-hemispheric tracts. These tracts were segmented using a standardized methodology and atlases 75,90,135. Files are saved in the .tck and .trk file formats. Previous work has shown that the application of different tractography models results in anatomical tracts with different morphologies, volumes and streamline counts 22,29,40,47,54. Figure 4a depicts these tracts as segmented for each subject, using each diffusion model, with colors corresponding to specific tracts. Figure 4b plots the number of streamlines, from the source whole-brain tractogram, identified as constituting each of these major tracts.
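Because each tract is segmented from ten repeated tractograms, a streamline count per tract can be summarized by its mean and standard deviation across repeats, which is how error bars of the kind plotted in Fig. 4b are typically derived. The sketch below assumes the sample standard deviation; the function name `summarize_repeats`, the method labels, and the counts are invented for illustration and are not O3D results.

```python
from statistics import mean, stdev


def summarize_repeats(counts_by_method):
    """Return {method: (mean, sample standard deviation)} across repeats."""
    return {
        method: (mean(counts), stdev(counts))
        for method, counts in counts_by_method.items()
    }


# Made-up streamline counts for one tract; three repeats shown for brevity,
# whereas the repository provides ten per subject and method.
toy_counts = {
    "dti_deterministic": [10, 12, 14],
    "csd_probabilistic": [30, 30, 30],
}
for method, (average, spread) in summarize_repeats(toy_counts).items():
    print(method, average, spread)
```

A small spread relative to the mean indicates that the segmentation of that tract is stable across repeated tracking runs for the given method.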
Network neuroscience.
The aforementioned whole-brain tractograms represent a model of how the white matter of the brain connects cortical regions to one another. Together with a cortical parcellation, this rich body of connectivity information can be summarized into a network matrix, with brain regions or regions of interest representing network nodes, and measures related to connection weight or density corresponding to network edges. Graphical summaries like those presented in Fig. 5 provide a common way to visualize these connectivity patterns. This graph or network representation of connectomes enables a large array of analytic and modeling tools to probe connectivity motifs, modularity, centrality, vulnerability and other network or graph-theoretic measures 63,136-138. The O3D dataset features structural connectivity data, arranged as matrices, along with the numeric key indicating the cortical parcel names for each network node. Connectivity matrices were computed using two edge metrics: streamline count and streamline density 88
[Figure caption: The whole-brain connectome streamline count for each of the three tractography models applied to the STN, HCP3T and HCP7T datasets.]

Fig. 4 Anatomy of tracts and number of fascicles per tract. (a) The morphologies of several major tracts, overlaid with one another, as segmented from whole-brain connectomes. Tractography was generated for each dataset using DTI deterministic, CSD deterministic and CSD probabilistic models. Colors correspond to individual tracts. (b) The streamline counts associated with several major tracts. Marker color corresponds to tractography model. Error bars were generated from the standard deviation across ten replications.
[...] 91. Matrices report fiber density computed as twice the number of streamlines touching a pair of regions divided by the combined size of the two regions (in number of brain voxels). Density is normalized across matrices; brighter colors indicate higher density. Networks depicted were generated for three representative subjects, one per dataset, using DTI deterministic, CSD deterministic and CSD probabilistic tractography.
(2019) 6:69 | https://doi.org/10.1038/s41597-019-0073-y

Usage Notes
The O3D dataset is publicly available at the link provided in 43. Data files can be downloaded organized according to the BIDS 85 standard. The data derivatives are distributed in formats such as NIfTI, TCK, TRK or plain text. Access to the published data is currently supported via (i) a web interface and (ii) a Command Line Interface (CLI).
The brainlife.io CLI can be installed on most Unix/Linux systems using the following command:
    npm install brainlife -g
The CLI can be used to query and download partial or full datasets. The following example shows the CLI commands to download all T1w datasets from a subject in the publication data Release 2:
    bl pub query # this will return the publication IDs
    bl bids download --pub 5c0ff604391ed50032b634d1 --subject 0001 --datatype neuro/anat/t1w
The following command downloads the data in the entire project (from Release 2) into BIDS format:
    bl bids download --pub 5c0ff604391ed50032b634d1
See https://github.com/brainlife/cli for additional information about the brainlife.io CLI commands. In addition, https://brainlife.io/project/5a022fc99c0d250055709e9c/detail is the project page, with read-only data supporting browsing, visualization, download or additional processing.
O3D uses data originating from projects with different licenses and user terms. The four datasets (subjects 1-4) originating from the Stanford University project are distributed under a CC-BY license (creativecommons.org/licenses/by/4.0/). Access to the eight datasets originating from the Human Connectome Project (subjects 5-12) requires that users agree to the HCP Data Use Terms (humanconnectome.org/study/hcp-young-adult/data-use-terms).