Learning multi-site harmonization of magnetic resonance images without traveling human phantoms

Harmonization improves Magn. Reson. Imaging (MRI) data consistency and is central to effective integration of diverse imaging data acquired across multiple sites. Recent deep learning techniques for harmonization are predominantly supervised in nature and hence require imaging data of the same human subjects to be acquired at multiple sites. Data collection as such requires the human subjects to travel across sites and is hence challenging, costly, and impractical, more so when sufficient sample size is needed for reliable network training. Here we show how harmonization can be achieved with a deep neural network that does not rely on traveling human phantom data. Our method disentangles site-specific appearance information and site-invariant anatomical information from images acquired at multiple sites and then employs the disentangled information to generate the image of each subject for any target site. We demonstrate with more than 6,000 multi-site T1- and T2-weighted images that our method is remarkably effective in generating images with realistic site-specific appearances without altering anatomical details. Our method allows retrospective harmonization of data in a wide range of existing modern large-scale imaging studies, conducted via different scanners and protocols, without additional data collection.


Introduction
Magnetic resonance imaging (MRI) is a non-invasive and versatile technique that provides good soft tissue contrasts useful for diagnosis, prognosis, and treatment monitoring.Since MRI experiments are costly and time-consuming, modern large-scale MRI studies typically rely on multiple imaging sites to collaboratively collect data with greater sample sizes for more comprehensive coverage of factors that can affect study outcomes, such as age, gender, geography, socioeconomic status, and disease subtypes.Notable examples of multi-site studies include the Adolescent Brain Cognitive Development (ABCD) 1 , the Alzheimer's Disease Neuroimaging Initiative (ADNI) 2 , and the Australian Imaging, Biomarkers and Lifestyle Flagship Study of Aging (AIBL) 3 .
Multi-site data collection leads inevitably to undesirable elevation of non-biological variability introduced by differences in scanners 4 and imaging protocols 5 .Protocols can be prospectively harmonized by selecting for the individual sites the acquisition parameters that result in images with maximal inter-site consistency.However, prospective harmonization requires extensive data Estimate using a control region of the brain the latent factors of unwanted variation common to all voxels.
Combating batch effects (ComBat) 13, 14 Identify batch-specific transformation to express all data in a common space.

Learning
Machine learning Regression ensembles with patch learning for image contrast agreement (REPLICA) 15 Supervised training of a random forest for nonlinear regression of a target contrast.

NeuroHarmony 16
Supervised training of a random forest for nonlinear regression of a target contrast determined based on prescribed regions.

Deep learning DeepHarmony 17
Supervised training of a U-Net to produce images with a consistent contrast.
Disentangled Latent Space (DLS) 18  Supervised training of a deep neural network using disentangled anatomical and contrast components.

Unlearning of dataset bias 19
Supervised training of a deep neural network to learn scanner-invariant features and then reducing scanner influence on network predictions in tasks of interest.

MURD
Unsupervised harmonization of images using content and style disentanglement learned from jointly multi-site images.
acquisition for parameter tuning, needs to be performed before each study, and does not allow for correction of data collected in studies that have already taken place.Moreover, significant intersite variability can still occur in data collected with harmonized acquisition protocols simply due to irreconcilable differences between scanners 4 .Retrospective MRI harmonization 6 overcomes these limitations by performing post-acquisition correction and is hence applicable to existing studies for improving the accuracy, reliability, and reproducibility of downstream processing, analyses, and inferences.
Existing retrospective harmonization methods are either statistics-based or learning-based (Table 1).Statistics-based methods align intensity distributions via intensity normalization [7][8][9][10][11] or batch effect adjustment [12][13][14] .However, these methods are typically limited to whole-image, but not detail-specific, harmonization.Learning-based methods translate images between sites via nonlinear mappings determined using machine learning 15,16 or deep learning [17][18][19] , with or without supervision.Machine learning methods predict harmonized images using regression models learned,  i=1 , MURD disentangles each image set X i in a site-invariant content space C and a site-specific style space S i with the content features aligned in the content space C.This is achieved using content-style disentangled cycle translation (CS-DCT).b, Site-specific CS-DCT.An image from the i-th site x i is encoded in content space C and style space S i to obtain content features c i and style features s i .Style features of the j-th site s j (j = i) are generated using style generator G S j .s j and c i decoded to generate a harmonized image for the j-th site xj , which is in turn encoded to generate content features cj for reconstructing image xi with style features s i .c, Reference-specific CS-DCT.Unlike site-specific CS-DCT in b, style features s j are obtained by encoding a reference image x j in style space S j .d, The MURD framework.MURD implements CS-DCT by employing a site-shared content encoder, a site-specific style encoder, a site-shared generator, a site-specific style generator, and a site-specific discriminator for content encoding, style encoding, decoding, style generation, and adversarial learning, respectively.MURD is constrained by four types of losses: consistency losses L cyc , L cont and L sty , adversarial loss L adv , content alignment loss L ca , and identity loss L id .
typically with supervision, with hand-crafted features.In contrast, deep learning techniques automatically extract features pertinent to the harmonization task.Supervised methods typically require training data acquired from traveling human phantoms, which might not always be available for large-scale, multi-site, or longitudinal studies.Unsupervised deep learning techniques [20][21][22] determine mappings between sites using unpaired images and therefore avoid the need for traveling human phantom data.These methods are however unscalable to large-scale multi-site MRI harmonization as they typically learn pair-wise mappings between sites.For N sites, these methods learn N (N − 1) mappings for all site pairs and therefore require a large amount of data for learning the multitude of network parameters.These pair-wise methods are also ineffective by not fully and jointly utilizing complementary information available from all sites.
Here, we draw inspiration from recent advancements in multi-domain image-to-image translation [23][24][25] and introduce a unified framework for simultaneous multi-site harmonization using only a single deep learning model.Our method, called multi-site unsupervised representation disentangler (MURD), decomposes an image into anatomical content that is invariant across sites and appearance style (e.g., intensity and contrast) that is dependent on each site.Harmonized images are generated by combining the content of the original image with styles specific to the different sites.More specifically, encoding an image in site-invariant and site-specific feature spaces is achieved via two encoders, i.e., a content encoder that captures anatomical structures common across sites and a style encoder that captures style information specific to each site.A site-harmonized image is generated via a generator that combines the extracted content with the style associated with a target site.The target style can be specified by a reference image from a site or by a randomized style code generated by a style generator specific to each site.The latter allows multiple appearances to be generated in relation to natural style variations associated with each site.The network is trained with losses that are designed to promote full representation disentanglement and to maximally preserve structural details.MURD is summarized in Figure 1 and detailed in the Methods section.

Materials
We demonstrate the effectiveness of MURD on brain T1-and T2-weighted images of children 9 to 10 years of age acquired through the Adolescent Brain Cognitive Development (ABCD) study 1 using scanners from different vendors, including General Electric (GE), Philips, and Siemens.Informed consents were obtained from all participants 1 .Although the imaging protocols were matched across scanners at the different imaging sites 27 , inter-scanner variability in appearance is still significant (Figure 2).For simplicity, we group the images and treat them as coming from three virtual sites-the GE, Philips, and Siemens sites.We trained and tested MURD separately for T1-weighted images and T2-weighted images.For each modality, we trained MURD using a modest sample size of 20 images per vendor.Three datasets per modality were used for testing different aspects of MURD: (i) Validation Dataset -A small but diverse dataset of 10 images per vendor, carefully selected   to be structurally different from the images in the training dataset.The purpose of this testing dataset is to test the effectiveness of MURD beyond the training dataset.(ii) Generalizability Dataset -A large dataset of 1000 images randomly selected for each vendor.This testing dataset is allowed to be deliberately much larger than the training dataset to test the generalizability of MURD.(iii) Human Phantom Dataset -A traveling human phantom dataset of 1 subject scanned on both GE and Philips scanners, 5 subjects scanned on both GE and Siemens scanners, and 2 subjects scanned on both Philips and Siemens scanners.This testing dataset allows numerical evaluation to ensure that anatomical structures are preserved after harmonization.Images for each subject were aligned using Advanced Normalization Tools (ANTS) 28 .

MURD Harmonizes Contrasts and Preserves Details
We demonstrate the effectiveness of MURD harmonization on T1-and T2-weighted images (Figure 2).Note that MURD allows harmonization with respect to either a site or a reference image.The former amounts to harmonization with respect to an image randomly drawn from a site image distribution.MURD is remarkably effective in adapting contrast and preserving details when harmonizing images from a source site with a target site (Figures 2a and b).When the source and target sites are identical, MURD retain both contrast and details.When given a reference image, MURD harmonizes the contrasts of images from a source site with the reference image but preserves the anatomical details of the original images (Figures 2c and d).

MURD Outperforms State-of-the-Art Methods
To further demonstrate the effectiveness and superiority of MURD, we compared it with two closely-related state-of-the-art unsupervised methods, i.e., DRIT++ 26 and StarGAN-v2 25 , which are designed respectively for dual-domain and multi-domain image-to-image translation.DRIT++ and StarGAN-v2 were implemented and trained as described in original papers 25,26 .DRIT++ was trained once for every pair of sites.StarGAN-v2 was trained concurrently for multiple sites, similar to MURD.We compared the visual quality of the harmonized images using two metrics, i.e., the Frechét inception distance (FID) 29 and the kernel inception distance (KID) 30 , which reflect distribution discrepancy between two sets of images in a manner that correlates well with human judgment 31 .FID and KID are respectively computed based on the Frechét distance and maximum mean discrepancy (MMD) of features from the last average pooling layer of the Inception-V3 32 trained on the ImageNet 33 .FID and KID were computed at the slice level for the harmonized images with respect to the training images of the target sites.For site-specific harmonization, 10 target-site harmonized images were generated for each testing image of the source site with 10 randomly generated style codes.For reference-specific harmonization, each image in the source site was harmonized with respect to 10 reference images randomly selected from the testing images of a target site.3c and d) yields FID and KID values that are highly consistent with the validation dataset and close to the reference values.This indicates that, although trained using a modest dataset, the model is generalizable to a much larger dataset.

MURD Efficacy Validated via Traveling Human Phantom Data
The human phantom dataset allows direct quantitative evaluation of the effects of harmonization on consistency of structure and appearance.Based on multiple metrics, including mean absolute error (MAE), multi-scale structural similarity index (MS-SSIM), and peak signal-to-noise ratio (PSNR), MURD significantly outperforms DRIT++ and StarGAN-v2, indicating better harmonization of contrast and preservation of anatomical details (Figures 4a and b).
MURD Improves Tissue Segmentation Consistency Segmentation of brain tissues is sensitive to variation in image contrast and under-or over-segmentation might happen owing to differences in acquisition protocols.We applied Brain Extraction Tool (BET) 34 and FMRIB's Automated Segmentation Tool (FAST) 35 on T1-and T2-weighted images in the human phantom dataset for brain extraction and tissue segmentation.Tissue segmentation consistency before and after harmonization was measured using the Dice similarity coefficient (DSC) with the tissue segmentation maps from the target site as references.The results (Figures 4c and d) indicate that DSCs are improved remarkably by harmonization using MURD.

MURD Supports Continuous and Incremental Harmonization
We further demonstrate that MURD completely disentangles content and style information by visualization via continuous and incremental harmonization.We generated images with between-site appearances to aid visual inspection of how appearance changes gradually between sites and whether unwanted anatomical alterations are introduced in the process.

Discussion
We have demonstrated that MURD is remarkably effective in harmonizing MR images by removing non-biological site differences and at the same time preserving anatomical details.MURD network training involves only site-labeled images and requires no traveling human phantom data.This flexibility allows data from existing large-scale studies to be harmonized retrospectively without requiring additional data to be collected.
We have shown that MURD yields superior performance over DRIT++ 26 and StarGAN-v2 25 .For every pair of sites, DRIT++ embeds images in a site-invariant content space capturing information shared across sites and a site-specific style space.The encoded content features extracted from an image of one site are combined with style features from another site to synthesize the corresponding harmonized image.Learning is unsupervised and hence paired data is not required.However, DRIT++ is not scalable due to the need to learn all mappings for all site pairs.DRIT++ is also ineffective because it cannot fully utilize the entire training data and can only learn from two sites at each time, causing it to miss global features that can be learned from images of all sites.Failure to fully utilize training data likely limits the quality of generated images.Unlike DRIT++, StarGAN-v2 25 is scalable and performs image-to-image translations for multiple sites using only a single model.It has been applied to the problem of MRI harmonization 36 with promising results.In addition to greater scalability, StarGAN-v2 generates images of higher visual quality owing to its ability to jointly consider the information offered by images from all sites.StarGAN-v2, however, does not explicitly disentangle images into structural and appearance information.This introduces the possibility of altering anatomical details during harmonization via style transfer.In contrast, MURD enforces explicit disentanglement of content and style features by jointly considering images from all sites, allowing it to produce harmonized images with diverse appearances with significantly better preservation of anatomical details (see Supplementary Figures 1-3).Disentanglement safeguards harmonization against altering image anatomical contents and allows gradual and controllable harmonization via interpolation of style features.
The harmonization target is specified for DRIT++ and StarGAN-v2 by a reference image.The appearance of the harmonized image is determined by the style features extracted from the reference image.In addition to a reference image, the harmonization target for MURD can be specified by a site label, which determines the output branch of the style generator and the style encoder.A latent code sampled from the standard Gaussian distribution determines an appearance specific to the site.
A recent MRI harmonization method, called CALAMITI 37 , relies on intra-site supervised image-to-image translation and unsupervised domain adaptation for multi-site harmonization.This requires training a disentangled representation model with intra-site multi-contrast images (T1and T2-weighted images) of the same subjects and retraining the model for a new site via domain adaptation 38 .Unlike CALAMITI, MURD requires only images from a single contrast and can learn multi-site harmonization simultaneously without needing fine-tuning or retraining.

Methods
We consider the multi-site harmonization problem as image-to-image translation among multiple sites and propose an end-to-end multi-site unsupervised representation disentangler (MURD, see Figure 1) to learn content-style disentangled cycle translation mappings that translate images forward and backward between any two sites.We employ two encoders to respectively embed each image in a site-invariant content space, which captures anatomical information, and a site-specific style space, which captures appearance information, and a generator to produce a harmonized image using the encoded content features and site-specific style features.
Content-Style Disentangled Cycle Translation (CS-DCT) Given image sets {X i } N i=1 from N distinct sites, MURD utilizes content-style disentangled cycle translation (CS-DCT) to disentangle each image set X i in a site-invariant content space C and a site-specific style space S i (Figure 1a).CS-DCT, realized with sequential forward and backward translation, can be site-specific (Figure 1b) or reference-specific (Figure 1c).MURD jointly learns site-invariant content features c i ∈ C and site-specific style features s i ∈ S i from image x i ∈ X i , and utilize generator G to construct the harmonized image xj ∈ X j (j = i) using content features c i and style features s j ∈ S j in forward translation.Style features s j are generated by a style generator G S j or extracted from a reference image x j ∈ X j .In backward translation, MURD extracts the content features cj from the harmonized image xj , which are in turn fed with style features s i to generator G to reconstruct image xi , which is required to be consistent with the input image x i .
Multi-site Unsupervised Representation Disentangler (MURD) MURD implements CS-DCT in a end-to-end manner via five modules (Figure 1d): (i) A content encoder E C , shared for all sites, to extract content features from an image in a site-invariant space C; (ii) A style encoder E S i for each site i to extract style features from an image in a site-specific style space S i ; (iii) A generator G, shared for all sites, to synthesize images using content and style features; (iv) A style generator G S i for each site i to yield style features s i that reflect the appearance style of images from the site; and (v) A discriminator D i for each site i to distinguish between real and generated images.
Specifically, given an input image x i from image set X i , content encoder E C and style encoder E S i , respectively, extract content features c i and style features s i .For a random site j = i, style generator G S j takes a latent code z randomly sampled from a standard normal distribution N (0, 1) as input to create a j-site style features s j .With content features c i and style features s j , generator G constructs harmonized image xj , which the discriminator D j then classifies as being either real or fake using an adversarial loss L adv : Content encoder E C and style encoder E S j are used to extract content features cj and style features sj from xj .The consistency between c i and cj is enforced by a pixel-wise content consistency loss The consistency between s j and sj is enforced by a pixel-wise style consistency loss L sty : Content site-invariance is enforced by a content alignment loss L ca : where KL(• •) is the Kullback-Leibler divergence and I is an identity matrix.Content features c i are randomly perturbed during the feed-forward step: The reconstructed image xi of x i is produced by generator G using content features cj and style features s i .The consistency between x i and xi is ensured by a pixel-wise and gradient-wise cycle consistency loss L cyc : where g(•) is the image gradient function and λ g is the loss weight for the gradient loss term.Furthermore, an identity image xi can also be constructed by generator G using content features c i and style features s i , which are identical to x i when c i and s i are completely disentangled.We devise an identity loss to measure the pixel-wise difference between x i and xi as All modules are optimized with total loss function L: where λ cont , λ ca , λ sty , λ cyc , and λ id are loss weights used for controlling the contributions of the respectively terms in the loss function.Training ends when all modules are optimized, such that the optimized discriminator classifies the harmonized images into one of two categories with equal probability.See Supplementary Figures 4-7 for the effects of the individual loss functions.
During inference, content features c i of input image x i are extracted using content encoder E C , and style features s j are either generated using style generator G S j or extracted from a reference image x j .Harmonized image xj is created by generator G using c i and s j .
Network Architecture The architectures of the components of MURD are described next.

Content Encoder
Content encoder E C is shared among all sites and extracts the content features c i of an input image x i through three convolutional blocks and 4 residual blocks.Each convolutional block is composed of three sequential layers, i.e., convolution, instance normalization (IN) 39 , and leaky ReLU (LReLU) activation.Each residual block consists of a nonlinear mapping branch and a shortcut branch.The nonlinear mapping branch is constructed by a series of layers, i.e., convolution, IN, LReLU, convolution, and IN.The shortcut branch is an identity mapping of the block input.We use an IN layer instead of a batch normalization layer 39 to accelerate model convergence and maintain independence between features.All normalized feature maps are activated by LReLU with a negative slope of 0.2.

Style Encoder Style encoder E S
i is composed of site-shared and site-specific subnetworks.The site-shared subnetwork is constructed by a convolution layer, four pre-activation residual blocks, and a global average pooling layer.The site-specific subnetwork is composed of N fully connected layers corresponding to the N individual sites.The pre-activation residual block is constructed by integrating LReLU activation followed by a convolution layer with unit stride into a residual block, where an average pooling layer is adopted to downsample the intermediate features and the shortcut branch is implemented by an average pooling layer and a convolution layer with unit kernel size and stride.Note that we extract style features without IN layers since IN removes feature means and variances, which contain important style information.The output dimension is set to 64.Style features s i have the same dimension.
Generator Site-shared generator G merges content features c i and style features s j to create a harmonized image xj using four residual blocks identical to the content encoder, two upsampling blocks and a convolution layer with hyperbolic tangent (tanh) activation.The upsampling block consists of deconvolution, IN, and LReLU activation.
Style Generator Style generator G S j consists of a multilayer perception (MLP) with N output branches.Six fully connected layers are shared among all sites, followed by a fully connected layer for each site.We set the dimensions of the latent code, the hidden layer, and the style features to 16, 256, and 64, respectively.We randomly sample the latent code z from the standard Gaussian distribution.
Discriminator Discriminator D j consists of site-shared and site-specific subnetworks, similar to the style encoder.Specifically, three convolutional blocks and a global average pooling are shared among all sites, followed by a specific fully connected layer for real/fake classification for each site.
Implementation Details MURD was implemented using Tensorflow.Evaluation was based on a machine with a CPU (i.e., Intel i7-8700K) and a GPU (i.e., NVIDIA GeForce GTX 1080Ti 11GB).The ADAM optimizer with 1 × 10 −4 learning rate was utilized for minimization based on the loss function.For all datasets, we used λ cont = 10, λ ca = 0.01, λ sty = 10, λ cyc = 10, λ g = 0.1, and λ id = 10 for the corresponding losses.For all datasets, every three adjacent slices in each volume were inserted into three channels.Each channel was then normalized to have a range between -1 and 1 and zero-padded to 256×256.

Figure 1 :
Figure 1: Overview of MURD.a, Multi-site representation disentanglement.Given image sets from N sites {X i } Ni=1 , MURD disentangles each image set X i in a site-invariant content space C and a site-specific style space S i with the content features aligned in the content space C.This is achieved using content-style disentangled cycle translation (CS-DCT).b, Site-specific CS-DCT.An image from the i-th site x i is encoded in content space C and style space S i to obtain content features c i and style features s i .Style features of the j-th site s j (j = i) are generated using style generator G S j .s j and c i decoded to generate a harmonized image for the j-th site xj , which is in turn encoded to generate content features cj for reconstructing image xi with style features s i .c, Reference-specific CS-DCT.Unlike site-specific CS-DCT in b, style features s j are obtained by encoding a reference image x j in style space S j .d, The MURD framework.MURD implements CS-DCT by employing a site-shared content encoder, a site-specific style encoder, a site-shared generator, a site-specific style generator, and a site-specific discriminator for content encoding, style encoding, decoding, style generation, and adversarial learning, respectively.MURD is constrained by four types of losses: consistency losses L cyc , L cont and L sty , adversarial loss L adv , content alignment loss L ca , and identity loss L id .

Figure 2 :
Figure 2: Harmonization of T1-and T2-weighted Images.Site-specific harmonization of a, T1weighted images and b, T2-weighted images.Reference-specific harmonization of c, T1-weighted images and d, T2-weighted images.The original images are shown in the first column and the harmonized images are shown for GE, Philips and Siemens respectively in the second to fourth columns.MURD is remarkably effective in harmonizing contrasts and preserving details.

Figure 3 :
Figure3: Numerical Evaluation of Harmonization Outcomes.Quantitative evaluation conducted for the GE, Philips, and Siemens sites using FID and KID as metrics for a, T1-weighted images (n = 600 slices per site) and b, T2-weighted images (n = 600 slices per site) from the validation dataset, and c, T1-weighted images (n = 60000 slices per site) and d, T2-weighted images (n = 60000 slices per site) from the generalizability dataset.Bullseyes show the means and error bars show the standard errors on the means.MURD, both site-specific (SS) and reference-specific (RS), yields lower FID and KID values that are closer to the reference values than DRIT++26 and StarGAN-v225 .The FID and KID values for the generalizability dataset are largely consistent with those of the validation dataset.

Figure 4 :Figure 5 :
Figure 4: Validation via the Traveling Human Phantom Dataset.Evaluation of harmonized a, T1-weighted images and b, T2-weighted images from the traveling human phantom dataset for the following harmonization tasks: GE to Philips (GP, n = 60 slices), Philips to GE (PG, n = 60 slices), GE to Siemens (GS, n = 600 slices), Siemens to GE (SG, n = 600 slices), Philips to Siemens (PS, n = 120 slices), and Siemens to Philips (SP, n = 120 slices).Segmentation consistency of c, T1-weighted images and d, T2-weighted images from the traveling human phantom dataset with and without harmonization: GP (n = 1 volume), PG (n = 1 volume), GS (n = 5 volumes), SG (n = 5 volumes), PS (n = 2 volumes), and SP (n = 2 volumes).The lines show the means and the shaded regions show the standard errors on the means.MURD yields the best performance in terms of MAE, MS-SSIM, and PSNR.The improvement in DSCs indicates that MURD harmonization significantly increases consistency of tissue segmentation.

Table 1 :
Existing MRI harmonization methods Intermediate style features are calculated based on the style features of Site A, s A , and Site B, s B , via (1 − β)s A + βs B , where β ∈ [0, 1].The intermediate style features and the content features of an image are then used to generate an intermediate image.The results (Figures 5) indicate that MURD gradually and smoothly changes image appearance without altering anatomical details.