Federated learning enables big data for rare cancer boundary detection

Although machine learning (ML) has shown promise across disciplines, generalizability to out-of-sample data remains a concern. This is currently addressed by sharing multi-site data, but such centralization is challenging/infeasible to scale due to various limitations. Federated ML (FL) provides an alternative paradigm for accurate and generalizable ML, by only sharing numerical model updates. Here we present the largest FL study to date, involving data from 71 sites across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, reporting the largest such dataset in the literature (n = 6,314). We demonstrate a 33% delineation improvement for the surgically targetable tumor, and 23% for the complete tumor extent, over a publicly trained model. We anticipate our study to: 1) enable more healthcare studies informed by large diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further analyses for glioblastoma by releasing our consensus model, and 3) demonstrate the FL effectiveness at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing.

Keywords: federated learning, deep learning, convolutional neural network, segmentation, brain tumor, glioma, glioblastoma, FeTS, BraTS

Advances in machine learning (ML), and particularly deep learning (DL), have shown promise in addressing complex healthcare problems [1-14]. However, there are concerns about generalizability on data from sources that did not participate in model training, i.e., "out-of-sample" data 15,16. Literature indicates that training robust and accurate models requires large amounts of data [17-19], the diversity of which affects model generalizability to "out-of-sample" cases 20. To address these concerns, models need to be trained on data originating from numerous sites representing diverse population samples. The current paradigm for such multi-site collaborations is "centralized learning" (CL), in which data from different sites are shared to a centralized location following inter-site agreements [20-23]. However, such data centralization is difficult to scale (and might not even be feasible), especially at a global scale, due to concerns 24,25 relating to privacy, data ownership, intellectual property, technical challenges (e.g., network and storage limitations), and compliance with varying regulatory policies (e.g., the Health Insurance Portability and Accountability Act (HIPAA) of the United States 26 and the General Data Protection Regulation (GDPR) of the European Union 27). In contrast to this centralized paradigm, "federated learning" (FL) describes an approach where models are trained by only sharing model parameter updates from decentralized data (i.e., each site retains its data locally) 24,25,28-30, without sacrificing performance when compared to CL-trained models 25,29,31-35. Thus, FL can offer an alternative to CL, potentially creating a paradigm shift that alleviates the need for data sharing, and hence increases access to geographically distinct collaborators, thereby increasing the size and diversity of data used to train ML models.
FL has tremendous potential in healthcare 36,37, particularly towards addressing health disparities, under-served populations, and "rare" diseases 38, by enabling ML models to gain knowledge from ample and diverse data that would otherwise not be available. With that in mind, here we focus on the "rare" disease of glioblastoma, and particularly on the detection of its extent using multi-parametric magnetic resonance imaging (mpMRI) scans 39. While glioblastoma is the most common malignant primary brain tumor [40-42], it is still classified as a "rare" disease, as its incidence rate (i.e., 3/100,000 people) is substantially lower than the rare-disease definition rate (i.e., < 10/100,000 people) 38. This means that single sites cannot collect large and diverse datasets to train robust and generalizable ML models, which necessitates collaboration between geographically distinct sites. Despite extensive efforts to improve the prognosis of glioblastoma patients with intense multimodal therapy, their median overall survival is only 14.6 months after standard-of-care treatment, and 4 months without treatment 43. Although the subtyping of glioblastoma has been improved 44 and the standard-of-care treatment options have expanded during the last twenty years, there have been no substantial improvements in overall survival 45. This reflects the major obstacle in treating these tumors, namely their intrinsic heterogeneity 40,42, and the need for analyses of larger and more diverse data towards better understanding the disease. In terms of radiologic appearance, glioblastomas comprise 3 main sub-compartments, defined as i) the "enhancing tumor" (ET), representing the vascular blood-brain barrier breakdown within the tumor, ii) the "tumor core" (TC), which includes the ET and the necrotic (NCR) part, and represents the surgically relevant part of the tumor, and iii) the "whole tumor" (WT), which is defined by the union of the TC and the peritumoral edematous/infiltrated tissue (ED), and represents the complete tumor extent relevant to radiotherapy (Fig. 1.b). Detecting these sub-compartment boundaries therefore defines a multi-parametric multi-class learning problem [46-50], and is a critical first step towards further quantifying and assessing this heterogeneous rare disease and ultimately influencing clinical decision-making.
Co-authors in this study have previously introduced FL in healthcare in a simulated setting 29 and evaluated different FL approaches 25 for the same use case as the present study, i.e., detecting the boundaries of glioblastoma sub-compartments. Findings from these studies supported the superiority of the FL approach used in the present study (i.e., based on an aggregation server 24,28), which had almost identical performance to CL. The present study describes the largest to-date global FL effort to develop an accurate and generalizable ML model for detecting glioblastoma sub-compartment boundaries, based on 25,256 MRI scans (over 5 TB) of 6,314 glioblastoma patients from 71 geographically distinct sites across 6 continents (Fig. 1.a). Notably, this describes the largest and most diverse dataset of glioblastoma patients ever considered in the literature. It was the use of FL that successfully enabled our ML model to gain knowledge from such an unprecedented dataset. The extended global footprint and the task complexity are what set this study apart from current literature, since it dealt with a multi-parametric multi-class problem with reference standards that require expert clinicians following an involved manual annotation protocol, rather than recording a categorical entry from medical records 30,51. Moreover, varying characteristics of the mpMRI data due to scanner hardware and acquisition protocol differences 52,53 were handled at each collaborating site via established harmonized preprocessing pipelines [54-57].
The scientific contributions of this manuscript can be summarized as: i) the insights garnered during this work, which can pave the way for more successful FL studies of increased scale and task complexity, ii) making a potential impact on the treatment of the rare disease of glioblastoma, by eventually publicly releasing an optimized trained consensus model for use in resource-constrained clinical settings, and iii) demonstrating the effectiveness of FL at such scale and task complexity as a paradigm shift redefining multi-site collaborations, while alleviating the need for data sharing.
Cases with any of these sequences missing were not included in the study. Note that no inclusion/exclusion criterion was applied relating to the type of acquisition (i.e., both 2D axial and 3D acquisitions were included, with a preference for 3D if available), or the exact type of sequence (e.g., MP-RAGE vs SPGR). The only exclusion criterion was for T1-FLAIR scans, which were intentionally excluded to avoid mixing varying tissue appearance across native T1-weighted scans due to the type of sequence.
The eligibility of collaborating sites to participate in the federation was determined based on data availability, and approval by their respective institutional review board. Fifty-five sites participated as independent collaborators in the study, defining a dataset of 6,083 cases. The MRI scanners used for data acquisition were from multiple vendors (i.e., Siemens, GE, Philips, Hitachi, Toshiba), with magnetic field strengths ranging from 1T to 3T. The data from all 55 collaborating sites followed a male:female ratio of 1.47:1, with ages ranging between 7 and 94 years.
From all 55 collaborating sites, 49 were chosen to be part of the training phase, and 6 sites were categorized as "out-of-sample", i.e., none of these were part of the training stage. These specific 6 out-of-sample sites (Site IDs: 8, 11, 19, 20, 21, 43) were allocated based on their availability, i.e., they had indicated an expected delayed participation, rendering them optimal for validating model generalizability. One of these 6 out-of-sample sites (Site 11) contributed aggregated a priori data from a multi-site randomized clinical trial for newly diagnosed glioblastoma (ClinicalTrials.gov Identifier: NCT00884741, RTOG0825 58,59, ACRIN6686 60,61), with inherent diversity benefiting the intended generalizability-validation purpose. The American College of Radiology (ACR, Site 11) serves as the custodian of this trial's imaging data on behalf of ECOG-ACRIN, who made the data available for this study. Following screening for availability of the 4 required mpMRI scans with sufficient signal-to-noise ratio, judged by visual observation, a subset of 362 cases from the original trial data was included in this study. The out-of-sample data totaled 590 cases intentionally held out of the federation, with the intention of validating the consensus model on completely unseen cases. To facilitate further such generalizability evaluation without burdening the collaborating sites, a subset consisting of 332 cases (including the multi-site clinical trial data provided by ACR) from this out-of-sample data was aggregated, to serve as the "centralized out-of-sample" dataset. Furthermore, the 49 sites participating in the training phase define a collective dataset of 5,493 cases. The exact 49 site IDs are: 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13, 14, 15, 16, 17, 18, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 48, 49, 50, 52, 53, 54, 56, 59, 60. These cases were automatically split at each site following a 4:1 ratio between cases for training and local validation. During the federated training phase, the data used for the public initial model were also included as a dataset from a separate node, such that the contribution of sites providing the publicly available data is not forgotten within the global consensus model. This results in the final consensus model being developed based on data from 71 sites, over a total dataset of 6,314 cases.
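The automatic 4:1 (training : local validation) split described above can be sketched as follows. This is a minimal, hypothetical illustration with made-up case identifiers, not the actual splitting code of the study's platform:

```python
import random

def split_cases(case_ids, train_fraction=0.8, seed=42):
    """Deterministically split a site's cases into training and local
    validation subsets following a 4:1 (80%/20%) ratio."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = case_ids[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical site with 100 cases -> 80 training, 20 local validation
train, val = split_cases([f"case_{i:03d}" for i in range(100)])
```

A fixed seed matters here because each site splits its own data independently; reproducibility ensures the same local validation cases are held out in every federated round.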

Harmonized Data Preprocessing
Once each collaborating site identified their local data, they were asked to use the preprocessing functionality of the software platform we provided. This functionality follows the harmonized data-preprocessing protocol defined by the BraTS challenge [54-57], to account for inter-site acquisition protocol variations, e.g., 3D vs. 2D axial-plane acquisitions.

The neural network architecture
The model trained to delineate the different tumor sub-compartments was based on the popular 3D U-Net with residual connections (3D-ResUNet) [62-66]. The network had 30 base filters and a learning rate of lr = 5 × 10⁻⁵, optimized using the Adam optimizer 67. For the loss function used in training, we used the generalized DSC score 68,69 (represented mathematically in Eq. 1) on the absolute complement of each tumor sub-compartment independently. Such a mirrored DSC loss has been shown to better capture variations in smaller regions 70. No penalties were used in the loss function, due to our use of the 'mirrored' DSC loss [71-73]. The final layer of the model was a sigmoid layer, providing three output channels for each voxel in the input volume, one channel per tumor sub-compartment. While the generalized DSC score was calculated using a binarized version of the output (thresholding the sigmoid value at 0.5) for the final prediction, we used the floating-point DSC 74 during the training process.
DSC(RL, PM) = 2 · ‖RL ∘ PM‖₁ / (‖RL‖₁ + ‖PM‖₁)    (1)

where RL serves as the reference label, PM is the predicted mask, ∘ is the Hadamard product 75 (i.e., component-wise multiplication), and ‖x‖₁ is the L1-norm 76 (i.e., the sum of the absolute values of all components).
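To make Eq. 1 and the 'mirrored' variant concrete, the following is a simplified NumPy sketch (not the project's actual implementation, which operates on 3D tensors during training):

```python
import numpy as np

def generalized_dsc(reference, prediction, eps=1e-7):
    """Eq. 1: DSC = 2*||RL o PM||_1 / (||RL||_1 + ||PM||_1),
    where o denotes the element-wise (Hadamard) product."""
    intersection = np.sum(reference * prediction)
    denom = np.sum(np.abs(reference)) + np.sum(np.abs(prediction))
    return 2.0 * intersection / (denom + eps)  # eps guards empty masks

def mirrored_dsc_loss(reference, prediction):
    """'Mirrored' DSC loss: the generalized DSC computed on the absolute
    complement of a binary sub-compartment mask, which emphasizes
    variations in smaller regions."""
    return 1.0 - generalized_dsc(1.0 - reference, 1.0 - prediction)
```

During training the `prediction` argument would be the floating-point sigmoid output; for the final prediction it is binarized at 0.5, as described above.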

The Federation
The collaborative network of the present study spans 6 continents (Fig. 1), with data from 71 geographically distinct sites. The training process was initiated when each collaborator securely connected to a central aggregation server, which resided behind a firewall at the University of Pennsylvania. As soon as the secure connection was established, the public initial model was passed to the collaborating site. Using FL based on an aggregation server, collaborating sites then trained the same network architecture on their local data for one epoch, and shared model updates with the central aggregation server. The central aggregation server received model updates from all collaborators, combined them (by averaging model parameters), and sent the consensus model back to each collaborator to continue their local training. Each such iteration is called a "federated round". After not observing any meaningful changes since round 42, we stopped the training after a total of 73 federated rounds. Additionally, we performed all operations on the aggregator on secure hardware (by leveraging trusted execution environments 77), in order to increase the trust of all parties in the confidentiality of the model updates being computed and shared, as well as to increase confidence in the integrity of the computations being performed 78.
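The aggregation step described above, averaging model parameters across collaborators, can be sketched as follows. This minimal illustration averages uniformly; it is a didactic sketch, not the study's aggregation code (practical federations often weight each site's update, e.g., by its number of training cases):

```python
import numpy as np

def average_model_updates(site_updates):
    """Combine per-site model parameters into a consensus model by
    element-wise averaging of each named parameter across all sites."""
    n_sites = len(site_updates)
    consensus = {}
    for name in site_updates[0]:
        consensus[name] = sum(update[name] for update in site_updates) / n_sites
    return consensus

# One federated round with two hypothetical sites:
site_a = {"conv1.weight": np.array([1.0, 2.0]), "conv1.bias": np.array([0.0])}
site_b = {"conv1.weight": np.array([3.0, 4.0]), "conv1.bias": np.array([2.0])}
consensus = average_model_updates([site_a, site_b])
```

Crucially, only these numerical parameter dictionaries ever leave a site; the imaging data itself never does.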
The federated training was initialized using a "public initial model" trained on 231 cases from 16 sites. This was done, as opposed to starting from a random initialization point, to facilitate faster convergence of the model performance 79,80. After the training process was complete, the "final consensus model" was obtained after model selection from all the global consensus models obtained for each federated round. Thus, the final consensus model was developed on 6,314 cases from 71 sites. To quantitatively evaluate the performance of the trained models, 20% of the total cases contributed by each participating site were excluded from the model training process and used as "local validation data". To further evaluate the generalizability of the models on unseen data, 6 sites were not involved in any of the training stages, representing an unseen "out-of-sample" data population of 590 cases. To facilitate further evaluation without burdening the collaborating sites, a subset (n = 332) of these cases was aggregated to serve as a "centralized out-of-sample" dataset. Model performance was quantitatively evaluated here using the Dice Similarity Coefficient (DSC), which assesses the spatial agreement between the model's prediction and the reference standard for each of the 3 tumor sub-compartments (ET, TC, WT).
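The model-selection step mentioned above (choosing the final consensus model among the per-round global consensus models) can be sketched as follows. The selection criterion shown here, the highest mean validation DSC, is an illustrative assumption, and the per-round scores are hypothetical:

```python
def select_final_model(round_scores):
    """Pick the federated round whose consensus model achieved the best
    average validation DSC. round_scores maps a round number to the
    list of per-case DSC values obtained with that round's model."""
    best_round = None
    best_mean = float("-inf")
    for round_id, scores in round_scores.items():
        mean_dsc = sum(scores) / len(scores)
        if mean_dsc > best_mean:
            best_round, best_mean = round_id, mean_dsc
    return best_round, best_mean

# Hypothetical per-round validation scores:
scores = {41: [0.80, 0.82], 42: [0.85, 0.87], 73: [0.85, 0.86]}
best_round, best_dsc = select_final_model(scores)
```

Selecting on held-out local validation data, rather than training loss, is what ties model selection to generalizability rather than memorization.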

Model Run-time in Clinical Environments
Clinical environments typically have constrained computational resources, such as limited availability of specialized hardware (e.g., DL acceleration cards) and of increased memory, which affects the run-time performance of DL inference workloads. Thus, taking into consideration the potential deployment of the final consensus model in such low-resource settings, we decided to proceed with a single 3D-ResUNet, rather than an ensemble of multiple models. This decision ensured a reduced computational burden compared with running multiple models, which is typically done in academic research projects [54-57].
To further facilitate use in low-resource environments, we plan to publicly release a run-time-optimized 81 version of the final consensus model. For this model, graph-level optimizations (i.e., operator fusion) were initially applied, followed by optimizations for low-precision inference, i.e., converting the single-precision floating-point model to a fixed-precision 8-bit integer model (a process known as "quantization" 82). In particular, we used accuracy-aware quantization 83, where model layers were iteratively scaled to a lower-precision format. These optimizations yielded several run-time performance benefits, such as lower inference latency (a platform-dependent 4.48× average speedup and 2.29× reduced memory requirement, when compared with the original consensus model) and higher throughput (equal to the 4.48× speedup, since the batch size used is equal to 1), while the trade-off was an insignificant (p_Average < 7 × 10⁻⁵) drop in the average DSC.
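To illustrate the core idea of the 8-bit quantization mentioned above, the following sketch shows plain affine (scale/zero-point) quantization of a float tensor to int8. This is a didactic simplification, not the accuracy-aware OpenVINO procedure actually used in the study:

```python
import numpy as np

def quantize_int8(x):
    """Affine quantization: map float values to int8 via a scale and
    zero-point, so that x ~= scale * (q - zero_point). Stored as int8,
    the tensor needs 4x less memory than float32."""
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the int8 representation."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale, zp = quantize_int8(weights)
recovered = dequantize(q, scale, zp)  # close to the original weights
```

Accuracy-aware schemes refine this basic idea by keeping accuracy-critical layers in higher precision, which is why the observed DSC drop remained negligible.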
The co-registration is performed using the Greedy framework 88, available via CaPTk [85-87], ITK-SNAP 89, and the FeTS tool. The brain extraction 90 is done using the BrainMaGe method 91, which is available at https://github.com/CBICA/BrainMaGe and via the Generally Nuanced Deep Learning Framework (GaNDLF) 92 at https://github.com/CBICA/GaNDLF. To generate automated annotations, DeepMedic's 93 integration with CaPTk was used, and we used the model weights and inference mechanisms provided by the other algorithm developers (DeepScan 94 and nnU-Net 70, https://github.com/MIC-DKFZ/nnunet). DeepMedic's original implementation is available at https://github.com/deepmedic/deepmedic, whereas the one we used in this study can be found at https://github.com/CBICA/deepmedic. The fusion of the labels was done using the Label Fusion tool 95, available at https://github.com/FETS-AI/LabelFusion. The data-loading pipeline and network architecture were developed using the GaNDLF framework 92 with PyTorch 96. The data augmentation was done via GaNDLF by leveraging TorchIO 97. The FL backend developed for this project has been open-sourced as a separate software library, to encourage further research on FL 98, and is available at https://github.com/intel/openfl. The optimization of the consensus-model inference workload was performed via OpenVINO 99 (https://github.com/openvinotoolkit/openvino/tree/2021.4.1), which is an open-source toolkit enabling acceleration of neural network models through various optimization techniques. The optimizations were evaluated on an Intel Core® i7-1185G7E CPU @ 2.80GHz with 2 × 8 GB DDR4 3200MHz memory, on Ubuntu 18.04.6 OS with Linux kernel version 5.9.0-050900-generic.

Results
At the time of initialization of the federation, the public initial model was evaluated against the local validation data of all sites, resulting in an average (across all cases of all sites) DSC per sub-compartment of: DSC_ET = 0.63, DSC_TC = 0.62, DSC_WT = 0.75. To summarize the model performance with a single collective score, we then calculate the average DSC (across all 3 tumor sub-compartments per case, and then across all cases of all sites) as equal to 0.66.
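The two-stage averaging described above (a per-case mean across the three sub-compartments, then a mean across all cases) can be sketched as follows, with hypothetical per-case scores:

```python
def collective_dsc(per_case_scores):
    """Average the 3 sub-compartment DSCs (ET, TC, WT) within each case,
    then average those per-case means across all cases."""
    case_means = [sum(scores) / len(scores) for scores in per_case_scores]
    return sum(case_means) / len(case_means)

# Two hypothetical cases with (ET, TC, WT) DSC scores:
cases = [(0.60, 0.60, 0.72), (0.66, 0.64, 0.78)]
score = collective_dsc(cases)
```

Averaging within each case first prevents cases with many high-scoring sub-compartments from dominating the collective score.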
To further evaluate the potential generalizability improvements of the final consensus model on unseen data, we compared it with the public initial model against the complete out-of-sample data, and noted significant performance improvements of 15% (p_ET < 1 × 10⁻⁵), 27% (p_TC < 1 × 10⁻¹⁶), and 16% (p_WT < 1 × 10⁻⁷), for ET, TC, and WT, respectively (Fig. 1.d).
Notably, the only difference between the public initial model and the final consensus model was that the latter gained knowledge during training from the increased datasets contributed by the complete set of collaborators. These findings reinforce the importance of using large and diverse data to train generalizable models that ultimately drive patient care.

Discussion
In this study, we have described the largest real-world FL effort to date, utilizing data of 6,314 glioblastoma patients from 71 geographically unique sites spread across 6 continents, to develop an accurate and generalizable ML model for detecting glioblastoma sub-compartment boundaries. Notably, the extensive global footprint of the collaborating sites in this study also yields the largest dataset ever reported in the literature assessing this rare disease. It is the use of FL that successfully enabled i) access to such an unprecedented dataset of the most common and fatal adult brain tumor, and ii) meaningful ML training to ensure the generalizability of models across out-of-sample data. Since glioblastoma boundary detection is critical for treatment planning and the requisite first step for further quantitative analyses, the models generated during this study have the potential to make a far-reaching clinical impact.
The large and diverse data that FL enabled led to the final consensus model garnering significant performance improvements over the public initial model, against both the collaborators' local validation data and the complete out-of-sample data. The improved result is a clear indication of the benefit that can be afforded through access to more data. In comparison with the limited existing real-world FL studies 30,51, our use case is larger in scale and substantially more complex, since it 1) addresses a multi-parametric multi-class problem, with reference standards that require expert collaborating clinicians to follow an involved manual annotation protocol, rather than simply recording a categorical entry from medical records, and 2) requires the data to be preprocessed in a harmonized manner to account for differences in MRI acquisition.
We have demonstrated the utility of an FL approach to develop an accurate and generalizable ML model for detecting glioblastoma sub-compartment boundaries, a finding of particular relevance for neurosurgical and radiotherapy planning in patients with this disease. This study is meant to serve as a blueprint for future FL studies that result in clinically deployable ML models. Building on this study, a continuous FL consortium would enable downstream quantitative analyses with implications for both routine practice and clinical trials, and, most importantly, increase access to high-quality precision care worldwide. Furthermore, the lessons learned from this study with such a global footprint are invaluable and can be applied to a broad array of clinical scenarios, with the potential for great impact on rare diseases and underrepresented populations.

Fig. 1: Representation of the study's global scale, diversity, and complexity. a, The map of all sites involved in the development of the FL consensus model. b, Example of a glioblastoma mpMRI scan with corresponding reference annotations of the tumor sub-compartments. c-d, Comparative performance evaluation of the final consensus model against the public initial model, on the collaborators' local validation data (in c) and on the complete out-of-sample data (in d), per tumor sub-compartment. Note that the box and whiskers inside each violin plot represent the true min and max values. The top and bottom of each "box" depict the 3rd and 1st quartile of each measure. The white line and the red ×, within each box, indicate the median and mean values, respectively. The fact that these are not necessarily at the centre of each box indicates the skewness of the distribution over different cases. The "whiskers" drawn above and below each box depict the extremal observations still within 1.5 times the interquartile range above the 3rd or below the 1st quartile. e, Number of contributed cases per collaborating site.