Introduction

The recent advances in deep learning have sparked the widespread adoption of language models (LMs), with prominent examples including BERT1 and GPT2, in the field of natural language processing (NLP). These LMs are trained on massive amounts of public text data, comprising billions of words, and have emerged as the dominant technology for various linguistic tasks, including text classification3,4, text generation5,6, information extraction7,8,9, and question answering10,11. The success of LMs can be largely attributed to their ability to leverage large volumes of training data. However, in privacy-sensitive domains like medicine, data are often naturally distributed, making it difficult to construct large corpora to train LMs. To tackle this challenge, the most common approach thus far has been to fine-tune pre-trained LMs for downstream tasks using limited annotated data12,13. Nevertheless, pre-trained LMs are typically trained on text collected from the general domain, which exhibits divergent patterns from text in the biomedical domain, resulting in a phenomenon known as domain shift. Compared to general text, biomedical texts can be highly specialized, containing domain-specific terminologies and abbreviations14. For example, medical records and drug descriptions often include specific terms that may not be present in general language corpora, and these terms often vary among clinical institutes. In addition, biomedical data lack uniformity and standardization across sources, making it challenging to develop NLP models that can effectively handle different formats and structures. Electronic Health Records (EHRs) from different healthcare institutions, for instance, can have varying templates and coding systems15. Consequently, direct transfer learning from LMs pre-trained on the general domain usually suffers a drop in performance and generalizability when applied to the medical domain, as has also been demonstrated in the literature16. Therefore, developing LMs that are specifically designed for the medical domain, using large volumes of domain-specific training data, is essential.

Another vein of research explores pre-training LMs on biomedical data, e.g., BlueBERT12 and PubMedBERT17. These LMs were either pre-trained on mixed-domain data (first pre-trained on general text and then further pre-trained on biomedical text) or directly pre-trained on domain-specific public medical datasets, e.g., PubMed literature and the Medical Information Mart for Intensive Care (MIMIC-III)18, and have shown improved performance compared to classical methods such as conditional random fields (CRFs)19 and recurrent neural networks (RNNs) (e.g., long short-term memory (LSTM)20) in many biomedical text mining tasks8,9,12,16,21. Nonetheless, it is important to highlight that the efficacy of these pre-trained medical LMs heavily relies on the availability of large volumes of task-relevant public data, which may not always be readily accessible.

All the approaches mentioned above represent the classical centralized learning regime, which involves aggregating data from distributed sites and training a model in a single environment. However, this approach poses significant challenges in medicine, where data privacy is crucial and data access is restricted due to regulatory concerns. In practice, therefore, one can often only train with local datasets (single-client training). The drawback is that when the local dataset is small, the resulting model often performs poorly on external datasets, i.e., it generalizes poorly. To take advantage of massively distributed data and improve model generalizability, federated learning (FL) was introduced in 201622 as a novel learning scheme that enables training in a decentralized environment, and it has achieved many successes in critical domains with data privacy restrictions23,24,25. In an FL training loop, clients jointly train a shared global model by sharing model weights or gradients while keeping their data stored locally. By bringing the model to the data, FL preserves data privacy while achieving levels of performance competitive with a model trained on pooled data. While a growing body of research shows great promise for applying FL in general NLP26,27, applications of FL in biomedical NLP remain under-explored. Existing works on FL for biomedical NLP either focus on optimizing a single task28,29 or on improving communication efficiency28. The current literature lacks a comprehensive comparison of FL on varied biomedical NLP tasks with real-world perturbations. To close this gap, we conducted an in-depth study of two representative NLP tasks, i.e., named entity recognition (NER) and relation extraction (RE), to evaluate the feasibility of adopting FL (e.g., FedAvg30 and FedProx31) with LMs (e.g., Transformer-based models) in biomedical NLP. Our study provides an in-depth investigation of FL in biomedical NLP by examining several FL variants under multiple practical learning scenarios, including varied federation scales, different model architectures, data heterogeneities, and comparison with large language models (LLMs) on multiple benchmark datasets. Our major findings include:

  1. When data were independent and identically distributed (IID), models trained using FL, especially pre-trained BERT-based models, performed comparably to centralized learning and provided a significant boost over single-client learning. Even when data were non-IID, the gap could be closed by using alternative FL algorithms.

  2. Larger models exhibited better resistance to changes in FL scale. With a fixed amount of data, the performance of FL models overall degraded as the number of clients increased; however, the deterioration diminished when combined with larger pre-trained models such as BERT-based models and GPT-2.

  3. FL significantly outperformed pre-trained LLMs, e.g., GPT-4, PaLM 2, and Gemini Pro, used with few-shot prompting.

Results

In this section, we present our main analysis of FL, focusing on several practical facets: (1) learning tasks, (2) scalability, (3) data distribution, (4) model architectures and sizes, and (5) comparative assessments with LLMs.

FedAvg, single-client, and centralized learning for NER and RE tasks

Table 1 offers a summary of the performance evaluations for FedAvg, single-client learning, and centralized learning on five NER datasets, while Table 2 presents the results on three RE datasets. Our results on both tasks consistently demonstrate that FedAvg outperformed single-client learning. Notably, in cases involving large data volumes, such as BC4CHEMD and 2018 n2c2, FedAvg managed to attain performance levels on par with centralized learning, especially when combined with BERT-based pre-trained models.

Table 1 Comparison of FedAvg with centralized learning and single-client learning on 5 NER tasks measured by F1-score with lenient (upper) and strict (lower, inside parenthesis) matching scheme
Table 2 Comparison of FedAvg with centralized learning and single-client learning on 3 RE tasks measured by macro F1-score

Influence of FL scale on the performance of LMs

In clinical applications, there are two distinct learning paradigms. The first involves small-scale client cohorts, each equipped with substantial data resources, as often seen in collaborations within hospital networks. The second encompasses widely distributed clients holding more limited data, as often found in collaborations across clinical facilities or on mobile platforms. We investigated the performance of FL under the two paradigms by varying the client group size while maintaining a fixed total training data volume. The results, summarized in Fig. 1, reveal a consistent trend: larger models, such as those backed by BERT and GPT-2 architectures, exhibited strong resilience to fluctuations in federation scale. In contrast, the lightweight BiLSTM-CRF model was susceptible to changes of scale, with performance deteriorating rapidly as the number of participating clients increased.

Fig. 1: Performance of FL models with varying numbers of clients.

We tested models on 2018 n2c2 (NER) and evaluated them using the F1 score with the lenient matching scheme.

Comparison of FedAvg and FedProx with data heterogeneity

Biomedical texts often exhibit high specialization because different hospitals follow distinct protocols when generating medical records, resulting in great variation, known as sublanguage differences. FL practitioners should therefore account for such data heterogeneity when implementing FL in healthcare systems. We simulated a realistic non-IID scenario by treating BC2GM and JNLPBA as two clients and jointly performing FL. We considered two FL algorithms, FedAvg and FedProx, both widely deployed in practice. For comparison, we also studied a simulated IID setting using the 2018 n2c2 dataset with random splitting. A detailed analysis of the non-IID/IID distributions can be found in Supplementary Fig. 1 and Supplementary Table 3. As shown in Table 3, the performance of FedProx was sensitive to the choice of the hyper-parameter μ; notably, a smaller μ consistently resulted in improved performance. When μ was carefully selected, FedProx outperformed FedAvg when the data were non-IID distributed (lenient F1 score of 0.994 vs. 0.934 and strict F1 score of 0.901 vs. 0.884). However, the difference between the two algorithms was mostly indistinguishable when the data were IID distributed (lenient F1 score of 0.880 vs. 0.879 and strict F1 score of 0.820 vs. 0.818).

Table 3 Comparison of FedAvg and FedProx under IID and non-IID data distributions using BioBERT

Impact of the LM size on the performance of different training schemes

We investigated the impact of model size on the performance of FL. We compared 6 models of varying sizes, with the smallest comprising 20 M parameters and the largest 334 M parameters. We picked the BC2GM dataset for illustration and anticipate that similar trends hold for other datasets as well. As shown in Fig. 2, in most cases, larger models (represented by larger circles) exhibited better test performance than their smaller counterparts. For example, BlueBERT demonstrated uniform performance improvements over BiLSTM-CRF and GPT-2. Among all the models, BioBERT emerged as the top performer, whereas GPT-2 gave the worst performance.

Fig. 2: Comparison of model performance with different sizes, measured by the number of trainable parameters on the BC2GM dataset.

The size of each circle indicates the number of model parameters, while the color indicates the learning method. The x-axis represents the mean test F1-score with lenient matching (results are adapted from Table 1).

Comparison between FL and LLMs

In light of the well-demonstrated performance of LLMs on various linguistic tasks, we explored the performance gap between LLMs and smaller LMs trained using FL. Notably, it is usually impractical to fine-tune LLMs due to the formidable computational costs and protracted training time. Therefore, we utilized in-context learning, specifically few-shot prompting, which enables direct inference from pre-trained LLMs, and compared the results with models trained using FL. We followed the experimental protocol outlined in a recent study32 and evaluated all the models on two NER datasets (2018 n2c2 and NCBI-disease) and two RE datasets (2018 n2c2 and GAD). The results, summarized in Fig. 3, show that (1) a longer prompt with more input examples (e.g., 10-shot and 20-shot) often enhances the performance of LLMs; and (2) FL, whether implemented with a BERT-based model (BlueBERT) or a GPT-based model (GPT-2), consistently outperformed LLMs by a large margin.

Fig. 3: Comparison of LLMs using few-shot prompting and small LMs (BlueBERT and GPT-2) trained with FL on NER (upper) and RE (lower) tasks evaluated based on the F1-score (lenient matching for NER tasks).

A complete evaluation, including the strict matching and running time analysis, can be found in Supplementary Table 1 and Supplementary Table 2.

Discussion

In this study, we revisited FL for biomedical NLP and studied two established tasks (NER and RE) across 7 benchmark datasets. We examined 6 LMs with parameter counts ranging from 20 M (BiLSTM-CRF) to 334 M (transformer-based models) and compared their performance under centralized learning, single-client learning, and federated learning. On almost all tasks, federated learning achieved significant improvements over single-client learning and oftentimes performed comparably to centralized learning without data sharing, demonstrating that it is an effective approach for privacy-preserving learning with distributed data. The only exception is in Table 2, where the best single-client model (taking the standard deviation into account) outperformed FedAvg when using BERT and Bio_ClinicalBERT on the EUADR dataset (though the average performance still lagged behind). We believe this is due to the lack of training data: as each client owned only 28 training sentences, the data distribution, although IID, was highly under-represented, making it hard for FedAvg to find the globally optimal solution. Surprisingly, FL achieved reasonably good performance even when the training data were limited (284 total training sentences across all clients), confirming that transfer learning from either the general text domain (e.g., BERT and GPT-2) or the biomedical text domain (e.g., BlueBERT, BioBERT, Bio_ClinicalBERT) is beneficial to downstream biomedical NLP tasks, and that pre-training on medical data often gives a further boost. Another interesting finding is that GPT-2 consistently gave inferior results compared to BERT-based models. We believe this is because GPT-2 is pre-trained on a text generation objective that only encodes left-to-right attention for next-word prediction; this unidirectional nature prevents it from learning global context, which limits its ability to capture dependencies between words in a sentence.

In the sensitivity analysis of FL with respect to the number of clients, we found a monotonic trend: with a fixed amount of training data, FL with fewer clients tends to perform better. For example, the classical BiLSTM-CRF model (20 M), with a fixed total amount of training data, performs better with few clients, but performance deteriorates as more clients join. This is likely due to the increased learning complexity, as FL models need to learn the inter-correlation of data across clients. Interestingly, the transformer-based models (≥108 M), which are more than five times larger than BiLSTM-CRF, are more resilient to changes in federation scale, possibly owing to their increased learning capacity.

We analyzed the performance of FedProx in a realistic non-IID scenario and compared it with FedAvg to study the behavior of different FL algorithms under data heterogeneity. Although FedProx achieved slightly better performance than FedAvg when the data were non-IID distributed, it is very sensitive to the hyper-parameter μ, which balances the local objective function and the proximal term. Specifically, when the data were IID and μ was set to a large value (e.g., μ = 1), FedProx yielded a 2.4% lower lenient F1-score than FedAvg; when the data were non-IID, this performance gap widened to 5.4%. It is also noteworthy that when μ is set to 0 and all clients are forced to perform an equal number of local updates, FedProx reverts to FedAvg.

We also investigated the impact of model size on the performance of FL. We observed that as the model size increased, the performance gap between centralized models and FL models narrowed. Interestingly, BioBERT, which shares the same model architecture as BERT and Bio_ClinicalBERT and is similar in size, performs comparably to larger models (such as BlueBERT), highlighting the importance of pre-training for model performance. Overall, the size of a model is indicative of its learning capacity: larger models tend to perform better than smaller ones. However, larger models require longer training time and more computational resources, resulting in a natural trade-off between accuracy and efficiency.

Compared with LLMs, FL models were the clear winner in terms of prediction accuracy. We hypothesize that LLMs, being mostly pre-trained on general text, cannot guarantee performance when applied to biomedical text due to the domain disparity. As LLMs with few-shot prompting only receive limited inputs from the target tasks, they are likely to perform worse than models trained using FL, which are built with sufficient training data. To close the gap, specialized LLMs pre-trained on medical text data33 or model fine-tuning34 can be used to further improve LLM performance. Another interesting observation is that with more input examples (e.g., 10-shot and 20-shot), LLMs often demonstrate better prediction performance, which is intuitive, as additional examples provide more task-specific knowledge.

While our results demonstrate the promise of FL for LMs, we acknowledge that our study has the following limitations: (1) most of our experiments, excluding the non-IID study, were conducted in a simulated environment with synthetic data splits, which may not perfectly align with the distribution patterns of real-world FL data; (2) we mostly focused on horizontal FL and did not extend to vertical FL35; and (3) we did not consider FL combined with privacy techniques such as differential privacy36 and homomorphic encryption37. To address these limitations and further advance our understanding of FL for LMs, our future work will focus on real-world implementations of FL and explore practical opportunities and challenges, such as vertical FL and FL combined with privacy techniques. We believe our study offers comprehensive insights into the potential of FL for LMs and can serve as a catalyst for future research toward more effective AI systems that leverage distributed clinical data in real-world scenarios.

Methods

NLP tasks and corpora

We compared FL with alternative training schemes on 8 biomedical NLP datasets, focusing on two NLP tasks: NER (5 corpora) and RE (3 corpora). NER and RE are two established information extraction tasks in biomedical NLP. Given an input sequence of tokens, the goal of NER is to identify and classify the named entities, such as diseases and genes, present in the sequence. RE is often the follow-up task, aiming to discover the relations between pairs of named entities. For example, a gene-disease relation (BRCA1-breast cancer) can be identified in the sentence: “Mutations of BRCA1 gene are associated with breast cancer”. For RE, we take the entity positions as given and formulate the problem as follows: given a sentence and the spans of two entities, determine the relationship between the two entities.
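To make the two task formulations concrete, the snippet below sketches how the example sentence above could be presented to an NER model and an RE model. It is purely illustrative: the BIO tags, label names, and character offsets are ours, not the annotation schemes of the actual corpora.

```python
# Hypothetical illustration of NER and RE input/output formats
# (label names such as B-GENE and "associated_with" are illustrative only).

# NER: token-level BIO tags over the sentence.
ner_tokens = ["Mutations", "of", "BRCA1", "gene", "are",
              "associated", "with", "breast", "cancer", "."]
ner_labels = ["O", "O", "B-GENE", "O", "O",
              "O", "O", "B-DISEASE", "I-DISEASE", "O"]

# RE: given the sentence and the character spans of two entities,
# predict the relation type between them.
re_example = {
    "sentence": "Mutations of BRCA1 gene are associated with breast cancer.",
    "entity_1": {"text": "BRCA1", "span": (13, 18)},
    "entity_2": {"text": "breast cancer", "span": (44, 57)},
    "label": "associated_with",   # gene-disease relation
}
```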

We selected the corpora using the following criteria: (1) Public availability. The corpora should be publicly available to ensure that the results are reproducible. (2) Popularity. The corpora should be used in other well-cited papers so that data quality is ensured. (3) Diversity. The corpora should cover as many real-world biomedical NLP tasks as possible. A summary of the selected datasets can be found in Table 4; we defer more detailed descriptions of each dataset to the Supplementary Methods.

Table 4 List of corpora and their statistics

Federated learning algorithms

FL represents a family of algorithms that aim to train models collaboratively in a distributed environment. Consider a scenario where there are K clients with distributed data \(D=\{{D}_{1},{D}_{2},...,{D}_{K}\}\), where \({D}_{i}={D}_{{X}_{i}\times {Y}_{i}}\), and \({X}_{i}\) and \({Y}_{i}\) are the input and output spaces, respectively. A typical FL setup aims to solve the optimization problem in Eq. (1)

$$\mathop{\min }\limits_{w}{\sum }_{i=1}^{K}{p}_{i}{F}_{i}\left(w\right)\,{\rm{where}}\,{F}_{i}\left(w\right)={\sum }_{j=1}^{\left|{D}_{i}\right|}{L}_{w}\left({X}_{j},{Y}_{j}\right),$$
(1)

where w denotes the weights of the model being learned, \({F}_{i}\) is the local objective of the ith client, and \({p}_{i}\) is the weight of the ith client such that \({p}_{i} \,>\, 0\) and \({\sum }_{i=1}^{K}{p}_{i}=1\). The weights are usually determined by the number of training samples each client holds; for example, \({p}_{i}\) equals \(\frac{1}{K}\) when all clients share the same amount of training data.
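As a concrete illustration of Eq. (1), the following minimal sketch (with made-up client sizes) computes the aggregation weights \({p}_{i}\) from the clients' data counts and forms the weighted global objective.

```python
# Minimal sketch of the FL objective in Eq. (1); client sizes are made up.
client_sizes = [100, 300, 600]          # |D_1|, |D_2|, |D_3|
n_total = sum(client_sizes)

# Aggregation weights p_i, proportional to each client's data volume;
# they reduce to 1/K when all clients hold the same amount of data.
p = [n_i / n_total for n_i in client_sizes]   # [0.1, 0.3, 0.6]

def global_objective(local_losses):
    """Weighted sum of the local objectives F_i(w), as in Eq. (1)."""
    return sum(p_i * F_i for p_i, F_i in zip(p, local_losses))
```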

In an FL game, there are two types of players: the server and the clients. The server orchestrates the whole FL process, including signaling the start and end of federated learning, synchronizing the local model updates, and dispatching the updated models. The clients are responsible for fetching models from the server, updating them with their local data, and sending the updated models back to the server.

Each round consists of four steps: (1) the clients use their own data to optimize the local objectives (local updates); (2) the clients upload the updated models or gradients to the server; (3) the server collects the local models and synchronizes the updates (model aggregation); and (4) the server dispatches the aggregated model back to the clients. While different FL algorithms may have specialized designs for the local updates or the model aggregation, they share the same training paradigm.

We considered the two most popular FL algorithms, Federated Averaging (FedAvg)30 and its variant FedProx31. FedAvg is the most basic and standard FL algorithm, using stochastic gradient descent (SGD) to progressively update the local models: each client locally takes a fixed number of gradient descent steps on its local model using its local training data, and the server then aggregates these local models by taking their weighted average as the new global model for the next round. However, in FedAvg, the number of local updates depends on the size of the local data, so when data sizes vary across clients, the numbers of local updates performed can differ significantly. FedProx was introduced to tackle this issue of heterogeneous local updates. By adding a proximal term to the objective of the local update, the impact of variable local updates is suppressed. More specifically, at iteration t, the local update seeks the solution that minimizes the objective shown in Eq. (2)

$$\mathop{\min }\limits_{w}\frac{1}{{n}_{k}}{\sum }_{i=1}^{{n}_{k}}{L}_{w}\left({X}_{i},{Y}_{i}\right)+\frac{\mu }{2}{\left|\left|w-{w}^{t}\right|\right|}^{2},$$
(2)

where \({w}^{t}\) denotes the weights of the network from iteration t. A comparison of FedAvg and FedProx can be found in Algorithm 1 and Algorithm 2.

Algorithm 1

Federated learning algorithms (FedAvg/FedProx)

Notation: \({X}_{i}\) indicates data from client i, K is the total number of clients, T is the maximum number of training rounds, n is the sum of \({n}_{1}\) to \({n}_{K}\), and \({p}_{i}\) is the weight of the ith client

Initialize server model weights w(1)

Initialize client model weights \({w}_{i}\,\forall\, i={1,2},\ldots ,K\)

For each round t = 1, 2, … T do

 Send server model weight \(w(t)\) to each client

 For each client \(k={1,2},\ldots ,K\) do

 Client k performs LocalUpdate \(({X}_{k},{Y}_{k},{w}_{k})\) ← Algorithm 2

 End for

\(w\left(t+1\right)=\mathop{\sum }\nolimits_{i=1}^{K}{p}_{i}{w}_{i}\) ← model aggregation

End for

Algorithm 2

Local model training using mini-batch stochastic gradient descent (LocalUpdate) (FedAvg/FedProx)

Notation: R is the local update round, B is the number of batches, \({f}_{{w}_{r}}\) is the neural network parameterized by \({w}_{r}\), \(\eta\) is the learning rate, \(\mu\) is the hyper-parameter in FedProx

For each round \(r=1,2,\ldots ,R\) do ← repeat until finding the approximate minimizer \(w\approx {\rm{argmin}}_{w}\,L({f}_{{w}_{r}}({X}_{b}),{Y}_{b})+\frac{\mu }{2}{{||}w-{w}_{k}\left(t\right){||}}^{2}\)

 Randomly shuffle \({X}_{k}\) and create B batches \((({X}_{1},{Y}_{1}),({X}_{2},{Y}_{2}),\ldots ,({X}_{B},{Y}_{B}))\)

 For each mini-batch \(b=1,2,\ldots ,B\) do

  \({L}_{{w}_{r}}=L({f}_{{w}_{r}}({X}_{b}),{Y}_{b})+\frac{\mu }{2}{{||}{w}_{r}-{w}_{k}\left(t\right){||}}^{2}\)

  \({w}_{r+1}={w}_{r}-\eta {\nabla L}_{{w}_{r}}({X}_{b},{Y}_{b})\)

 End for

End for
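For readers who prefer executable code, the following is a minimal PyTorch-style sketch of Algorithms 1 and 2. It is our own simplification for illustration, not the exact training code used in this study: each client runs mini-batch SGD on its local loss plus the FedProx proximal term of Eq. (2), with μ = 0 recovering FedAvg, and the server aggregates the returned weights using the data-proportional weights \({p}_{i}\).

```python
import copy
import torch

def local_update(model, global_weights, loader, loss_fn, lr=0.001, mu=0.0, local_rounds=1):
    """Algorithm 2 (LocalUpdate): mini-batch SGD on the local loss plus the
    FedProx proximal term; mu = 0 recovers the FedAvg local update."""
    model.load_state_dict(global_weights)
    global_params = [p.detach().clone() for p in model.parameters()]  # w(t)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(local_rounds):                    # R local rounds
        for x, y in loader:                          # B mini-batches
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            # Proximal term (mu / 2) * ||w - w(t)||^2 from Eq. (2)
            prox = sum((p - g).pow(2).sum()
                       for p, g in zip(model.parameters(), global_params))
            (loss + 0.5 * mu * prox).backward()
            opt.step()
    return copy.deepcopy(model.state_dict())

def aggregate(client_states, client_sizes):
    """Algorithm 1 (model aggregation): weighted average with p_i = n_i / n."""
    n = float(sum(client_sizes))
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:
        new_state[key] = sum((n_i / n) * state[key].float()
                             for state, n_i in zip(client_states, client_sizes))
    return new_state
```

A full communication round then amounts to sending the global state dict to every client, calling local_update on each, and calling aggregate on the returned states.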

Study design

As shown in Fig. 4, we explored three learning methods: (1) federated learning, (2) centralized learning, and (3) single-client learning. To simulate the conventional learning scenarios, we varied the data scale and conducted the following experiments: centralizing all client data to train a single model (centralized learning) and training separate models on each client’s local data (single-client learning).

Fig. 4: A comparison of centralized learning, federated learning, and single-client learning.

The arrows indicate the data flow through the model training process.

Models

To better understand the effect of LMs on FL, we chose models with parameter counts ranging from 20 M to 334 M, including Bidirectional Encoder Representations from Transformers (BERT)1 and the Generative Pre-trained Transformer (GPT)38, as well as a classical RNN-based model, BiLSTM-CRF39. BERT-based models utilize a transformer encoder and incorporate bi-directional information acquired through two unsupervised pre-training tasks. Different BERT models differ in their pre-training source data and model size, giving rise to many variants such as BlueBERT12, BioBERT8, and Bio_ClinicalBERT40. BiLSTM-CRF is the only model in our study that is not built upon transformers; it is a bi-directional model designed to handle long-term dependencies, was long a popular choice for NER, and uses LSTM as its backbone. We selected this model to investigate the effect of federated learning on models with smaller parameter counts. For LLMs, we selected GPT-4, PaLM 2 (Bison and Unicorn), and Gemini (Pro) for assessment, as all of them are publicly accessible for inference. A summary of the models can be found in Table 5, and details on each model can be found in the Supplementary Methods.

Table 5 List of LMs used for comparison

Training details

Data preprocessing

We adapted most of the datasets from the BioBERT paper with reasonable modifications, removing duplicate entries and splitting the data into non-overlapping train (80%), development (10%), and test (10%) sets. The maximum token limit was set at 512, with truncation: encoded sentences longer than 512 tokens were trimmed.
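A minimal sketch of this preprocessing is shown below; the field name "text" and the generic Hugging Face tokenizer checkpoint are our assumptions for illustration, and the actual scripts (which also carry the annotations through each split) may differ.

```python
import random
from transformers import AutoTokenizer

def preprocess(examples, seed=42):
    """Deduplicate, split 80/10/10, and truncate to 512 tokens (sketch)."""
    # Remove duplicate entries (here keyed on the raw sentence text).
    unique = list({ex["text"]: ex for ex in examples}.values())
    random.Random(seed).shuffle(unique)

    n = len(unique)
    train = unique[: int(0.8 * n)]
    dev = unique[int(0.8 * n): int(0.9 * n)]
    test = unique[int(0.9 * n):]

    # Truncate every sentence to the 512-token limit.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # generic example checkpoint
    encode = lambda split: [tok(ex["text"], truncation=True, max_length=512) for ex in split]
    return encode(train), encode(dev), encode(test)
```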

Federated learning simulation

We considered two different learning settings: learning from IID data and learning from non-IID data. For the first setting, we randomly split the data into k uniform folds. For most of our experiments, k was set to 10, while we also varied k from 2 to 10 to study the impact of federation size. For the second setting, we considered learning from heterogeneous data collected from different sources, representing the real-world scenario in which complex and entangled heterogeneities coexist. We picked BC2GM and JNLPBA as two independent clients; both target the same gene entity recognition task but were collected from different sources. To show that they are non-IID distributed, we conducted a data distribution analysis (i.e., calculating the distribution distance and plotting t-SNE on the embedded feature space), which can be found in the Supplementary Discussions.
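The IID simulation amounts to a uniform random partition of the training set into k client shards; a minimal sketch (ours) follows. In the non-IID setting, each public corpus is instead assigned wholesale to its own client.

```python
import random

def split_iid(train_examples, k=10, seed=0):
    """Uniformly partition the training data into k client shards (IID setting)."""
    examples = list(train_examples)
    random.Random(seed).shuffle(examples)
    return [examples[i::k] for i in range(k)]   # round-robin after shuffling

# Non-IID setting: each corpus acts as one client, e.g.
# clients = [bc2gm_train, jnlpba_train]
```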

LLMs with few-shot prompting

We followed a similar experimental protocol to the previous study32. Figure 5 shows an example of applying few-shot prompting in an LLM to solve an NER task. An RE task can be solved similarly by changing the task description and input-output pairs. We simulated 1-/5-/10-/20-shot prompting by varying the number of input examples randomly selected from the training dataset. For model evaluation, we randomly selected 200 samples from the test dataset and reported the prediction performance over the selected samples.

Fig. 5: An example of applying few-shot prompting in an LLM to solve an NER task.

We formulated the prompt to include a description of the task, a few examples of inputs (i.e., raw texts) and outputs (i.e., annotated texts), and a query text at the end.
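The sketch below shows how such a few-shot NER prompt can be assembled from a task description, k demonstrations, and a query text. The template wording and the demonstration pair are hypothetical; they are not the exact prompt used in this study or in ref. 32.

```python
def build_ner_prompt(examples, query, k=5):
    """Assemble a k-shot NER prompt: task description, k demonstrations, query."""
    header = ("Extract the disease entities from the input text and mark them "
              "with ### entity ### in the output.\n\n")
    shots = "".join(f"Input: {x}\nOutput: {y}\n\n" for x, y in examples[:k])
    return header + shots + f"Input: {query}\nOutput:"

# Made-up demonstration pair for illustration:
demo = [("The patient was diagnosed with type 2 diabetes.",
         "The patient was diagnosed with ### type 2 diabetes ###.")]
print(build_ner_prompt(demo, "He has a history of chronic kidney disease.", k=1))
```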

Training models

For models that require training, we optimized with Adam using an initial learning rate of 0.001 and momentum of 0.9. The learning rate was scheduled by linear_scheduler_with_warmup. All experiments were performed on a system equipped with an NVIDIA A100 GPU and an AMD EPYC 7763 64-core Processor.
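A minimal sketch of this optimizer configuration is given below. We interpret "momentum of 0.9" as Adam's first-moment coefficient β1 and linear_scheduler_with_warmup as the Hugging Face get_linear_schedule_with_warmup helper; both interpretations are our assumptions.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, num_training_steps, warmup_steps=0):
    # Adam with initial learning rate 0.001; beta1 = 0.9 is our reading of
    # "momentum of 0.9" in the text.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    # Learning rate warms up linearly, then decays linearly to zero.
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=num_training_steps
    )
    return optimizer, scheduler
```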

Reported evaluation

For NER, we reported performance (F1-score) at the macro average level with both strict and lenient matching criteria. Strict matching counts a true positive only when the entity boundary exactly matches the gold standard, while lenient matching counts a true positive when the predicted and gold entity boundaries overlap. For all tasks, we repeated the experiments three times and reported the mean and standard deviation to account for randomness.
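The distinction between the two matching schemes can be summarized in a few lines; the sketch below (ours, operating on character-offset spans and also requiring the entity type to agree) counts a predicted entity as a true positive on exact boundary match under the strict scheme and on any boundary overlap under the lenient scheme.

```python
def is_true_positive(pred, gold, scheme="strict"):
    """pred and gold are (start, end, entity_type) spans with character offsets."""
    p_start, p_end, p_type = pred
    g_start, g_end, g_type = gold
    if p_type != g_type:
        return False
    if scheme == "strict":
        # Boundaries must match the gold standard exactly.
        return (p_start, p_end) == (g_start, g_end)
    # Lenient: any overlap between predicted and gold boundaries counts.
    return p_start < g_end and g_start < p_end

# Example: prediction (44, 50, "Disease") vs. gold (44, 57, "Disease"):
# lenient -> True, strict -> False.
```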

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.