Background & Summary

While reliable resources for health information conveyed in a plain language format exist, such as the MedlinePlus website from the National Library of Medicine (NLM)1, these resources do not provide all the necessary information for every health-related situation, nor can they keep pace with the rapidly changing state of knowledge arising from novel scientific investigations or global events like pandemics. In addition, the language used in other health-related articles can be too difficult for patients and the general public to comprehend2, which has a major impact on health outcomes3. While work in simplifying text exists, the unique language of biomedical text warrants a distinct subtask similar to machine translation, termed adaptation4. Adapting natural language involves creating a simplified version that maintains the most important details from a complex source. Adaptations are a common tool teachers use to improve comprehension of content for English language learners5.

A standard internet search will return multiple scientific articles that correspond to a patient’s query; however, without extensive clinical and/or biological knowledge, the user may not be able to comprehend the scientific language and content6. Verified plain language summaries do exist for some health information, such as the plain language summaries that the medical organization Cochrane creates for its reviews7. However, manually creating summaries and adaptations for every article addressing every user’s query is not possible. Thus, automatic adaptation of material that responds to a user’s query is highly valuable, especially for patients without clinical knowledge.

Though plain language thesauri and other knowledge bases have enabled rule-based systems that replace difficult terms with more common ones, human editing is needed to account for grammar, context, and ambiguity8. Deep Learning may offer a solution for fully automated adaptation. Advances in architectures, hardware, and available data have led neural methods to achieve state-of-the-art results in many linguistic tasks, including Machine Translation9 and Text Simplification10. Neural methods, however, require large numbers of training examples, as well as benchmark datasets to allow iterative progress11.

Parallel datasets for Text Simplification have been assembled by searching for semantically similar sentences across comparable document pairs, for example articles on the same subject in both Wikipedia and Simple English Wikipedia (or Vikidia, an encyclopedia for children in several languages)12,13,14,15. Since Wikipedia contains some articles on biomedical topics, it has been proposed to extract subsets of these datasets for use in this domain16,17,18,19. However, since these sentence pairs exist in different contexts, they are often not semantically identical, having undergone sentence-level operations like splitting or merging. Sentence pairs pulled out of context may also use anaphora on one side of a pair but not the other. This can confuse models during training and create impossible replacement expectations during testing. Further, Simple English Wikipedia often still contains complex medical terms on the simple side16,20,21. Parallel sentences have also been mined from dedicated biomedical sources. Cao et al. have expert annotators pinpoint highly similar passages, usually consisting of one or two sentences each, from Merck Manuals, an online website containing numerous articles on medical and health topics created for both professional and general public groups22. In addition, Pattisapu et al. have expert annotators identify highly similar pairs from scientific articles and corresponding health blogs describing them23. Though human filtering makes the pairs in both these datasets much closer to being semantically identical, at fewer than 1,000 pairs each, they are too small for training and even less ideal for evaluation24. Sakakini et al. manually translate a somewhat larger set (4,554) of instructions for patients from clinical notes25. However, this corpus covers a very specific case within the clinical domain, which itself constitutes a separate sublanguage from biomedical literature26.

Since recent models can handle larger paragraphs, comparable corpora have also been suggested as training or benchmark datasets for adapting biomedical text. These corpora consist of pairs of paragraphs or documents that are on the same topic and make roughly the same points, but are not sentence-aligned. Devaraj et al. present a paragraph-level corpus derived from Cochrane review abstracts and their Plain Language Summaries, using heuristics to combine subsections with similar content across the pairs. However, these heuristics do not guarantee identical content27. This dataset is also not sentence-aligned, which limits the architectures that can take advantage of it and restricts documents to those with no more than 1024 tokens. Other datasets include comparable corpora or are created at the paragraph level and omit relevant details from the original article27. To the best of our knowledge, no datasets provide manual, sentence-level adaptations of scientific abstracts28. Thus, there is still a need for a high-quality, sentence-level gold standard dataset for the adaptation of general biomedical text.

To address this need, we have developed the Plain Language Adaptation of Biomedical Abstracts (PLABA) dataset. PLABA contains 750 abstracts from PubMed (10 on each of 75 topics) and expert-created adaptations at the sentence level. Annotators were chosen from the NLM and an external company and given abstracts within their respective expertise to adapt. Human adaptation allows us to ensure the parallel nature of the corpus down to sentence-level granularity, while still using the surrounding context of the entire document to guide each adaptation. We deliberately construct this dataset so it can serve as a gold standard on several levels:

  1. Document-level simplification. Documents are simplified in total, each by at least one annotator, who is instructed to carry over all content relevant for general public understanding of the professional document. This allows the corpus to be used as a gold standard for systems that operate at the document level.

  2. Sentence-level simplification. Unlike automatic alignments, these pairings are ensured to be parallel for the purpose of simplification. Semantically, they will differ only in (1) content removed from the professional register because the annotator deemed it unimportant for general public understanding, and (2) explanation or elaboration added to the general public register to aid understanding. Since annotators were instructed to keep content within sentence boundaries (or in split sentences), there are no issues with fragments of other thoughts spilling over from neighboring sentences on one side of the pair.

  3. Sentence-level operations and splitting. Though rare in translation between languages, sentence-level operations (e.g. merging, deletion, and splitting) are common in simplification29. Splitting is often used to simplify syntax and reduce sentence length. Occasionally sentences may be dropped from the general public register altogether (deletion). For consistency and simplicity of annotation, we do not allow merging, creating a one-to-many relationship at the sentence level.

The PLABA dataset should further enable the development of systems that automatically adapt relevant medical texts for patients without prior medical knowledge. In addition to releasing PLABA, we have evaluated state-of-the-art deep learning approaches on this dataset to set benchmarks for future researchers.

Methods

The PLABA dataset includes 75 health-related questions asked by MedlinePlus users, 750 PubMed abstracts from relevant scientific articles, and corresponding human created adaptations of the abstracts. The questions in PLABA are among the most popular topics from MedlinePlus, ranging from topics like COVID-19 symptoms to genetic conditions like cystic fibrosis1.

To gather the PubMed abstracts in PLABA, we first filtered questions from MedlinePlus logs based on the frequency of general public queries. Then, a medical informatics expert verified each question’s relevance and the lack of accessible resources to answer it, choosing 75 questions in total. For each question, the expert coded its focus (COVID-19, cystic fibrosis, compression devices, etc.) and question type (general information, treatment, prognosis, etc.) to use as keywords in a PubMed search30. The expert then selected 10 abstracts from the PubMed retrieval results that appropriately addressed the topic of the question, as seen in Fig. 1.

Fig. 1

Overview representing how questions and PubMed abstracts for the dataset were searched and chosen for annotators to adapt. PMID refers to the PubMed ID of the article from which the example originates. SID refers to the sentence ID, i.e., the number of the example sentence in the source abstract.

To create the adaptations for each abstract in PLABA, medical informatics experts worked with source abstracts separated into individual sentences across all 75 questions. Adaptation guidelines allowed annotators to split long source sentences and ignore source sentences that were not relevant to the general public. Each source sentence corresponds to no, one, or multiple sentences in the adaptation. Creating these adaptations involved syntactic, lexical, and semantic simplifications, which were developed in the context of the entire abstract. Examples taken from the dataset can be seen in Table 1. Specific examples of adaptation guidelines are demonstrated in Fig. 2 and include:

  • Replacing arcane words like “orthosis” with common synonyms like “brace”

  • Changing sentence structure from passive voice to active voice

  • Omitting or incorporating subheadings at the beginning of sentences (e.g., “Aim:”, “Purpose:”)

  • Splitting long, complex sentences into shorter, simpler sentences

  • Omitting confidence intervals and other statistical values

  • Carrying over understandable sentences from the source with no changes into the adaptation

  • Ignoring sentences that are not relevant to a patient’s understanding of the text

  • Resolving anaphora and pronouns with specific nouns

  • Explaining complex terms and abbreviations with explanatory clauses when first mentioned

Table 1 Examples of questions, abstracts, and adaptations in PLABA.
Fig. 2

Example of the guidelines set for annotators. PMID refers to the PubMed ID of the article from which the example originates. SID refers to the sentence ID, i.e., the number of the example sentence in the source abstract. Target refers to the manual adaptation.

Data Records

We archived the dataset with Open Science Framework (OSF) at https://osf.io/rnpmf/31. The dataset is saved in JSON format and organized, or “keyed”, by question ID. Each question ID maps to a nested JSON object containing the question itself, a single-letter key denoting whether the question is clinical or biological, and the abstracts and corresponding human adaptations grouped by the PubMed ID (PMID) of the abstract. Table 2 shows statistics of the abstracts and adaptations. An example of the data format for one record can be found in the README file in the OSF archive.
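For orientation, the following is a minimal sketch of loading and traversing the file in Python. The key names used below for the question text, the question type, and the per-PMID abstract and adaptation entries are illustrative assumptions based on the description above; the README in the OSF archive documents the exact schema.

```python
import json

# Minimal sketch of reading PLABA. Key names below are illustrative assumptions;
# consult the README in the OSF archive for the exact schema.
with open("plaba.json", encoding="utf-8") as f:
    data = json.load(f)

for question_id, record in data.items():
    question_text = record.get("question")   # the MedlinePlus consumer question
    question_type = record.get("type")       # single letter: clinical vs. biological
    for pmid, pair in record.get("abstracts", {}).items():
        source_sentences = pair.get("abstract", [])    # source abstract, sentence by sentence
        adaptations = pair.get("adaptations", [])      # one or more human adaptations
        print(question_id, pmid, len(source_sentences), len(adaptations))
```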

Table 2 Average number of words and sentences per data type.

Technical Validation

We measured the level of complexity of the texts, the ability to train tools on the data, and how well the main points are preserved in automatic adaptations produced by models trained on our data. We first introduce the metrics we used to measure text complexity, followed by the metrics used to measure text similarity and inter-annotator agreement between manually created adaptations. We use the same text similarity metrics to also compare automatically created adaptations to both the source abstracts and manually created adaptations.

Evaluation metrics

To measure text readability and compare the abstracts and manually created adaptations, we use the Flesch-Kincaid Grade Level (FKGL) test32. FKGL uses the average number of syllables per word and the average number of words per sentence to calculate the score. A higher FKGL score for a text indicates a higher reading comprehension level needed to understand the text.
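For reference, the standard Flesch-Kincaid grade-level formula32 combines these two averages as

$$\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59.$$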

In addition, we use BLEU33, ROUGE34, SARI4,35, and BERTScore36, commonly used text similarity, semantic similarity, and simplification metrics, to measure inter-annotator agreement, compare abstracts to manually created adaptations, and evaluate the automatically created adaptations. BLEU and ROUGE look at spans of contiguous words (referred to as n-grams in Natural Language Processing or NLP) to evaluate a candidate adaptation against a reference adaptation. For instance, BLEU-4 measures how many of the contiguous sequences from one to four words in length in the candidate adaptation appear in the reference adaptation. BLEU is a measure of precision and penalizes candidates for adding incorrect n-grams, whereas ROUGE is a measure of recall and penalizes candidate adaptations for missing n-grams. Similarly, BERTScore looks at subwords to evaluate a candidate sentence against a reference sentence, comparing each candidate subword against every reference subword using contextual word embeddings. While BERTScore gives values of precision, recall, and F1 (the harmonic mean of precision and recall), we report only F1. Since BLEU, ROUGE, and BERTScore are not specifically designed for simplification, we additionally use SARI, which incorporates the source sentence in order to weight the various operations involved in simplification. While n-grams are still used, SARI balances (1) addition operations, in which n-grams of the candidate adaptation are shared with the reference adaptation but not the source, (2) deletion operations, in which n-grams appear in the source but neither the reference nor candidate, and (3) keep operations, in which n-grams are shared by all three. We report BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L (which measures the longest shared subsequence between a candidate and reference), BERTScore-F1, and SARI. All metrics can account for multiple possible reference adaptations.
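As a concrete illustration, the sketch below scores a toy candidate adaptation against its reference(s) with the Hugging Face evaluate package. The metric names and argument layouts follow the current metric cards but should be treated as assumptions and verified against each metric's documentation; the example sentences are invented.

```python
# Sketch: scoring one candidate adaptation against its reference(s).
# Metric names and argument layouts are assumptions to be checked against
# the `evaluate` metric cards; the sentences are invented examples.
import evaluate

sources    = ["The orthosis reduced gait asymmetry (p < 0.05)."]
candidates = ["The brace helped patients walk more evenly."]
references = [["The brace made walking more even for patients."]]  # one list of references per source

bleu  = evaluate.load("sacrebleu").compute(predictions=candidates, references=references)
rouge = evaluate.load("rouge").compute(predictions=candidates, references=references)
bert  = evaluate.load("bertscore").compute(predictions=candidates, references=references, lang="en")
sari  = evaluate.load("sari").compute(sources=sources, predictions=candidates, references=references)

print(bleu["score"], rouge["rougeL"], bert["f1"][0], sari["sari"])
```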

Text readability

To verify that the human generated adaptations simplify the source abstracts, we calculated the FKGL readability scores for both the adaptations and abstracts. FKGL scores were lower for the adaptations compared to the abstracts (p < 0.0001, Kendall’s tau). It is important to note that FKGL does not measure similarity or content preservation, so additional metrics like BLEU, ROUGE, and SARI are needed to address this concern.

Inter-annotator agreement

To measure inter-annotator agreement, we used adaptations from the most experienced annotator (who also helped define the guidelines) as reference adaptations. Agreement was measured for all abstracts that were adapted by both this reference annotator and another annotator. For the inter-annotator agreement metrics of ROUGE-1, ROUGE-2, ROUGE-L, BLEU-4, and BERTScore-F1, the values ranged from 0.4025–0.5801, 0.1267–0.2983, 0.2591–0.4689, 0.0680–0.2410, and 0.8305–0.9476, respectively. As the ROUGE-1 results show, the other annotators included, on average, about half of the words that the reference annotator used. As expected, ROUGE-2 values are lower on average, because as n-grams increase in length there is less similarity between adaptations, since individuals may use different combinations of words when creating new text.

We also calculated the similarity between the human adaptations and the source abstracts. Because BLEU-4 can match multiple references to a single candidate but not vice versa, we used the abstracts as candidates and the adaptations as references. The scores in Table 3 show that the adaptations contain over half of the same words, a third of the same bi-grams, and a large portion of the same subwords as the source abstracts.

Table 3 ROUGE-1, ROUGE-2, ROUGE-L, BLEU-4, and BERTScore-F1 using human adaptations as references and abstracts as candidates.

While ROUGE and BLEU are metrics for text similarity and BERTScore measures semantic similarity, they do not necessarily measure correctness. Even if a pair of adaptations has low ROUGE, BLEU, or BERTScore values, both could be accurate restatements of the source abstract, as seen in Fig. 3. Although the BLEU-4 score between them is low, both adaptations relevantly describe the topic in response to the example question. The differences between the adaptations can be attributed to synonyms and differences in explanatory content. While BLEU and ROUGE are useful for measuring lexical similarity, quantifying differences between adaptations like these is more nuanced. To address this issue, researchers are actively developing new metrics37.

Fig. 3

Example of the low BLEU-4 score between human adaptations from two different annotators, created from the same source abstract and answering the same question. PMID refers to the PubMed ID of the article from which the example originates. SID refers to the sentence ID, i.e., the number of the example sentence in the source abstract. Colored text in an adaptation represents parts of the adaptation that strongly differ from the other adaptation.

Experimental benchmarking

To benchmark the PLABA dataset and show its use in evaluating automatically generated adaptations, we used a variety of state-of-the-art deep learning models, described below:

Text-to-text transfer transformer (T5)

T538 is a transformer-based39 encoder-decoder model with a bidirectional encoder setup similar to BERT40 and an autoregressive decoder that is similar to the encoder except with a standard attention mechanism. Instead of training the model on a single task, T5 is pre-trained on a vast amount of data and on many unsupervised and supervised objectives, including token and span masking, classification, reading comprehension, translation, and summarization. The common feature of every objective is that the task can be treated as a language-generation task, in which the model learns to generate the proper textual output in response to the textual prompt included in the input sequence. As with other models, pre-training has been shown to achieve state-of-the-art results on many NLP tasks37,38,41. When the T5 model is fine-tuned on a specific dataset for a specific task, the task’s objective (e.g., translate from English to French, summarize, etc.) is prepended with a colon to the input text as a prompt to guide the T5 model during training and testing. In our experiments, we use the T5-Base model with the prompt “summarize:”, since it is the closest pre-training prompt to the task of plain language adaptation. We also report the performance of a T5 model that is not fine-tuned on our training data (T5-No-Fine-Tune) and compare it to a T5 model fine-tuned on PLABA, to demonstrate the importance of training models on our dataset given recent developments in out-of-the-box or zero-shot settings42,43.
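The following is a minimal sketch of this prompt-based generation step with T5-Base from the Hugging Face Hub: the “summarize:” prompt is prepended to a source abstract and a plain language adaptation is decoded. The generation settings (beam size, length limits) are illustrative assumptions rather than the exact values used in our experiments.

```python
# Sketch of prompt-based generation with T5-Base; generation settings are
# illustrative assumptions, not the exact experimental configuration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

abstract = "BACKGROUND: ..."  # a PubMed abstract from PLABA
inputs = tokenizer("summarize: " + abstract, return_tensors="pt",
                   truncation=True, max_length=512)
outputs = model.generate(**inputs, num_beams=4, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```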

Pre-training with extracted gap-sentences for abstractive summarization sequence-to-sequence (PEGASUS)

PEGASUS44 is another transformer-based encoder-decoder model; however, unlike T5, PEGASUS is pre-trained on a unique self-supervised objective. With this objective, entire sentences are masked out of a document and used as the target output sequence, with the remaining sentences of the document serving as the input. In other words, PEGASUS is designed for abstractive summarization and similar tasks, achieving human-level performance on multiple datasets. In our experiments, we use the PEGASUS-Large model.

Bidirectional autoregressive transformer (BART)

BART45 is another transformer-based encoder-decoder that is pre-trained with a different objective. Instead of training the model directly on data with a text-to-text objective or summarization-specific objective, BART was pre-trained on tasks such as token deletion and masking, text-infilling, and sentence permutation. These tasks were developed to improve the model’s ability to understand the content of text before summarizing or translating it. After this pre-training, BART can be fine-tuned for downstream tasks of summarization or translation with a more specific dataset to output higher quality text. These datasets include the CNN Daily Mail46 dataset, a large news article dataset designed for summarization tasks. In our experiments, we use the BART-Base model and BART-Large model fine-tuned on the CNN Daily Mail dataset (BART-Large-CNN).
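For completeness, the sketch below loads these checkpoints from the Hugging Face Hub through the summarization pipeline. The Hub identifiers shown are the standard public names and are assumed to correspond to the PEGASUS-Large, BART-Base, and BART-Large-CNN models described above.

```python
# Sketch of loading the benchmarked checkpoints; the Hub identifiers are
# assumed to correspond to the models described in the text.
from transformers import pipeline

abstract = "BACKGROUND: ..."  # a PubMed abstract from PLABA
for checkpoint in ["google/pegasus-large", "facebook/bart-base", "facebook/bart-large-cnn"]:
    summarizer = pipeline("summarization", model=checkpoint)
    print(checkpoint, summarizer(abstract, truncation=True)[0]["summary_text"])
```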

T Zero plus plus (T0PP)

T0PP47 is a variation of the original T5 encoder-decoder model created for zero-shot performance, that is, out-of-the-box performance on certain tasks and datasets without prior fine-tuning or training. To develop this zero-shot model, T0PP was trained on one subset of tasks (e.g., sentiment analysis, question answering) and evaluated on a different subset of tasks (e.g., natural language inference). In our experiments, we use the T0PP model with 3 billion parameters without fine-tuning on our dataset, and with the same prompt “summarize:” as the T5 models to maintain consistency across prompt-based models.

Experimental setup

For our experiments, all deep learning models except for T0PP and T5-No-Fine-Tune were trained using the abstracts and adaptations in the PLABA dataset. Each PubMed abstract is used as the source document, and the human generated adaptations are used as the references. The dataset was divided such that 70% was used for training, 15% for validation, and 15% for testing. In addition, the split was stratified such that all abstracts and adaptations for each question were grouped and contained exclusively in the training, validation, or testing set. We utilized the pre-trained models from Hugging Face48, and each trained model was fine-tuned with the AdamW optimizer and the default learning rate of 5e-5 for 20 epochs using V100X GPUs (32 GB VRAM) on a shared cluster. Maximum input sequence length was set to 512 tokens except for the BART models, for which the maximum was set to 1024. Validation loss was measured every epoch, and the checkpoint with the lowest validation loss was used for test set evaluation. Each trained model was also randomly seeded with 3 different sets of initial parameters to assess variability in model performance.

The inputs and outputs of the models differ between training and testing. During training, the two inputs at each step are the source abstract and its respective human generated adaptations. The output is the model’s automatically generated adaptation, which is compared to the human generated adaptations, rewarding the model for how closely its output matches these gold-standard references. While training occurs on the training set, the model is periodically evaluated on the validation set to monitor performance. During testing, the input is just the source abstract, while the output remains the model’s automatically generated adaptation. All metrics except SARI compare this output to the human generated adaptations to calculate a score; SARI compares the output to both the human generated adaptations and the source abstract. While trained models are first trained on the training and validation sets and then tested on the test set, zero-shot models like T0PP and T5-No-Fine-Tune skip training and are immediately tested on the test set. A visual overview of the experiments can be seen in Fig. 4.
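To make the setup concrete, the following is a minimal fine-tuning sketch under the stated configuration (AdamW, learning rate 5e-5, 20 epochs, checkpoint selection by validation loss). The dataset column names and preprocessing details are illustrative assumptions rather than our exact pipeline; the scripts in the GitHub repository noted under Usage Notes reproduce the full experiments.

```python
# Minimal fine-tuning sketch under the stated setup; column names and
# preprocessing are illustrative assumptions, not the exact pipeline.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # "abstract" and "adaptation" are assumed column names for source/target text
    inputs = tokenizer(["summarize: " + a for a in batch["abstract"]],
                       truncation=True, max_length=512)
    labels = tokenizer(batch["adaptation"], truncation=True, max_length=512)
    inputs["labels"] = labels["input_ids"]
    return inputs

args = Seq2SeqTrainingArguments(
    output_dir="plaba-t5",
    learning_rate=5e-5,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# train_ds / val_ds would be Hugging Face Datasets built from the 70%/15% question-level
# split and mapped through `preprocess`; training then runs as:
# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds,
#                          data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
# trainer.train()
```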

Fig. 4

Overview representing how PubMed abstracts and human adaptations are split for training and testing models.

Results

Table 4 shows the FKGL scores for the automatically generated adaptations, all of which were significantly lower than those of the abstracts except for T5-No-Fine-Tune, and significantly higher than those of the manually crafted adaptations, again except for T5-No-Fine-Tune (p < 0.05, Kendall’s tau). Table 5 shows the comparison between the automatically generated adaptations and the human generated adaptations with ROUGE and BLEU, and the comparison between the automatically generated adaptations, human generated adaptations, and source abstracts with SARI. Table 6 shows the comparison between the automatically generated adaptations and the source abstracts with ROUGE and BLEU. It is interesting to note that the automatically generated adaptations from the trained models and T0PP are more readable than the abstracts but less readable than the human generated adaptations according to FKGL scores. However, the T5 variant without fine-tuning generated adaptations that were less readable than even the source abstracts. Thus, the dataset gives the models sufficient training data to develop outputs that outperform the source abstracts in terms of readability. Regarding SARI, the trained models tend to perform comparably in terms of simplification. In terms of ROUGE, BLEU, and BERTScore, the automatically generated adaptations tend to share more n-grams and subwords with the source abstracts than with the human generated adaptations. This may be because the abstracts tend to be shorter than the adaptations, as seen in Table 2, making it easier for the automatically generated adaptations to share contiguous word sequences with the abstracts than with the human generated adaptations. In addition, the choice of metrics used for evaluation will influence the reported performance of a model. However, across all metrics in Tables 4 and 5, both zero-shot models, T5-No-Fine-Tune and T0PP, performed significantly worse than the trained models (p < 0.0001, Wilcoxon signed-rank test).

Table 4 FKGL scores for automatically generated adaptations.
Table 5 Automatically generated adaptations compared to human adaptations and (only for SARI) source abstracts.
Table 6 Automatically generated adaptations compared to source abstracts.

An example of the automatically generated adaptations from each model in response to the same abstract is shown in Table 7. The adaptations generated by the zero-shot models contain visibly fewer sentences, fewer details, and less explanation than those generated by the trained models. These results demonstrate that the PLABA dataset, in addition to being a high-quality test set, is useful for training generative deep learning models with the objective of text adaptation of scientific articles. Since there are no existing manually crafted datasets for this objective, PLABA can be a valuable dataset for benchmarking future research in this domain.

Table 7 Examples of adaptations created by PEGASUS, T5, BART-Base, BART-Large-CNN, T5-No-Fine-Tune, T0PP.

Usage Notes

We have added instructions in the README file of our OSF repository that show how to use the PLABA dataset. Scripts for pre-processing the dataset and evaluating adaptation algorithms on it are available in our GitHub repository, given below. To reproduce the experimental results, users can download the data from the OSF repository, download the code from the GitHub repository, and run the scripts on their machine to train and benchmark the models and obtain the same results.