A study of generative large language model for medical research and healthcare

There is enormous enthusiasm, as well as concern, about applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which were not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at the University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical natural language processing. We apply GatorTronGPT to generate 20 billion words of synthetic text. Synthetic NLP models trained using synthetic text generated by GatorTronGPT outperform models trained using real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows that there are no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human) and that physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.

People are enthusiastic about the potential of using LLMs to facilitate documentation of patient reports (e.g., a progress report), 3,4 improve diagnostic accuracy, 5 and assist various aspects of clinical care, 6,7 while at the same time being concerned about hallucinations and fabrications, 7,8 bias and stereotypes, 9 and risks to patient privacy and ethics. 10 Yet, this enthusiasm and these concerns are based on a general-purpose LLM, ChatGPT, which is not designed for healthcare use, since only a small fraction of its training text was biomedical. 1 Until now, it has been unclear how this disruptive technology can help medical research and potentially improve the quality of healthcare.
A language model is a statistical distribution used in natural language processing (NLP) to formulate the probability of a sequence of words or of the next word in a sequence. Surprisingly, when this is used as a self-supervised learning objective to train a specific neural network architecture named the transformer, and when the model is very large, with billions or hundreds of billions of parameters, important artificial intelligence (AI) abilities emerge. 12,13 The pretrained transformer architecture is known as a generative LLM, as it can generate human-like text. The conversational ability of LLMs is achieved using prompt-based text generation, 14 the key technology guiding LLMs to generate reasonable answers and contextual content.
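The next-word formulation can be made concrete with a deliberately tiny count-based bigram model. This is an illustrative sketch only; the toy corpus and function names are invented for intuition and are unrelated to the transformer objective used by GatorTronGPT:

```python
from collections import Counter, defaultdict

# Minimal bigram language model: P(next | prev) estimated from counts.
# Toy corpus and names are invented for illustration.
def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_prob(counts, prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

tokens = "the patient reports chest pain and the patient denies fever".split()
model = train_bigram(tokens)
# "the" is always followed by "patient" in this toy corpus:
assert next_word_prob(model, "the", "patient") == 1.0
```

A neural LLM replaces the count table with a transformer that maps the entire preceding context to a distribution over the vocabulary, but the underlying probabilistic objective is the same.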
This study aims to develop a generative LLM in the medical domain and evaluate its utility for medical research and healthcare. We trained a generative LLM, namely GatorTronGPT, using 82 billion words of de-identified clinical text 15 from University of Florida (UF) Health and 195 billion diverse English words from the Pile 16 dataset. We trained GatorTronGPT from scratch using the GPT-3 17 architecture (used by ChatGPT) and examined how the text generation ability of GatorTronGPT benefits medical research and healthcare. We formulated biomedical relation extraction and question answering using a unified text generation architecture 18 to evaluate how GatorTronGPT could benefit medical research using 6 benchmark datasets. To examine the utility of text generation in the clinical domain, we applied GatorTronGPT to generate 20 billion words of synthetic clinical text, which were used to train synthetic NLP models, denoted as GatorTronS ('S' stands for synthetic). We compared GatorTronS models with GatorTron, 15 a clinical NLP model trained with the same architecture but using 90 billion words of real-world text, on 5 different clinical NLP tasks to test the hypothesis that generative clinical LLMs can generate synthetic clinical text useful for clinical research. To test whether LLMs could be used in healthcare, two internal medicine subspecialists from endocrinology (NSO) and cardiology (MMA) manually evaluated 60 clinical paragraphs: 30 paragraphs written by GatorTronGPT randomly mixed with 30 real-world paragraphs written by UF Health physicians. Fig. 1 shows an overview of the study design. To our best knowledge, GatorTronGPT is the first generative LLM developed in the clinical domain using the GPT-3 architecture with 20 billion parameters, providing valuable insights on the opportunities and challenges of generative LLMs for medical research and healthcare.

GatorTronGPT outperformed all existing transformer models on the 3 relation extraction datasets, where GatorTronGPT with 20 billion parameters achieved the best F1-scores of 0.500, 0.494, and 0.419, respectively. GatorTronGPT improved the state of the art by 3%-10% compared with the second-best model, BioGPT. 18 We consistently observed performance improvement when scaling up the size of GatorTronGPT. Table 1.b compares GatorTronGPT with six existing biomedical transformers using three benchmark datasets for biomedical question answering. The GatorTronGPT model with 20 billion parameters achieved the best performance of 0.451 (a tie with BioLinkBERT) on the MedQA dataset and the second-best performance of 0.776 on the PubMedQA dataset. The performance of GatorTronGPT on the MedMCQA dataset is lower than that of a much larger LLM, Galactica, with 120 billion parameters. We observed a monotonic performance improvement when scaling up the size of GatorTronGPT. We generated 20 billion words of synthetic clinical text using GatorTronGPT. Tables 2 and 3 compare GatorTronS trained with different sizes of synthetic clinical text with ClinicalBERT and the original GatorTron, 15 our previously released clinical LLM trained using real-world clinical text.

The Turing test results show that, on average, less than half (49.2%) of the clinical notes were identified correctly, including 36.7% of the synthetic notes and 61.7% of the human notes (Table 4.a). Among the 30 synthetic notes written by GatorTronGPT, 9 (30.0%) and 13 (43.4%) were correctly labeled as 'AI' by the two physicians, respectively. Among the 30 human notes written by physicians, 17 (56.7%) and 20 (66.7%) were correctly labeled as 'Human', respectively.

Considering that GatorTronGPT was judged to be human in more than 30% of instances (the criterion from the Turing test), 25 GatorTronGPT passed the Turing test (p < 0.001).

Discussion
This study develops a generative clinical LLM, GatorTronGPT, using the GPT-3 architecture 13 with 277 billion words of clinical text mixed with general English text. We evaluate GatorTronGPT for biomedical NLP and healthcare text generation. Generative LLMs aspire to become a "Unified Field Theory" that unifies most fundamental NLP tasks in a single model architecture. It may still be too early to judge whether LLMs will become the one and only foundation model 12 for NLP, but it appears we are closer than ever.
Generative LLMs have the potential to impact medical research in many aspects. In addition to the performance improvement demonstrated in this study, generative LLMs provide a generalizable approach to biomedical NLP using prompt-based text generation, 27 which has better few-shot learning and transfer learning ability for delivering portable clinical NLP systems. 29,30 The prompt-based text generation of LLMs can potentially help compose treatment plans by integrating instructions from clinical guidelines and patients' historical records in EHRs. The conversational ability of LLMs provides opportunities for developing intelligent electronic health record (EHR) systems with human-like communication, 2 where healthcare providers, patients, and other stakeholders can communicate with EHR systems naturally. Industry stakeholders such as Epic and Nuance have been reported to be exploring these potentials. 31,32

Our Turing test focuses on (1) comparing synthetic and human notes in terms of linguistic readability and clinical relevance and (2) testing whether physicians can differentiate synthetic from human notes. The statistical tests show that there are no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human). Further, physicians cannot differentiate them (p < 0.001), suggesting the potential utility of GatorTronGPT for text generation in healthcare. The two physician evaluators found that the text written by GatorTronGPT generally lacks clinical logic, indicating that more research and development are needed to make this technology useful for healthcare. Our Turing test focuses on statistical differences, not utility in real-world clinical practice, which should be examined in future studies when this technology matures. Current general-purpose LLMs are designed for conversation as chatbots outside of healthcare, as there is only a small amount of biomedical text in their development datasets.
Therefore, the current use of ChatGPT for healthcare is more like a typical case of intended use versus actual use as described in medical device regulation. 33 Domain-specific LLMs are required for clinical applications. Due to the probabilistic nature of text generation, LLMs are prone to confabulation or hallucination, which might be amusing in chatbots but is dangerous for healthcare. Future studies should examine strategies to keep hallucinations to a minimal level to make LLMs safe for healthcare. As with any medical AI application, it is necessary to carefully examine the potential limitations, biases, and risks of this disruptive new technology to guide its application and make it an "approved" AI-enabled medical device 34 if it turns out that it can help healthcare. We evaluated the text generation capacity of GatorTronGPT without using human instructions, which is a typical zero-shot learning setting. Future studies should examine whether clinical text generation can be improved and controlled using human instructions, for example with reinforcement learning from human feedback 35 (RLHF, used by ChatGPT) and P-tuning 36 algorithms.

Data Source
This study uses a large collection of 82 billion words of clinical narratives from the UF Health Integrated Data Repository (IDR) and 195 billion diverse English words from the Pile 16 corpus. This study was approved by the UF Institutional Review Board (IRB202102223). At UF Health, we collected approximately 290 million clinical notes written from 2011 to 2021 in over 126 departments, covering approximately 2 million patients and 50 million encounters from inpatient, outpatient, and emergency settings. The detailed patient distribution by age, gender, race, and ethnicity, and the distribution of clinical notes by note type and clinical department, can be found in our previous study 15. We merged the UF Health clinical corpus with the Pile 16 dataset to generate a large corpus with 277 billion diverse clinical and English words. We performed minimal preprocessing for the Pile dataset and applied a de-identification system to remove the 18 PHI categories defined in the Health Insurance Portability and Accountability Act (HIPAA) from the UF Health notes. The detailed preprocessing steps are described in the Supplement.

Train GatorTronGPT from scratch
Configuration. We trained GatorTronGPT using two configurations (5 billion parameters and 20 billion parameters) and determined the number of layers, hidden sizes, and number of attention heads according to the guidelines for optimal depth-to-width parameter allocation proposed by Levine et al. 37 as well as our previous experience in developing GatorTron 15. The 5 billion model has 24 layers, a hidden size of 4,096, and 32 attention heads; the 20 billion model has 44 layers, a hidden size of 6,144, and 48 attention heads. We trained the 5 billion model using 2-way tensor model parallelism with a batch size of 1,120 and a learning rate of 1.200E-05. We trained the 20 billion model using 8-way tensor model parallelism with a batch size of 560 and a learning rate of 1.000E-05. We adopted a dropout rate of 0.1.
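As a rough sanity check, the layer and hidden-size choices above can be tied back to the stated parameter budgets. The layer/hidden/head numbers are from this study; the ~12 × layers × hidden² estimate (attention and MLP weights only, embeddings excluded) is a common rule of thumb, not the authors' exact accounting:

```python
# Approximate decoder-only transformer parameter count:
# ~12 * n_layers * hidden_size^2 (attention + MLP blocks, no embeddings).
def approx_params(n_layers, hidden_size):
    return 12 * n_layers * hidden_size ** 2

configs = {
    "GatorTronGPT-5B":  {"n_layers": 24, "hidden_size": 4096, "n_heads": 32},
    "GatorTronGPT-20B": {"n_layers": 44, "hidden_size": 6144, "n_heads": 48},
}
for name, c in configs.items():
    billions = approx_params(c["n_layers"], c["hidden_size"]) / 1e9
    print(f"{name}: ~{billions:.1f}B parameters")  # ~4.8B and ~19.9B
```

The estimates land close to the nominal 5 billion and 20 billion figures, consistent with the depth-to-width allocation described above.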

Training from scratch
We inherited the GPT-3 architecture implemented in Megatron-LM 38 and trained the GatorTronGPT models from scratch with the default GPT-3 loss function. 13 We used a total of 560 NVIDIA DGX A100 GPUs from 70 SuperPOD nodes at UF's HiPerGator-AI cluster.

GatorTronGPT for end-to-end biomedical relation extraction and question answering
End-to-end relation extraction is an NLP task to identify triplets <concept1, concept2, relation> from biomedical text. Question answering is to identify the answer for a given question and context. Following previous studies 18,39, we approached the two tasks using a unified prompt-based text generation architecture. Specifically, we adopted a fixed-LLM prompt-tuning strategy 40 that attaches a continuous embedding (i.e., virtual tokens) to the input sequence [virtual tokens; x; y] as a soft prompt to control the text generation; the LLM itself is not changed during training. We provide details in the Supplement.
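The fixed-LLM prompt-tuning idea can be sketched in a few lines: trainable "virtual token" embeddings are prepended to the input embeddings, and only those embeddings would receive gradient updates while the LLM weights stay frozen. All sizes and array names here are invented for illustration, not GatorTronGPT's actual dimensions:

```python
import numpy as np

# Sketch of fixed-LLM prompt tuning with a soft prompt.
rng = np.random.default_rng(0)
n_virtual, seq_len, d_model = 10, 16, 64

soft_prompt = rng.normal(size=(n_virtual, d_model))   # trainable parameters
token_embeds = rng.normal(size=(seq_len, d_model))    # from the frozen LLM

# [virtual tokens; x] is what the frozen model actually consumes:
model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
assert model_input.shape == (n_virtual + seq_len, d_model)
```

During training, gradients flow back only into `soft_prompt`, which is why this strategy is cheap enough to adapt a 20-billion-parameter model to each task.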
Task 1 - End-to-end biomedical relation extraction. We compared the two GatorTronGPT models with four existing transformer models, including GPT-2, 41 REBEL, REBEL-pt, 27 and BioGPT, 18 on end-to-end relation extraction using 3 benchmark datasets: drug-drug interaction 42 (DDI), BioCreative V chemical-disease relation 43 (BC5CDR), and drug-target interaction 44 (KD-DTI).

Task 2 - Biomedical question answering. We compared GatorTronGPT with six existing transformer models using three widely used benchmark datasets: PubMedQA 45, a biomedical question answering dataset collected from PubMed abstracts, which requires answering questions with 'yes/no/maybe'; MedMCQA 46, a large-scale multiple-choice question answering dataset designed to address real-world medical entrance exam questions covering 2,400 healthcare topics and 21 medical subjects; and MedQA-USMLE 47, a multiple-choice dataset collected from professional medical board exams. These three question answering datasets have been widely used in recent studies 18,45-47 for the evaluation of generative LLMs.
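One common way to cast end-to-end relation extraction as text generation is to linearize the target triplets into a string the model learns to emit, then parse the generated string back into structured form. The template below is a hypothetical illustration, not the exact format used with GatorTronGPT, and it assumes entity names contain no commas or semicolons:

```python
# Sketch: relation extraction as sequence generation via triplet
# linearization. Format and example triplets are invented.
def linearize(triplets):
    return " ; ".join(f"<{h}, {t}, {r}>" for h, t, r in triplets)

def parse(generated):
    out = []
    for chunk in generated.split(";"):
        h, t, r = (p.strip() for p in chunk.strip().strip("<>").split(","))
        out.append((h, t, r))
    return out

triplets = [("warfarin", "aspirin", "interacts_with"),
            ("nifedipine", "hypotension", "causes")]
# Round trip: linearize for training targets, parse model output back.
assert parse(linearize(triplets)) == triplets
```

With such a scheme, the same frozen LLM plus a task-specific soft prompt can handle DDI, BC5CDR, and KD-DTI simply by changing the training targets.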

Task 3 -GatorTronGPT for synthetic clinical text generation
We sought to test the hypothesis that LLMs can generate synthetic clinical text to train synthetic NLP models useful for medical research. We applied GatorTronGPT to generate synthetic clinical text according to a set of seeds without any fine-tuning, which is a typical zero-shot learning setting. Then, using the generated synthetic clinical text, we trained synthetic transformer-based NLP models using our previous BERT-based GatorTron architecture 15, denoted as GatorTronS ('S' stands for synthetic). We trained GatorTronS models using different sizes of synthetic clinical text and compared them with the original GatorTron-base model trained using real-world text to examine how the size of the synthetic clinical text affects performance. To make the comparison fair, we trained GatorTronS using the same architecture and number of parameters (i.e., 345 million) as the GatorTron-base architecture. We provide detailed information in the Supplement.

Synthetic clinical text generation
Following previous studies 48, we approached synthetic clinical text generation as an iterative sampling procedure and applied top-p (i.e., nucleus) sampling and temperature sampling to balance the diversity and quality of the generated clinical text. 48 We set the top-p sampling parameter at 0.9 and the temperature sampling parameter at 1.2 according to our empirical assessment. We sampled the beginning 15 tokens from all sections of the de-identified notes of the MIMIC III database 49 and generated approximately 8 million prompts. We also tried several random seeds in GatorTronGPT to generate multiple documents from one prompt. We limited clinical text generation to 512 tokens and stopped generation when the maximum length was reached. We provide detailed information in the Supplement.
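The sampling step above can be sketched as follows, using the settings reported in this study (top_p = 0.9, temperature = 1.2). The logits in the usage example are invented for illustration; this is a generic nucleus-sampling sketch, not code from the study:

```python
import numpy as np

# Temperature + nucleus (top-p) sampling for one generation step.
def sample_next_token(logits, top_p=0.9, temperature=1.2, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]         # most likely first
    cumulative = np.cumsum(probs[order])
    # smallest prefix whose cumulative probability reaches top_p:
    nucleus = order[: np.searchsorted(cumulative, top_p) + 1]
    renormed = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renormed))

# When one token dominates the nucleus, sampling is deterministic:
assert sample_next_token([10.0, 0.0, 0.0]) == 0
```

Higher temperature flattens the distribution (more diversity), while a lower top-p truncates the unreliable tail (higher quality); the study balances the two at 1.2 and 0.9.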

Synthetic NLP model development
We controlled generation to produce different sizes of synthetic clinical text, including 1 billion, 5 billion, 10 billion, and 20 billion words, and developed corresponding synthetic NLP models, denoted as GatorTronS. Following our previous study 15, we trained GatorTronS using the same architecture as GatorTron, a BERT architecture with 345 million parameters.

Comparison with existing transformer models
We compared GatorTronS models trained using different amounts of synthetic clinical text with ClinicalBERT 50 and the original GatorTron. 15

Task 4 -Turing test of text generation for clinical practice
We randomly sampled 30 narrative sections of real-world UF Health clinical notes, including "past medical history", "history of present illness", "assessment/plan", and "chief complaint".
For each of the 30 sections, we extracted the beginning 15 tokens as a seed for GatorTronGPT to generate a synthetic paragraph of up to 512 tokens. We cut the 30 real-world clinical sections off at 512 tokens, removed all formatting information, and randomly mixed them with the 30 synthetic sections written by GatorTronGPT. Two UF Health physicians (NSO, MMA) manually reviewed the 60 paragraphs to evaluate: (1) linguistic readability on a 1 (worst) to 9 (best) scale; (2) clinical relevance and consistency on a 1 to 9 scale; and (3) whether each paragraph was written by a human physician or by GatorTronGPT. Percent agreement and Gwet's AC1 were calculated to evaluate interrater reliability. 51
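For two raters and binary labels ('Human' vs. 'AI'), percent agreement and Gwet's AC1 can be computed as below. The ratings in the example are invented for illustration and are not the study's data:

```python
# Percent agreement and Gwet's AC1 for two raters.
def gwet_ac1(rater_a, rater_b):
    n = len(rater_a)
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # percent agreement
    categories = set(rater_a) | set(rater_b)

    # pi_k: category k's share of all 2n ratings pooled across raters
    def pi(k):
        return (rater_a.count(k) + rater_b.count(k)) / (2 * n)

    # chance agreement: sum_k pi_k * (1 - pi_k) / (K - 1)
    pe = sum(pi(k) * (1 - pi(k)) for k in categories) / (len(categories) - 1)
    return (pa - pe) / (1 - pe)

a = ["Human", "Human", "AI", "AI", "Human", "AI"]
b = ["Human", "AI",    "AI", "AI", "Human", "AI"]
print(round(gwet_ac1(a, b), 3))  # 0.676
```

Unlike Cohen's kappa, AC1 remains stable when label prevalence is skewed, which is why it is a common companion to raw percent agreement in rater studies like this one.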

Introduction to existing transformer models for comparison
GPT-2. GPT-2 was trained using text data from 8 million webpages with 1.5 billion parameters, a scale-up of the first-generation GPT model. The GPT model outperformed previous transformer models on 9 of 12 NLP tasks, while the GPT-2 model further demonstrated text generation ability, which laid the foundation for complex NLP tasks such as machine reading comprehension and question answering.
We trained GatorTronGPT with 5 billion and 20 billion parameters on the 277 billion words of mixed clinical and general English text. Training the 5 billion model took approximately 6 days and the 20 billion model about 20 days on 560 A100 80G GPUs from 70 NVIDIA DGX nodes using the NVIDIA SuperPOD reference cluster architecture. Fig. 2 shows the training and validation loss for the two sizes of GatorTronGPT models.

Fig 1. Develop a clinical generative large language model, GatorTronGPT, for biomedical natural language processing, clinical text generation, and healthcare text evaluation. a, Train GatorTronGPT from scratch using the GPT-3 architecture with up to 20 billion parameters. b, Solve biomedical relation extraction and question answering using a unified P-tuning-based text generation architecture. c, Apply GatorTronGPT to generate 20 billion words of synthetic clinical text, which were used to train a synthetic natural language processing model, GatorTronS. d, Turing evaluation of 30 paragraphs of text written by GatorTronGPT mixed with 30 real-world paragraphs written by UF Health physicians. TrM: transformer unit; B: billion.

For clinical concept extraction, the GatorTronS model trained using 20 billion words of synthetic clinical text achieved the best F1-score on two of the three benchmark datasets, and the GatorTronS model trained using 5 billion words of synthetic clinical text achieved the best F1-score on one (the 2018 n2c2 challenge) of the three. GatorTronS outperformed the original GatorTron model by >1% in F1-score on all three benchmark datasets. For medical relation extraction, the GatorTronS model trained using 10 billion words of synthetic clinical text achieved the best F1-score of 0.962 on the 2018 n2c2 challenge benchmark, which is comparable with the original GatorTron model (0.960). For semantic textual similarity and natural language inference, the GatorTronS model trained using 20 billion words of synthetic clinical text achieved the best evaluation scores, outperforming the original GatorTron by >1%. For question answering, the GatorTronS model trained using 10 billion words of synthetic clinical text achieved the best score on the emrQA benchmark focusing on medications and on the exact match evaluation for relations; the GatorTronS model trained using 20 billion words of synthetic clinical text achieved the best F1-score on the emrQA relation benchmark. GatorTronS outperformed the original GatorTron model trained using real-world clinical text by >1%. The comparison of GatorTronS models trained using different sizes of synthetic clinical text shows that, with a minimum of 5 billion words of synthetic clinical text, we can train a synthetic GatorTronS model with performance comparable to GatorTron, a transformer of the same size and architecture trained using 90 billion words of clinical mixed with general English text.

Table 4. Turing test results. a. Number and percentage of correctly identified notes; b. Means and standard deviations of the quality measures; c. Two examples of synthetic clinical text generated by GatorTronGPT. Text generation stops at a maximum of 512 tokens. Pass Turing test: both physicians labeled it 'Human'; Fail Turing test: both physicians labeled it 'AI'.
We evaluate GatorTronGPT for medical research and healthcare, focusing on the key function of text generation. GatorTronGPT achieves state-of-the-art performance on 4 of 6 biomedical NLP benchmark datasets, demonstrating its benefit for medical research. The experimental results show that GatorTronGPT can generate synthetic clinical text for developing synthetic clinical NLP models (i.e., GatorTronS), which achieve better than or comparable performance to NLP models trained using real-world clinical text, demonstrating the utility of synthetic clinical text generation for clinical research. The physicians' evaluation of the synthetic clinical text shows that GatorTronGPT can generate clinical content with linguistic readability comparable to real-world clinical notes. This study provides valuable insights regarding the opportunities and challenges of generative LLMs for medical research and healthcare.

We discover an important utility of generative LLMs for synthetic clinical text generation. There has been a gap in accessing large-scale clinical text and sharing clinical NLP models due to the sensitive nature of clinical text and the fact that automatic de-identification systems cannot remove 100% of protected health information (PHI). Our study shows that GatorTronS, a synthetic transformer model trained using 5 billion words of synthetic clinical text generated by GatorTronGPT, can achieve better than or comparable performance on 5 clinical NLP tasks compared with GatorTron 15, a transformer model of the same structure and size trained using a much larger real-world clinical corpus (90 billion words). Potential reasons include (1) redundancy in real-world clinical text and (2) the greater diversity of the synthetic clinical text generated by GatorTronGPT. A previous study 26 reported that, by augmenting real-world clinical training data with additional human-annotated synthetic text generated by a smaller generative LLM, GPT-2, NLP models can achieve better performance. Our study further demonstrates that, without additional human annotation or augmentation of training data, a larger clinical GPT-3 model can generate synthetic clinical text to train synthetic NLP models that outperform NLP models trained using real-world clinical text. Text generation using clinical LLMs mitigates the risk of exposing patient privacy, improving access to large-scale clinical text and the sharing of state-of-the-art NLP models, thus enabling next-generation clinical text analytics for medical research.
We trained GatorTronGPT on UF's HiPerGator-AI cluster, leveraging both data-level and model-level parallelism implemented in the Megatron-LM package 38 (see https://github.com/NVIDIA/Megatron-LM for more details). We monitored training progress using the training loss and a validation loss computed on 3% of the data and stopped training when there was no further improvement.

ClinicalBERT. ClinicalBERT 50 is a clinical transformer model trained using biomedical literature and clinical notes from the MIMIC III database. We compared GatorTronS with ClinicalBERT and with GatorTron 15, the current largest clinical transformer model, trained using >90 billion words of text, on 5 clinical NLP tasks: clinical concept extraction (i.e., named entity recognition [NER]), medical relation extraction, semantic textual similarity, natural language inference, and question answering.
REBEL and REBEL-pt. REBEL is a transformer model based on the BART architecture designed for end-to-end relation extraction using sequence-to-sequence modeling, which outperformed previous relation extraction models based on classification. REBEL-pt is an enhanced version of REBEL, further fine-tuned using triplets derived from Wikipedia hyperlinks.

BioGPT. BioGPT is a domain-specific generative transformer-based LLM developed using the GPT-2 architecture and PubMed biomedical literature, which achieved good performance on NLP tasks including relation extraction and question answering in the biomedical domain.

Table 1.
Comparison of GatorTronGPT with existing transformer models for a. biomedical relation extraction and b. question answering.

Table 2.
Comparison of GatorTronS with existing transformer-based LLMs for clinical concept extraction and medical relation extraction.
B: billion words of text. Clinical concepts in the 2010 i2b2 and 2012 i2b2 challenges: problems, treatments, and lab tests; clinical concepts in the 2018 n2c2 challenge: drugs, adverse events, and drug-related attributes (e.g., dose). Medical relations in the 2018 n2c2 challenge: drug-induced adverse events. The best evaluation scores are bolded. NA: scores not reported.

Table 3.
Comparison of GatorTronS with existing transformer-based LLMs for semantic textual similarity, natural language inference, and question answering.

Benchmarks: semantic textual similarity (2019 n2c2 22); natural language inference (MedNLI 23); question answering (emrQA Medication 24 and emrQA Relation 24).
B: billion words of text.The best evaluation scores are bolded.

Table 4.b summarizes the means and standard deviations of the linguistic readability and the clinical relevance and consistency. Statistical tests show that there is no significant difference between notes written by GatorTronGPT and by human physicians in either linguistic readability (p = 0.22) or clinical relevance and consistency (p = 0.91). Table 4.c shows two examples of clinical paragraphs written by GatorTronGPT. Percent agreement and interrater reliability were found to be good or excellent, as summarized in Supplement Tables S1 and S2.
a. Percentage of notes correctly identified by human reviewers.