ACI-BENCH: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

Recent breakthroughs in generative models such as GPT-4 have spurred a re-imagining of how these models can be used ubiquitously across applications. One area that stands to benefit from improvements in artificial intelligence (AI) is healthcare. Generating notes from doctor-patient encounters, together with the associated electronic medical record documentation, is one of the most arduous and time-consuming tasks for physicians, and thus a prime potential beneficiary of advances in generative models. However, with such advances, benchmarking is more critical than ever. Whether studying model weaknesses or developing new evaluation metrics, shared open datasets are an imperative part of understanding the current state of the art. Unfortunately, because clinic encounter conversations are not routinely recorded and are difficult to share ethically due to patient confidentiality, there are no sufficiently large clinic dialogue-note datasets to benchmark this task. Here we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH) corpus, the largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue. We also present the benchmark performances of several common state-of-the-art approaches.


Background & Summary
Healthcare needs are an inescapable facet of daily life. Current patient care at medical facilities involves not only a primary care provider, but also pharmacy, billing, imaging, labs, and specialist care. For every encounter, a clinical note is created to document clinician-patient discussions and the patient's medical conditions. These notes serve as a vital record for clinical care and for communication with patients and other members of the care team, and they outline future plans, tests, and treatments. Similar to typical meeting summaries, these documents should highlight important points while compressing itemized instances into condensed themes; unlike typical meeting summaries, clinical notes are purposely organized into semi-structured documents, contain telegraphic and bullet-point phrases, use medical jargon that does not appear in the original conversation, and often reference outside information from the electronic medical record, including prose-written content or injections of structured data.
While the widespread adoption of electronic health records (EHRs), spurred by the HITECH Act of 2009, has led to greater health information availability and interoperability, it has also shifted a massive documentation burden onto clinicians. Physicians have expressed concerns that writing notes in EHRs takes more time than traditional paper or dictation methods. As a result, notes may not be completed and accessible to other team members until long after rounds 1 . Furthermore, as another unintended consequence of EHR use, electronic notes have been criticized for their poor readability, incompleteness, and excessive use of copy and paste 2 . To save time and adequately capture details, clinicians may choose to write their notes during their time with a patient. This may detract from the clinicians' attention toward the patient (e.g. in reading non-verbal cues), and may leave patients feeling a want of empathy 3 . Alternatively, some clinicians or provider systems may hire medical assistants or scribes to partake in some or all of the note creation process, which has been linked with improved productivity, increased revenue, and improved patient-clinician interactions 4 . However, such arrangements are costly and, more importantly, often require a substantial investment of time from providers in managing and training their scribes 5 - a problem that is often multiplied by the high attrition rates in the field.
One promising solution is the use of automatic summarization to capture and draft notes, which are then reviewed by a clinician. This technology has attracted increasing attention in the last 5 years as a result of several key factors: (1) the improvement of speech-to-text technology, (2) the widespread adoption of electronic medical records in the United States, and (3) the rise of transformer models. Several works have adopted early technology in this area, including statistical machine translation methods, RNNs, transformers, and pre-trained transformer models [6][7][8][9][10][11] .
However, a massive bottleneck in understanding the state of the art is the lack of publicly shareable data for training and evaluation 12 . This challenge is inherent in the required data's characteristics: (1) meeting audio and transcripts from medical encounters are not typically recorded and saved, and (2) medical information is highly personal and sensitive data that cannot be easily and ethically shared publicly. Private companies may construct or acquire their own private datasets; however, results and algorithms then cannot be systematically compared. Recent ground-breaking performances by large language models such as ChatGPT and GPT-4 provide promising general model solutions; however, without common datasets that may be studied publicly, it would be impossible for the scientific community to understand strengths, weaknesses, and future directions.

dataset | description | src-len (tok/turns) | target-len (tok/sent) | size | open
MTS-dialogue 13 | dialogue-note snippets where conversations are created using clinical note sections | | | | Y
 6 | real clinical dictation-note pairs | 616/1 | 550/- | 9875 | N
Nuance 7 | real clinical dialogue-note pairs | 972 avg/- | 452 total/- | 802k | N

Table 1. Comparable corpora for doctor-patient dialogue2note generation. The majority of datasets are proprietary and unshareable for community evaluation. (src-len = source/transcript length, target-len = target/note length, - = unreported)

In this paper, we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH) corpus. The corpus, created by domain experts, is designed to model three variations of model-assisted clinical note generation from doctor-patient conversations. These include conversations with (a) calls to a virtual assistant (e.g. required use of wake words or prefabricated, canned phrases), (b) unconstrained directions or discussions with a scribe, and (c) natural conversations between a doctor and patient. We also provide data to experiment between using human transcription and automatic speech recognition (ASR); or between
ASR and corrected ASR. Table 1 shows a comparison of the 8 corpora described in state-of-the-art work. Only two other similar corpora are publicly available. primock57 14 contains a small set of 57 encounters. MTS-dialog 13 contains ∼1700 samples; however, its focus is on dialogue snippets rather than full encounters. To our knowledge, ACI-BENCH is the largest and most comprehensive corpus publicly available for model-assisted clinical note generation.
In the following sections, we provide details of the ACI-BENCH corpus. We (1) discuss the dataset construction and cleaning, (2) provide statistics and the corpus structure, (3) describe our content validation methods and comparison with real data, and (4) quantify several diverse baseline summarization methods on this corpus.

Data Creation
Clinical notes may be written by the physician themselves or in conjunction with a medical scribe or assistant; alternatively, physicians may choose to dictate the contents of an entire note to a human transcriptionist or an automatic dictation tool. In cases with human intervention (scribe-assisted or transcriptionist-assisted), physician speech may include a mixture of commands (e.g. "newline", "add my acne template"), free text requiring almost word-for-word copying (e.g. "To date, the examinee is a 39 year-old golf course maintenance worker") 6 , or free-text communication to the medical assistant (e.g. "let's use my normal template, but only keep the abnormal parts", "can you check the date and add that in?"). Trained medical scribes participating in the clinic visit are expected, in addition to following directions from the doctor, to listen in on the patient-doctor dialogue and generate clinical note text independently. To mirror this reality, the ACI-BENCH corpus consists of three subsets representing common modes of note generation from doctor-patient conversations:

virtual assistant (virtassist): In this mode, the doctor may use explicit terms to activate a virtual assistant device (e.g. "Hey Dragon show me the diabetes labs") during the visit. This necessitates some behavioral changes on the part of the provider.
virtual scribe (virtscribe): In this mode, the doctor may expect a separate scribe entity (automated or otherwise) to help create the clinical note. This subset is characterized by preambles (e.g. short patient descriptions prior to a visit) and after-visit dictations (e.g. used to specify non-verbal parts of the visit such as the physical exam, or to dictate the assessment and plan). The rest of the doctor-patient conversation is natural and undisturbed.

ambient clinical intelligence (aci): This data is characterized by natural conversation between a patient and a doctor, without explicit calls to a virtual assistant or additional language addressed to a scribe.
Transcripts for the virtassist and virtscribe subsets were created by a team of 5+ medical experts, including medical doctors, physician assistants, medical scribes, and clinical informaticians, based on experience and on studying real encounters. The aci subset was created with a certified doctor and a volunteer lay person, who role-played a real doctor-patient encounter given a list of symptom prompts. Clinical notes were generated using an automatic note generation system, then checked and re-written by domain experts (e.g. medical scribes or physicians). The virtscribe subset includes both the human transcription and an ASR transcript; the virtassist and aci subsets were created with only a human transcription and only an ASR transcript available, respectively.

Data Cleaning and Annotation
Our final dataset was distilled from encounters originally created for marketing demonstration purposes. During this initial dataset creation, imaginary EHR injections were placed within the note to contribute to realism, though many had no basis in the conversation. Although EHR inputs, independent of data intake from a conversation, are a critical aspect of real clinical notes, in this dataset we do not model EHR input or output linkages with the clinical note (e.g. smart links to structured data such as vitals values, structured survey data, order codes, and diagnosis codes). After human annotation identified these unsupported text spans, they were automatically removed from the clinical note. Some unsupported items were purposely left unmarked in cases where removal would degrade the quality or meaning of the note.
In order to identify information in the note text that is unsupported by the transcript, we created systematic annotation guidelines for labeling unsupported note sentences. Unsupported information included items such as reasoning for treatment (which may not be part of the original conversation) or information from imaginary EHR inputs (e.g. vitals). Examples of the different types of unsupported information are included in Table 2. We tasked four independent annotators with medical backgrounds to complete this task. The partial span overlap agreement was 0.85 F1. Marked text spans were removed during automatic processing. Because the datasets were originally created and demonstrated over a short period, these notes were created under greater time constraints and with less review. To ensure quality, four annotators identified and corrected note errors, such as inconsistent values. Finally, as the ACI-BENCH dataset used ASR transcripts, there were cases where the note and the transcript would conflict due to ASR errors. For example, "hydronephrosis" in the clinical note may be wrongly transcribed as "high flow nephrosis". Another example is names: "Castillo" may be transcribed as "kastio". As part of this annotation, we tasked annotators to identify these items and provide corrections. After annotation, the data was processed such that note errors were corrected and unsupported note sentences were removed. To study the effect of ASR errors, ASR transcripts were processed into two versions: (a) original and (b) ASR-corrected (ASR outputs corrected by humans). After automatic processing, encounters were again manually reviewed for additional misspelling and formatting issues.

Figure 1. As an example, in the left note, "past medical history" contents are written in the "history" portion of the note on the right. To separate the full note target into smaller texts and minimize data sparsity problems when modeling by individual sections, notes are partitioned into separate SUBJECTIVE, OBJECTIVE_EXAM, OBJECTIVE_RESULTS, and ASSESSMENT_AND_PLAN contiguous divisions. This also allows evaluation and generation at a higher granularity compared to the full note level.

Note Division Definition
Motivated by a need to simplify clinical note structure, mitigate sparsity problems, and simplify evaluation, in this section we describe our system for segmenting a full clinical note into contiguous divisions.
Clinical notes are semi-structured documents with hierarchical organization. Each physician, department, and institution may have their own set of commonly used formats; however, no universal standard exists 15 . The same content can appear in multiple forms structured under different formats. This is illustrated in the subjective portions of two side-by-side notes in Figure 1. In this example, contextual medical history appears in its own sections (e.g. "chief complaint (cc)", "history of present illness (hpi)", and "past medical history") in the report on the left, and is merged into one history section in the report on the right. These variations in structure pose challenges for both generation and evaluation. Specifically, if evaluating by fine-grained sections in the reference, it is possible that generated notes may include the same content in other sections. Likewise, generating with fine-grained sections would require sufficient samples of each section; however, as not every note has every type of section, the sample size becomes sparser. Finally, it is important to note that current state-of-the-art pre-trained embedding-based evaluation metrics (e.g. BERTScore, BLEURT, BARTScore) are limited by their original trained sequence length, which is typically shorter than our full document lengths. This is illustrated in Figure 2, where for one system (Text-davinci-003) the length of the concatenated reference and system summaries typically far exceeds the typical pre-trained BERT-based 512-subtoken limit.
To simplify training and evaluation, as well as to maintain larger samples of data, we partition notes by grouping multiple sections together into four divisions, as shown in Figure 1. These divisions were inspired by the SOAP standard: the SUBJECTIVE includes items taken during the verbal exam and typically written in the chief complaint, history of present illness, and past social history; the OBJECTIVE_EXAM includes content from the physical examination on the day of the visit; the OBJECTIVE_RESULTS includes diagnostics taken prior to the visit, including laboratory or imaging results; and the ASSESSMENT_AND_PLAN includes the doctor's diagnosis and planned tests and treatments 16 . In our dataset, the divisions are contiguous and appear in the order previously introduced. Another practical benefit of partitioning the note into contiguous divisions is the greater ability to leverage pretrained sequence-to-sequence models, which are typically trained with shorter sequences. Furthermore, evaluation at a sub-note level allows a greater resolution for assessing performance.

Figure 2. BERT subtoken lengths of concatenated gold/system summaries (test1, Text-davinci-003 system) for the doctor-patient dialogue to clinical note generation task. As embedding-based models require encoding the concatenated reference and hypothesis, it would be difficult to fairly evaluate the corpus using current pretrained BERT models, which have a 512-subtoken limit.

Data Statistics
The full dataset was split into train, validation, and three test sets. Each subset was represented in the splits through randomized stratified sampling. Test sets 1 and 2 correspond to the test sets from ACL ClinicalNLP MEDIQA-Chat 2023 TaskB and TaskC, respectively. Test 3 corresponds to TaskC of CLEF MEDIQA-SUM 2023. The frequencies of each data split are shown in Table 3.

Data Records
The ACI-BENCH Corpus can be found at [LINK TO BE UPDATED].Code for pre-processing, evaluation, and running baselines can be found in [LINK TO BE UPDATED].

Folder and naming organization
Data used in the ACL ClinicalNLP MEDIQA-Chat and CLEF MEDIQA-SUM challenges are located in the challenge_data folder, whereas ASR experiment data is located in the src_experiment_data folder. Each data split has two associated files: a metadata file and a data file (further described below). Train, validation, test1, test2, and test3 data files are prefixed with the following names: train, valid, clinicalnlp_taskB_test1, clinicalnlp_taskC_test2, and clef_taskC_test3, respectively. Source experiment data files offer subset-specific versions of train/validation/test in which the transcript may be one of the alternate forms: ASR or ASR-corrected. These files are named according to the pattern {split}_{subset}_{transcript-version}. For example, train_virtscribe_humantrans.csv gives the training data from the virtscribe subset with the original human transcription, whereas train_virtscribe_asr.csv gives the ASR transcription version.
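As an illustration, the naming convention above can be expressed as a one-line helper (this function is hypothetical and does not ship with the corpus; it simply mirrors the documented {split}_{subset}_{transcript-version}.csv pattern):

```python
def src_experiment_filename(split: str, subset: str, transcript_version: str) -> str:
    """Build a source-experiment data filename following the
    {split}_{subset}_{transcript-version}.csv pattern."""
    return f"{split}_{subset}_{transcript_version}.csv"

# The virtscribe training split with the original human transcription:
print(src_experiment_filename("train", "virtscribe", "humantrans"))
# train_virtscribe_humantrans.csv
```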

Metadata files (*_metadata.csv)
Metadata files include columns for the dataset name (e.g. virtassist, virtscribe, aci), id, encounter_id, doctor_name, patient_firstname, patient_familyname, gender, chief complaint (cc), and secondary complaints (2nd_complaints). Both id and encounter_id can be used to identify a unique encounter. The encounter_id values were the identifiers used for the MEDIQA-Chat and MEDIQA-SUM 2023 competitions. The id identifier also denotes the specific subset.

Transcript/Note files (*.csv)
In the source-target data files, the transcript and note text are given along with the dataset name and id or encounter_id. These files may be joined with the metadata files using either id or encounter_id: encounter_id should be used for challenge data, whereas id should be used for the source experiment data.
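A minimal sketch of such a join, assuming the CSV rows have been loaded (e.g. with csv.DictReader) as lists of dicts; the helper and the example id value are hypothetical, and the column names follow the descriptions above:

```python
def join_on(key, data_rows, meta_rows):
    """Join transcript/note rows with metadata rows on a shared key
    (encounter_id for challenge data, id for source-experiment data)."""
    meta_by_key = {row[key]: row for row in meta_rows}
    return [
        {**row, **meta_by_key[row[key]]}
        for row in data_rows
        if row[key] in meta_by_key
    ]

# Example with in-memory rows (hypothetical id and values):
data = [{"id": "example_id_1", "note": "CHIEF COMPLAINT ..."}]
meta = [{"id": "example_id_1", "dataset": "aci", "cc": "knee pain"}]
merged = join_on("id", data, meta)
```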

Content validation
After dataset creation and cleaning, an additional content validation step was conducted to ensure medical soundness. For each encounter, medical annotators were tasked with reviewing each symptom, test, diagnosis, and treatment from the encounter. In cases where the medical annotation specialist was unsure of certain facts (e.g. can drug X be prescribed at the same time as drug Y?), the encounter underwent two possible additional reviews. First, if the phenomenon in question could be identified in a 3M+ store of proprietary clinical notes (which we will refer to as the CONSULT dataset) 4 , we deemed the information credible. Alternatively, if the information could not be identified by the first approach, the question was escalated to a clinical expert annotator. Encounters with unexplainable or severe logical or medical problems identified by the medical annotators were removed (e.g. using a medication for urinary tract infection for upper respiratory infection).

Comparison with real data
To study differences between the ACI-BENCH dataset and a set of real encounters, we conducted a statistical comparison with 163 randomly chosen family medicine clinical encounters (including paired human transcriptions and corresponding clinical notes) with in-depth alignment annotation, drawn from the CONSULT dataset. Tables 4 and 5 show the statistical comparison between the 20 encounters in the validation set (aci-validation) and the CONSULT encounters. In general, the ACI-BENCH dataset had shorter notes on average, at 492 tokens versus 683 tokens for the CONSULT dataset. Except for the OBJECTIVE_RESULTS division, every division was longer in the CONSULT data (Table 4). The ACI-BENCH dataset also exhibits shorter dialogue lengths, by approximately 100 tokens and 20 sentences, as well as shorter notes by approximately 100 tokens (Table 5). One reason for the shorter note length is our removal of unsupported note text. We additionally annotated alignments of data between the source and target on the validation set (20 encounters) and the CONSULT set, similar to previous work 17 . This annotation marks associations between note sentences and their corresponding source transcript sentences. Unmarked note sentences indicate that a sentence may be purely structural (e.g. a section header) or may include unsupported content. Likewise, unmarked transcript sentences may indicate that the content is superfluous. Comparing the portions of annotated alignments in separate corpora gives indications of corpora similarity with respect to relative content transfer. Other useful metrics which provide measures of alignment/generation difficulty include: (a) the fraction of alignment crossings (whether content appears monotonically versus "out-of-order"/"crossing") 18 , (b) the similarity of corresponding text segments, and (c) the percentage of transcript speech modes. The results of these comparisons are shown in Table 5.
Labeled alignment annotations show that roughly comparable fractions of dialogue and note sentences were labeled (0.34 and 0.49 of transcript sentences, 0.84 and 0.95 of note sentences, for the CONSULT and ACI-BENCH corpora respectively), with the high 0.95 fraction for the ACI-BENCH corpus by design, due to the removal of unsupported text. With shorter transcripts (1203 tokens in ACI-BENCH vs 1505 tokens in the CONSULT set), the ACI-BENCH corpus also had 15% more aligned transcript sentences. The text similarity (Jaccard unigram) of alignments was similar (0.15 and 0.12), as was the fraction of crossing annotations (0.67 and 0.95) for the CONSULT and ACI-BENCH corpora respectively, though the dialogue-note document similarity was higher in the ACI-BENCH corpus.
The percentages of note sentences annotated with different labels (Table 5) are lower across the board in the CONSULT data. This is expected: as the transcript length increases, the percentage of note sentences annotated with any given label decreases. However, it is interesting that the ACI-BENCH corpus had a higher percentage of note sentences coming from question-answer paired transcript sentences and conversational statements rather than dictation/statement2scribe. For example, while in the CONSULT dataset important QA makes up twice as many transcript sentences as dictation (15% vs 8%), in the ACI-BENCH dataset there are ten times more QA-labeled sentences than dictation (43% vs 4%). Meanwhile, in the CONSULT dataset, transcript sentences identified with an alignment using the "statement" tag were about three times as frequent as dictation, whereas in the ACI-BENCH corpus the ratio was about seven times. Together, this data suggests that the ACI-BENCH corpus may be slightly less challenging in terms of document lengths and has a skew towards question-answer and statement information content, though the magnitudes in lengths and similarity are comparable.

Baseline experiments
In this section, we present our baseline experiments designed to benchmark the ACI-BENCH corpus. These experiments encompass various note-generation tasks and incorporate state-of-the-art note-generation techniques. In addition to a range of note-generation techniques, we also examine the impact of different clinical doctor-patient dialogue transcript generation methods, with and without human correction, on the quality of the automatically generated clinical notes derived from these transcripts.

Note generation models
The note-generation models used to benchmark the ACI-BENCH corpus are listed below:

Transcript-copy-and-paste Previous research finds that taking the longest sentence as the dialogue summary is a good baseline 19 .
In the spirit of this approach, we adopt several variations to generate the clinical note: (1) the longest speaker turn, (2) the longest doctor turn, (3) the first two and the last ten speaker turns, (4) the first two and the last ten doctor turns, and (5) the entire transcript.
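A minimal sketch of the turn-selection heuristics (a hypothetical helper; representing the transcript as (speaker, text) pairs is an assumption for illustration, not the corpus format):

```python
def copy_paste_baseline(turns, n_head=2, n_tail=10, speaker=None):
    """Copy-and-paste baseline: keep the first n_head and last n_tail
    turns, optionally restricted to one speaker's turns first.
    turns: list of (speaker, text) pairs."""
    if speaker is not None:
        turns = [t for t in turns if t[0] == speaker]
    if len(turns) > n_head + n_tail:
        turns = turns[:n_head] + turns[-n_tail:]
    return "\n".join(text for _, text in turns)

# Variation (4): first two and last ten doctor turns.
transcript = [("[doctor]", "How are you?"), ("[patient]", "Fine, thanks.")]
note = copy_paste_baseline(transcript, speaker="[doctor]")
```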
Retrieval-based Borrowing from retrieval-based response generation 20 , we pose a simple baseline that retrieves a relevant note from the training corpus rather than generating new text. To generate a clinical note for a new transcript, we use UMLS concept set similarity to retrieve the most similar transcript from the train set. The note corresponding to that transcript in the training set is selected as the summary for the new transcript, under the assumption that the semantic overlap between the UMLS concepts of the two transcripts is a reliable indicator of their content similarity. In the same manner, we adopt a similar retrieval-based method using document embedding similarity from the spaCy English natural language processing pipeline (https://spacy.io/).
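The concept-based variant can be sketched as follows. The text does not specify the exact set-similarity function, so Jaccard overlap is assumed here for illustration, and the concept sets are given directly rather than extracted from UMLS:

```python
def jaccard(a, b):
    """Jaccard similarity between two concept sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_note(new_concepts, train_examples):
    """Return the training note whose transcript concept set is most
    similar to the new transcript's concept set.
    train_examples: list of (concept_set, note) pairs."""
    _, best_note = max(train_examples, key=lambda ex: jaccard(new_concepts, ex[0]))
    return best_note

train = [
    ({"hypertension", "lisinopril"}, "note A ..."),
    ({"knee pain", "mri"}, "note B ..."),
]
print(retrieve_note({"knee pain", "x-ray"}, train))  # note B ...
```

The embedding variant replaces the concept sets with spaCy document vectors and Jaccard with cosine similarity, but the retrieval loop is the same.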

BART-based
We employ the state-of-the-art transformer model, the bidirectional autoregressive transformer (BART) 21 . We also include two of its variants: (1) a version with continued pre-training on PubMed abstracts 22 , aimed at learning domain-specific language and knowledge, and (2) a version fine-tuned on the SAMSum corpus 23 , designed to enhance the model's performance on conversational summarization tasks. For all BART-based models, we use the BART-Large version. It is important to note that although BART and BioBART have the same model structure, they possess distinct tokenizers and vocabulary sizes. These differences play a significant role in determining their respective performance on the ACI-BENCH corpus. The corresponding fine-tuning parameters can be found in the Appendix. All BART-based models share the same input limit of 1,024 tokens.

LED-based
We leverage the Longformer-Encoder-Decoder (LED) architecture 24 , which incorporates an attention mechanism that scales to longer sequences. LED-based models share an input limit of 16K tokens. Because the transcripts are long, LED overcomes the input length limit of BART. We also include a variant fine-tuned on the PubMed dataset 25 to enhance the model's summarization ability in the biomedical context. The corresponding fine-tuning parameters can be found in the Appendix.

OpenAI models
We experimented with the latest OpenAI models and APIs: (i) Text-davinci-002, (ii) Text-davinci-003, (iii) ChatGPT (gpt-3.5-turbo), and (iv) GPT-4. The first three models have the same limit of 4,097 tokens, shared between the prompt and the output/summary, whereas GPT-4 allows 32k tokens. We used the following prompt:

• Prompt: "summarize the conversation to generate a clinical note with four sections: HISTORY OF PRESENT ILLNESS, PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. The conversation is:"

To allow adequate division detection, we added light rule-based post-processing, adding endlines before and after each section header. This post-processing is described in Appendix Table 13.
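The exact post-processing rules are given in Appendix Table 13; a plausible sketch of the idea (header list taken from the prompt above; the regex details are assumptions) is:

```python
import re

SECTION_HEADERS = [
    "HISTORY OF PRESENT ILLNESS",
    "PHYSICAL EXAM",
    "RESULTS",
    "ASSESSMENT AND PLAN",
]

def add_header_endlines(note: str) -> str:
    """Surround each expected section header with endlines so that a
    simple rule-based division detector can locate it."""
    for header in SECTION_HEADERS:
        note = re.sub(rf"[ \t]*{re.escape(header)}[ \t]*", f"\n\n{header}\n", note)
    return note.strip()
```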

Full-note-vs division-based note-generation approaches
For the fine-tuned pre-trained models, we investigated note generation with two overall approaches: full note generation versus division-based generation followed by concatenation. The first approach generates a complete note from the transcript using a single model. The latter approach is motivated by the long input and output lengths of our data, which may exceed the lengths these pre-trained models are typically trained for. To this end, full notes were divided into the SUBJECTIVE, OBJECTIVE_EXAM, OBJECTIVE_RESULTS, and ASSESSMENT_AND_PLAN divisions using rule-based regular-expression section detection. As the notes followed a handful of regular patterns, this section detection was highly performant.
In cases where certain sections were missing, an EMPTY flag was used as the output.Each division generation model was separately fine-tuned.The final note was created by concatenating the divisions.
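A sketch of such rule-based division splitting, including the EMPTY flag for missing divisions (the header patterns here are hypothetical stand-ins; the real detector uses a handful of note-specific regular expressions):

```python
import re

DIVISION_HEADERS = {
    # Hypothetical header patterns for illustration only.
    "SUBJECTIVE": r"HISTORY OF PRESENT ILLNESS",
    "OBJECTIVE_EXAM": r"PHYSICAL EXAM",
    "OBJECTIVE_RESULTS": r"RESULTS",
    "ASSESSMENT_AND_PLAN": r"ASSESSMENT AND PLAN",
}

def split_note(note: str, header_patterns=DIVISION_HEADERS) -> dict:
    """Split a full note into contiguous divisions by locating each
    division's header; divisions with no header get the EMPTY flag."""
    hits = []
    for division, pattern in header_patterns.items():
        match = re.search(pattern, note)
        if match:
            hits.append((match.start(), match.end(), division))
    hits.sort()  # divisions are contiguous and ordered in the note
    divisions = {division: "EMPTY" for division in header_patterns}
    for i, (start, end, division) in enumerate(hits):
        stop = hits[i + 1][0] if i + 1 < len(hits) else len(note)
        divisions[division] = note[end:stop].strip()
    return divisions
```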

Automatic Evaluation Metrics
We employ a variety of widely-used automatic evaluation metrics to evaluate performance from different perspectives. Specifically, we measure at least one lexical n-gram metric, an embedding-based similarity metric, a learned metric, and an information extraction metric. We evaluate note generation performance both on the full note and on each division.
For the n-gram-based lexical metric, we compute ROUGE 26 (1/2/-L), which measures unigram, bigram, and longest-common-subsequence matches between reference and candidate clinical notes. For an embedding-based metric, we applied BERTScore 27 , which greedily matches contextual token embeddings by pairwise cosine similarity; BERTScore efficiently captures synonym and context information. For a model-based learned metric, we used BLEURT 28 , which is trained for scoring candidate-reference similarity. Additionally, we incorporate a medical concept-based evaluation metric (MEDCON) to gauge the accuracy and consistency of clinical concepts. This metric calculates the F1-score between the Unified Medical Language System (UMLS) concept sets of the candidate and reference clinical notes. The extraction of UMLS concepts from clinical notes is performed using a string match algorithm applied to the UMLS concept database through the QuickUMLS package 30 . To ensure clinical relevance, we restrict the MEDCON metric to specific UMLS semantic groups: Anatomy, Chemicals & Drugs, Device, Disorders, Genes & Molecular Sequences, Phenomena, and Physiology. To consolidate the various evaluation metrics, we first take the average of the three ROUGE submetrics as ROUGE, and then the average of the ROUGE, BERTScore, BLEURT, and MEDCON scores as the final evaluation score. Because BERTScore and BLEURT are limited by their pre-trained embedding length, we only use these evaluations for the division-based evaluation.
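The score aggregation described above can be sketched as follows (the QuickUMLS concept extraction step is omitted; the F1 form follows the description in the text):

```python
def medcon_f1(reference_concepts: set, candidate_concepts: set) -> float:
    """MEDCON: F1 between the UMLS concept sets extracted from the
    reference and candidate notes (concept extraction not shown)."""
    if not reference_concepts or not candidate_concepts:
        return 0.0
    true_positives = len(reference_concepts & candidate_concepts)
    precision = true_positives / len(candidate_concepts)
    recall = true_positives / len(reference_concepts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def final_score(rouge1, rouge2, rougeL, bertscore, bleurt, medcon):
    """Average the three ROUGE submetrics first, then average the
    resulting ROUGE with BERTScore, BLEURT, and MEDCON."""
    rouge = (rouge1 + rouge2 + rougeL) / 3
    return (rouge + bertscore + bleurt + medcon) / 4
```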

Results
We fine-tune the models on the train set and select the best trained model based on evaluation on the validation set. Performances were evaluated on three test sets. Test sets 1 and 2 correspond to the test sets from ACL ClinicalNLP MEDIQA-Chat 2023 TaskB full-note generation and TaskC dialogue generation, respectively. Test 3 corresponds to CLEF MEDIQA-SUM 2023 Subtask C full-note generation.
Our test 1 full-note evaluation results can be found in Table 6. Per-division SUBJECTIVE, OBJECTIVE_EXAM, OBJECTIVE_RESULTS, and ASSESSMENT_AND_PLAN results for test 1 appear in Tables 7, 8, 9, and 10. In the main body of this paper, we discuss the results of test 1, which was used for our first full-note generation task challenge. We first provide an overview of model performance in both full-note and division-based evaluations, and then describe each model type's performance. For reference, we provide the results of tests 2 and 3 in the Appendix.
In the full-note evaluation, the BART+FT SAMSum (Division) model achieved the highest ROUGE scores, with 53.46 ROUGE-1, 25.08 ROUGE-2, and 48.62 ROUGE-L. When the BART+FT SAMSum (Division) model was fine-tuned on our ACI-BENCH training set, it learned the specific clinical conventions of the ACI-BENCH corpus, such as accurate subsection headers ("CHIEF COMPLAINT", "HISTORY OF PRESENT ILLNESS", ...) and physical examination phrasings ("-Monitoring of the heart: No murmurs, gallops ..."). By contrast, GPT-4 demonstrated the highest MEDCON evaluation score of 57.78, while achieving the second- to third-best ROUGE scores, with 51.76 ROUGE-1, 22.58 ROUGE-2, and 45.97 ROUGE-L. This strong performance can be attributed to the model's size, intensive pretraining, large context size, and versatility. GPT-4 captured many relevant clinical facts and thus had the highest MEDCON; however, since it was not specifically fine-tuned for the ACI-BENCH clinical note format, it exhibited slightly inferior performance in reproducing the structured ACI-BENCH notes. An example of a note generated by different models can be found in Appendix Table 14. Interestingly, the retrieval-based baselines showed very competitive ROUGE performance out-of-the-box, with ROUGE-L of 40.47 F1 and 38.20 F1 for the UMLS and sentence-embedding versions respectively. Furthermore, the simple transcript copy-and-paste baselines produced high starting points that out-performed the LED-based models. For example, simply copying the transcript achieved a 40.47 F1 ROUGE-L and 33.30 F1 MEDCON score, whereas the fine-tuned division-based LED model achieved 29.30 F1 and 32.67 F1.

Table 6. Results of the summarization models evaluated at the full note level, test set 1.
Simple retrieval-based methods provided strong baselines with better out-of-the-box performance than LED models and full-note BART models. In general, for BART and LED fine-tuned models, division-based generation worked better. OpenAI models with simple prompts were shown to give competitive outputs despite no additional fine-tuning or dynamic prompting.
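The ROUGE scores discussed above reward n-gram overlap between generated and reference notes. As a minimal illustration (a sketch only, not necessarily the exact tokenization used in our evaluation), ROUGE-1 F1 can be computed from clipped unigram counts:

```python
from collections import Counter

def rouge1_f1(reference: str, hypothesis: str) -> float:
    """Toy ROUGE-1 F1: clipped unigram overlap, no stemming or stopwording."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Clip each hypothesis unigram count by its count in the reference.
    overlap = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the patient reports left knee pain",
                      "patient reports knee pain"), 3))  # → 0.8
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 pattern over bigrams and longest common subsequences, respectively.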

In division-based evaluations, we found that different models achieved the highest average score in different note divisions: BART+FT SAMSum (Division) scored 51.08 in the SUBJECTIVE division (Table 7), while Text-davinci-003 reached 55.30, 48.90, and 46.19 in the OBJECTIVE_EXAM (Table 8), OBJECTIVE_RESULTS (Table 9), and ASSESSMENT_AND_PLAN (Table 10) divisions, respectively. These results indicate that both models are good candidates for the note-generation task. However, since BART+FT SAMSum (Division) required fine-tuning and Text-davinci-003 did not, the latter demonstrated greater potential. A few additional examples could potentially enhance Text-davinci-003's performance further by enabling the model to learn the specific clinical jargon of each division.
Comparing the full-note and division-based note-generation approaches, our experiments demonstrated that, for our pretrained BART- and LED-based models, division-based note generation resulted in significant improvements over full-note generation. These improvements ranged from 1 to 14 points in both ROUGE and MEDCON under the full-note evaluation. This finding implies that breaking a complex summarization problem into smaller divisions captures more critical information. Under division-based evaluation, the increase is not obvious for the SUBJECTIVE division, but is around 20 percent in the average score for the OBJECTIVE_EXAM, OBJECTIVE_RESULTS, and ASSESSMENT_AND_PLAN divisions. This can be attributed to the fact that these three divisions appear at the end of clinical notes, which often exceed the word length of the typical summarization tasks that BART-based and LED-based models are used for. Additionally, since some notes in the training set lack these divisions, the full-note generation models struggle to learn the division structure during fine-tuning. Because the divisions of a clinical note are identified by a rule-based division header extraction method, even when the information of a specific division is generated as a few sentences, that division cannot be detected by the evaluation program unless its header is produced. BART and LED full-note generation models suffered a significant drop in the OBJECTIVE_RESULTS division. This may be attributable to the higher sparsity of this division, its low amount of content (sometimes only 2-3 sentences), and the appearance of its text later in the sequence. The OpenAI models were in general the best performers, with BART division-based models next best.
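The rule-based division header extraction mentioned above can be sketched as follows; the header-to-division mapping here is an illustrative assumption, as the actual header inventory handled by the benchmark code is larger:

```python
import re

# Illustrative mapping from note headers to the three non-SUBJECTIVE
# divisions; everything before the first known header is SUBJECTIVE.
DIVISION_HEADERS = {
    "PHYSICAL EXAM": "OBJECTIVE_EXAM",
    "RESULTS": "OBJECTIVE_RESULTS",
    "ASSESSMENT AND PLAN": "ASSESSMENT_AND_PLAN",
}

def split_into_divisions(note: str) -> dict:
    """Partition a note into contiguous divisions by line-level headers."""
    pattern = "|".join(re.escape(h) for h in DIVISION_HEADERS)
    # Capturing group keeps each matched header in the split output.
    parts = re.split(f"^({pattern})$", note, flags=re.MULTILINE)
    divisions = {"SUBJECTIVE": parts[0].strip()}
    for header, body in zip(parts[1::2], parts[2::2]):
        divisions[DIVISION_HEADERS[header]] = body.strip()
    return divisions

note = """CHIEF COMPLAINT
Left knee pain.
PHYSICAL EXAM
No effusion.
ASSESSMENT AND PLAN
Conservative management."""
print(split_into_divisions(note)["OBJECTIVE_EXAM"])  # → No effusion.
```

A generated note that contains a division's sentences but omits its header would yield an empty entry here, which is exactly the evaluation failure mode described above.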
Table 10. Results of the summarization models on the ASSESSMENT_AND_PLAN division, test set 1. Similar to OBJECTIVE_EXAM and OBJECTIVE_RESULTS, BART and LED full-note generation models suffered a significant drop in this division. This may be attributable to the appearance of its text later in the sequence. The OpenAI models were in general the best performers, with BART division-based models next best.

Our observations on the performance of each type of model are summarized below. Transcript copy-and-paste models were only evaluated at the full-note level. They demonstrated suboptimal performance, around 17 points below the best ROUGE scores. This is primarily because transcripts of doctor-patient dialogues serve to facilitate doctor-patient interactions through questions, answers, and explanations related to various health phenomena. In contrast, clinical notes, which are created by and intended for healthcare professionals, generally follow the SOAP format to convey information concisely and accurately. Transcripts and notes can therefore differ significantly in terminology, degree of formality, relevance to clinical issues, and organization of clinical concepts. On the other hand, the original transcript achieved the fourth-highest MEDCON score at 55.65, owing to its explicit mention of the relevant UMLS concepts.
Retrieval-based models had the best BERTScore in the OBJECTIVE_EXAM, OBJECTIVE_RESULTS, and ASSESSMENT_AND_PLAN divisions in test 1, with around a 1 to 5 point increase over the best BART-based and OpenAI models. They also showed promising results on test 3, ranking second and first in average score for the OBJECTIVE_RESULTS and ASSESSMENT_AND_PLAN divisions. This is because similar transcripts tend to have similar clinical notes, especially when OBJECTIVE_RESULTS sections use standard phrasing and templates, or when patients with different medical problems share common symptoms and health examinations. However, their performance on the MEDCON metric is often poor because the retrieved patient-specific medical conditions are less accurate. As a result, these models may perform well on surface-level evaluation metrics while failing to capture the correct clinical facts that MEDCON measures.
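A retrieval baseline of this kind can be sketched as below, assuming a hypothetical in-memory training set of (transcript, note) pairs. For brevity this sketch matches raw tokens with count-vector cosine similarity; the UMLS variant in the paper matches on extracted concepts instead:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_note(query_transcript: str, train_pairs) -> str:
    """Return the note paired with the most similar training transcript."""
    q = Counter(query_transcript.lower().split())
    best = max(train_pairs,
               key=lambda pair: cosine(q, Counter(pair[0].lower().split())))
    return best[1]

# Hypothetical miniature training set of (transcript, note) pairs.
train = [
    ("doctor: how is your knee pain today", "CHIEF COMPLAINT\nKnee pain."),
    ("doctor: any chest pain or palpitations", "CHIEF COMPLAINT\nChest pain."),
]
print(retrieve_note("patient: my knee pain is worse", train))
```

Because the output is copied verbatim from a different patient's note, surface metrics can score well while patient-specific facts are wrong, matching the MEDCON weakness described above.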
BART-based models demonstrated superior performance. In the full-note evaluation, BART+FT SAMSum (Division) had the best ROUGE scores, with MEDCON scores second only to the OpenAI models. In the SUBJECTIVE division, BART+FT SAMSum (Division) had the top performance in all scores except BLEURT. These findings suggest that a model fine-tuned on a similar dataset serves as a solid foundation for summarization tasks. Meanwhile, BioBART exhibited comparatively weaker performance than BART, which could be attributed to its choice of vocabulary and tokenizer and, consequently, the quality of its contextual embeddings. For BART-based models, division-based note generation improved on full-note generation by around 5 to 40 points in all division-based average scores. This implies that dividing the complex note-generation task into simpler subtasks can boost model performance.
LED-based models were generally inferior to BART-based models, with around 15 to 40 point lower full-note ROUGE and MEDCON scores. We observed that, compared with the BART-based models, the LED-based models generated notes with worse fluency, less essential clinical information, and poorer division structure. On the other hand, the effect of the division method on LED-based models was similar to that on BART-based models, leading to a 1 to 9 point increase in full-note ROUGE and MEDCON scores and a 2 to 25 point increase in division-based average scores.
OpenAI models exhibited good general performance using a generic prompt, without fine-tuning. GPT-4 outperformed the other OpenAI models by around 10 ROUGE-1 F1 points in the full-note evaluation. This is consistent with GPT-4 having been trained with more parameters and having shown impressive performance across a variety of human tasks31,32. While Text-davinci-003 and ChatGPT were within 4 ROUGE-1 points of GPT-4 in test 1, there were larger 4-9 point gaps in tests 2 and 3, respectively. Combined with GPT-4's relatively stable ROUGE-1 score (around ∼50), this suggests that the earlier models had less stable performance. In the division-based evaluations, the relative ranking of the OpenAI models was more variable (with the exception of Text-davinci-002, which consistently performed below the other models).

Effect of ASR vs human transcription and correction
In practice, automatic speech recognition (ASR) is widely deployed because it provides an affordable, real-time text transcript. However, ASR quality is usually worse than human transcription, and depends on the model type, hardware, and training corpus. To study the effect of ASR versus human transcription on clinical note generation from dialogue, we evaluated note-generation model performance on transcripts produced by the two approaches. We compared human transcription versus ASR on the virtscribe subset of the data, and ASR versus ASR-corrected on the aci subset. We conducted this ablation study with one of the best models from the previous section, BART+FT SAMSum (Division), and compared the results on the splits of the three test sets.
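The transcription quality gap mentioned above is commonly quantified with word error rate (WER). The paper does not report WER figures; this sketch is only to make the ASR-versus-human comparison concrete:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One substituted word out of five: WER = 0.2.
print(wer("it is the left knee", "it is the left me"))  # → 0.2
```

Even a modest WER can matter clinically when the substituted word is a drug name or laterality, which motivates the ablation that follows.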
To study the difference between human transcription and ASR on the virtscribe subset, we experimented with feeding in the raw ASR transcripts instead of the original human transcription. We also fine-tuned the model further to adapt to the ASR version, training for 3 additional epochs with the same parameters on the ASR version of the virtscribe train set.
To understand the effects of train/decode discrepancies, we evaluated both the original human-transcription source and the ASR version as input to both the original fine-tuned model and the further ASR-fine-tuned model. The results of the virtscribe experiments are presented in Table 11. We observed that the model setup with the best ROUGE-1/2/L and MEDCON scores differed across test sets. Nevertheless, BART+FT SAMSum (Division) with ASR transcripts of the virtscribe dialogues did not exhibit large differences from human transcription (on test 1, performance dropped to 41.74 F1 ROUGE-L with ASR input, from 43.98 ROUGE-L with the original human-transcript input). Further fine-tuning BART+FT SAMSum (Division) on ASR transcripts in the train set also did not greatly improve performance (fine-tuning improved ASR-input results by 2 points to 43.82 F1, with a minimal drop to 43.59 on the original human-transcription input). This indicates that the choice between ASR and human transcription does not have a remarkable impact on note-generation performance for virtscribe. To study the effect of ASR versus ASR-corrected, we conducted similar experiments on the aci subset by substituting the corrected versions for the original ASR transcripts. The results are shown in Table 12; again, the model setup with the best ROUGE-1/2/L and MEDCON scores differed across test sets. The ASR-corrected transcripts did not yield notable improvements over the original ASR in BART+FT SAMSum (Division)'s note-generation performance (approximately 1 F1 point difference across all test sets, evaluation versions, and metrics). Further fine-tuning BART+FT SAMSum (Division) on ASRcorr transcripts in the train set also did not substantially change performance. This indicates that the ASR errors corrected by humans do not have a remarkable impact on note-generation performance.
In summary, our investigation of ASR versus human transcription shows that, although ASR introduces errors into the transcript, those errors do not have a remarkable impact on note-generation performance and are thus tolerable under our current model setting. However, this could be because our automatic evaluation metrics weight n-grams and clinical facts uniformly. In clinical practice, particular medical fact errors introduced by ASR can have a non-trivial impact.
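The uniform weighting mentioned above can be made concrete: a MEDCON-style score is an F1 over extracted clinical concepts, with every concept weighted equally regardless of clinical importance. This sketch assumes the concepts have already been extracted (the real metric relies on UMLS concept recognition):

```python
def concept_f1(ref_concepts: set, hyp_concepts: set) -> float:
    """F1 over concept sets; a missed drug allergy weighs the same as a
    missed mention of mild fatigue -- hence the caveat in the text."""
    overlap = len(ref_concepts & hyp_concepts)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_concepts)
    recall = overlap / len(ref_concepts)
    return 2 * precision * recall / (precision + recall)

ref = {"left knee pain", "instability", "metformin"}
hyp = {"left knee pain", "instability"}
print(round(concept_f1(ref, hyp), 2))  # → 0.8
```

Here the hypothesis note misses "metformin" yet still scores 0.8, illustrating why an automatic score can understate the clinical cost of a single omitted fact.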

Usage Notes
We have provided instructions in the README file in the Figshare repository describing how to process the ACI-BENCH dataset. Examples of processing the data for different summarization evaluations can be found in the code located at the GitHub repository provided below.

Limitations
There are several limitations to this work. The data here is small and was produced synthetically by medical annotators or patient actors at a single institution. Therefore, this dataset may not cover, in a statistically representative way, all health topics, speech variations, and note format variations present in the real world.
The data here is intended to be used for benchmarking methods related to clinician-patient dialogue summarization. It should not be used for training models to make medical diagnoses.

ChatGPT
    for division in ['PHYSICAL EXAM', 'RESULTS', 'ASSESSMENT AND PLAN']:
        text = text.replace('%s' % division, '\n%s\n' % division)
        text = text.replace('<%s>' % division, '\n%s\n' % division)
        text = text.replace('%s:' % division, '\n%s\n' % division)
        text = text.replace('# %s:' % division, '\n%s\n' % division)
        text = text.replace('# %s' % division, '\n%s\n' % division)
        text = text.replace('## %s' % division, '\n%s\n' % division)
        text = text.replace('** %s **' % division, '\n%s\n' % division)

Text-Davinci-002, Text-Davinci-003
    text.replace('\n', ' ').strip()
        .replace('PHYSICAL EXAM:', '\nPHYSICAL EXAM:\n')
        .replace('RESULTS:', '\nRESULTS:\n')
        .replace('ASSESSMENT AND PLAN:', '\nASSESSMENT AND PLAN:\n')

GPT-4
    text = text.replace('\n', ' ').strip()
    if (text.startswith("Possible summary:")
            or text.startswith("Possible clinical note:")
            or text.startswith("A possible clinical note is:")):
        text = text[text.index(":") + 1:]
    text.replace('PHYSICAL EXAM:', '\nPHYSICAL EXAM:\n')
        .replace('RESULTS:', '\nRESULTS:\n')
        .replace('ASSESSMENT AND PLAN:', '\nASSESSMENT AND PLAN:\n')

Table 13. OpenAI post-processing rules. To ensure the rule-based section algorithm can correctly split notes into divisions, we added several simple post-processing rules tailored to the algorithm.

Sample model output
We investigated example outputs from different models for generating notes from transcript D2N080 in the validation set. As demonstrated in Table 14, both BART+FT SAMSum (Division) and GPT-4 excelled at condensing dialogue information into a coherent clinical note. However, among all the models, only GPT-4 properly identified the patient's correction of the doctor's mistake in the transcript, from "right knee pain" to "left knee pain". Meanwhile, BART+FT SAMSum (Division) missed crucial pain-related information and instead focused on less important details about the patient's travel.

Baseline hyper-parameters
The fine-tuning and note generation hyper-parameters for the BART- and LED-based baseline models can be found in Table 15.
Note that the max target token length is smaller than the total length of clinical notes. Because the BART- and LED-based baseline models were not pretrained with sequences as long as clinical notes, a longer max target token length did not produce good generation results in our experiments.

Figure 1.
Note division example. The same content in a clinical note can appear under different sections. As an example, the "past medical history" contents of the left note are written in the "history" portion of the note on the right. To separate the full note target into smaller texts and minimize the data sparsity problems that would arise from modeling individual sections, notes are partitioned into contiguous SUBJECTIVE, OBJECTIVE_EXAM, OBJECTIVE_RESULTS, and ASSESSMENT_AND_PLAN divisions. This also allows evaluation and generation at a higher granularity than the full note level.

Table 2.
Examples of unsupported text (demarcated by square brackets).

Table 4.
Data statistics comparing notes from the aci-validation set with a sample of real doctor-patient notes.

Table 5.
Alignment statistics comparing the aci-validation set with a sample of real doctor-patient notes.

Table 7.
Results of the summarization models on the SUBJECTIVE division, test set 1. BART-based models generated at both the full-note and division levels had similar performance, which was in general better than the other model classes. As in the full-note evaluation, retrieval-based methods provided competitive baselines.

Table 8.
Results of the summarization models on the OBJECTIVE_EXAM division, test set 1. BART and LED full-note generation models suffered a significant drop in OBJECTIVE_EXAM. This may be attributable to the lower amount of content to be generated, the appearance of the text later in the sequence, and the higher variety of structures. The OpenAI models were in general the best performers, with BART division-based models next best.

Table 9.
Results of the summarization models on the OBJECTIVE_RESULTS division, test set 1. Similar to OBJECTIVE_EXAM, BART and LED full-note generation models suffered a significant drop in this division. This may be attributable to the higher sparsity of this division, its low amount of content (sometimes only 2-3 sentences), and the appearance of its text later in the sequence. The OpenAI models were in general the best performers, with BART division-based models next best.

Table 11.
Model performance on different test set splits, comparing virtscribe dialogues with ASR versus human transcripts. The model fine-tuned on the train set is BART+FT SAMSum (Division) fine-tuned for 10 epochs on the original train set, as in the baseline methods. The train + train ASR model refers to BART+FT SAMSum (Division) fine-tuned for 3 more epochs on the virtscribe-with-ASR split of the train set.

Table 12.
Model performance on different test set splits, comparing aci dialogues with ASR versus ASRcorr transcripts. The model fine-tuned on the train set is BART+FT SAMSum (Division) fine-tuned for 10 epochs on the original train set, as in the baseline methods. The train + train ASRcorr model refers to BART+FT SAMSum (Division) fine-tuned for 3 more epochs on the aci-with-ASRcorr split of the train set.

No patient data was used or disclosed here. Names of the original actors were changed. The gender balance of the entire dataset is roughly equal. Other demographic information was not modeled in this dataset.

Code availability
All code used to run data statistics, baseline models, and evaluation to analyze the ACI-BENCH corpus is freely available at [LINK TO BE UPDATED].

References
31. Bubeck, S. et al. Sparks of artificial general intelligence: Early experiments with GPT-4. Preprint (2023).
32. OpenAI. GPT-4 technical report. arXiv, 10.48550/ARXIV.2303.08774 (2023).

Author contributions
WY developed and created the annotation guidelines, supervised the annotation work, advised on the baseline experiments, performed the corpus data analysis, and drafted the original manuscript. YF performed the baseline experiments, analyzed model performance, and contributed to manuscript authorship. AB advised on guideline creation and annotation work, ran baselines, and reviewed and revised the manuscript. NS participated in acquisition of the source data, advised on guideline creation and annotation work, and reviewed the manuscript. TL advised on the baseline experiments and reviewed and revised the manuscript. MY advised on the guideline creation, annotation work, and baseline experiments, and reviewed and revised the manuscript.
Transcript (excerpt): [doctor] i <understand> you're you've come in with some <right knee pain> can you tell me about it what's going on [patient] it it's not the <right knee> it's the <left knee> [doctor] okay the <left knee> ...
train UMLS: <CHIEF COMPLAINT> Annual exam. <HISTORY OF PRESENT ILLNESS> Martha Collins is a 50-year-old <female> with a <past medical history> significant for <congestive heart failure>, <depression>, and <hypertension> who presents for her annual exam. It has been a year since I last <saw> the patient...
BART+FT SAMSum: <CHIEF COMPLAINT> <Right knee pain>. <HISTORY OF PRESENT ILLNESS> The patient reports she has <right knee pain>, which she <experiences> occasionally, less than once a day, when she is <walking>. She does not know the cause of the <pain>. She denies any specific <injury> to her <knee>...
BART+FT SAMSum (Division): <CHIEF COMPLAINT> <Right knee pain>. <SOCIAL HISTORY> <Patient> reports she <likes> to travel. She has a <trip> planned for next month in which she will <probably> be doing a <lot> of <walking>. She is also <planning> to attend a lion's club <function> in which there are line dances. She does not <feel> her <knee> is in good <condition> for this type of activity. She denies <hearing> or <feeling> a pop at any point...
Knee <pain> is a common <complaint> in the <emergency> room today. The patient reports that it occurs less frequently than previously thought and she does not know if her <symptoms> are caused by <walking> on <unstable> surfaces or due to an underlying <inflammatory disease> such as <diabetes mellitus>, <hypercholesterolemia> (hcm)...
HISTORY OF <PRESENT> ILLNESS Andrea <Diaz> is a pleasant 29-year old <female> who presents to the clinic today for evaluation of <right knee pain>. The onset and severity are <unknown> but they usually resolve themselves quickly on their own without treatment or therapy...
<HISTORY OF PRESENT ILLNESS>: The patient is a 65-year-old <female> who presents with intermittent <left knee pain> that occurs less than once a day when <walking>. She describes it as a <feeling> of the <knee> giving out and the kneecap fading. She denies any <trauma>, pop, or <swelling>. She has not taken any <analgesics>. She is concerned about her upcoming <trip> that involves <walking> and line <dancing>...

Table 14.
Example outputs from different models generating notes from transcript D2N080 in the validation set (reformatted). The UMLS concepts detected in fact-based evaluation are enclosed in angle brackets.

Table 15.
The fine-tuning and note generation hyper-parameters for the BART- and LED-based baseline models.