Medical image captioning via generative pretrained transformers

The proposed model for automatic clinical image caption generation combines the analysis of radiological scans with structured patient information from textual records. It uses two language models, Show-Attend-Tell and GPT-3, to generate comprehensive and descriptive radiology records. The generated textual summary contains essential information about the pathologies found and their locations, along with 2D heatmaps that localize each pathology on the scans. The model has been tested on two medical datasets, Open-I and MIMIC-CXR, as well as on the general-purpose MS-COCO dataset; the results, measured with natural language assessment metrics, demonstrate its efficient applicability to chest X-ray image captioning.


Introduction
Medical imaging is indispensable in current diagnostic workflows. Out of the plethora of existing imaging modalities, X-ray remains one of the most widely used visualization methods in many hospitals around the world, because it is inexpensive and easily accessible 1. Analyzing and interpreting X-ray images is especially crucial for diagnosing and monitoring a wide range of lung diseases, including pneumonia 2, pneumothorax 3, and COVID-19 complications 4.
Today, the generation of free-text descriptions based on clinical radiography results has become a convenient tool in clinical practice 5. Having to study approximately 100 X-rays daily 5, radiologists are overloaded by the necessity to report their observations in writing, a tedious and time-consuming task that requires deep domain-specific knowledge. This manual annotation overload can lead to several problems, such as missed findings, inconsistent quantification, and delays in a patient's hospital stay, which increase the cost of treatment. Above all, the dependence of a correct diagnosis on the qualification of the individual radiologist should be stated as a major problem.
In the COVID-19 era, there is an even higher need for a robust image captioning framework 5. Thus, many healthcare systems outsource the medical image analysis task. Automatic generation of chest X-ray medical reports using deep learning can assist clinicians and accelerate the process of establishing a diagnosis. Providing automated support for this task has the potential to ease clinical workflows and improve both care quality and standardization. We propose to apply a model that performs well on non-medical data to the medical domain.

Medical background
Radiology is the medical discipline that uses medical imaging to diagnose and treat diseases. Today, radiology actively adopts new artificial intelligence approaches 6. There are three types of radiologists: diagnostic radiologists, interventional radiologists and radiation oncologists. They all use medical imaging procedures such as X-rays, computed tomography (CT), magnetic resonance imaging (MRI), nuclear medicine, positron emission tomography (PET) and ultrasound. Diagnostic radiologists interpret and report on images resulting from imaging procedures, diagnose the cause of a patient's symptoms, recommend treatment and order additional clinical tests. They specialize in different parts of the human body: breast imaging (mammograms), cardiovascular radiology (heart and circulatory system), chest radiology (heart and lungs), gastrointestinal radiology (stomach, intestines and abdomen), etc. Interventional radiologists use radiology images to perform clinical procedures with minimally invasive techniques. They are often involved in treating cancers or tumors, heart diseases, and strokes.

• Finally, we contribute to the deep learning community with two language models trained on the large MIMIC-CXR dataset.

The rest of the paper is organized as follows: section 2 describes the two language model architectures separately, section 3 provides a description of the proposed approach, section 4 describes the datasets used and the computing power utilized, subsections 5.1 and 5.2 present and compare the results, while section 6 presents the conclusions of the paper.

Show Attend and Tell
Show Attend and Tell (SAT) 11 is an attention-based image caption generation neural network. The attention-based technique yields well-interpretable results, which can be used by radiologists to verify their findings on an X-ray. Including the attention module gives the advantage of visualizing where exactly the model 'sees' a specific pathology. SAT consists of three blocks: an Encoder, an Attention module and a Decoder. It takes an image, encodes it, attends to each part of the image, and generates a caption z, an encoded sequence of L words from a vocabulary of size W:

z = {z_1, ..., z_L}, z_i ∈ R^W. (1)

Encoder
The Encoder is a convolutional neural network (CNN). It encodes an image and outputs a set of C vectors, each of which is a D-dimensional representation of the corresponding part of the image:

a = {a_1, ..., a_C}, a_i ∈ R^D. (2)

Here C represents the number of channels in the output of the encoder. It depends on the type of encoder used: 1024 for DenseNet-121 36, 512 for VGG-16 37, 2048 for InceptionV3 38 and ResNet-101 39. D is a configurable parameter representing the size of the encoded vectors. Features are extracted from the lower convolutional layer prior to the fully connected layers and are passed through an adaptive average pooling layer. This allows the decoder to selectively focus on certain parts of an image by selecting a subset of all the feature vectors.
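The adaptive pooling step can be illustrated with a minimal sketch. The actual pipeline uses PyTorch's AdaptiveAvgPool2d on CNN feature maps; the pure-Python function below is a simplified stand-in (with equal-split region boundaries, a simplification of the library's exact rule) that shows how one channel of an (H, W) feature map is reduced to a fixed (D, D) grid regardless of the input resolution:

```python
def adaptive_avg_pool2d(feature_map, out_size):
    """Pool one (H, W) channel of CNN features down to (out_size, out_size)
    by averaging over near-equal regions, so the decoder always receives
    the same number of region vectors."""
    H, W = len(feature_map), len(feature_map[0])
    # simplified equal-split region boundaries
    hs = [(i * H) // out_size for i in range(out_size + 1)]
    ws = [(j * W) // out_size for j in range(out_size + 1)]
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = [feature_map[r][c]
                    for r in range(hs[i], hs[i + 1])
                    for c in range(ws[j], ws[j + 1])]
            row.append(sum(vals) / len(vals))
        pooled.append(row)
    return pooled
```

Applied per channel, this yields the (C, D, D) tensor of encoded image parts described above.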

Decoder with attention module
The decoder is implemented as an LSTM neural network 40. It produces a caption by generating one word at every time step, conditioned on the attention (context) vector, the previous hidden state and the previously generated words. The LSTM can be represented by the following set of equations:

i_t = σ(T_i(E z_{t−1}, h_{t−1}, â_t)),
f_t = σ(T_f(E z_{t−1}, h_{t−1}, â_t)),
o_t = σ(T_o(E z_{t−1}, h_{t−1}, â_t)),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(T_c(E z_{t−1}, h_{t−1}, â_t)),
h_t = o_t ⊙ tanh(c_t). (3-5)

Vectors i_t, f_t, c_t, o_t, h_t represent the input/update gate activation vector, the forget gate activation vector, the memory (cell) state vector, the output gate activation vector and the hidden state of the LSTM, respectively. T_{s,t} is an affine transformation R^s → R^t with non-zero bias. m denotes the embedding dimension, while n denotes the LSTM dimension. σ and ⊙ stand for the sigmoid activation function and element-wise multiplication, respectively. E ∈ R^{m×W} is an embedding matrix. The vector â_t ∈ R^D holds the visual information from a particular input location of the image at time t; thus, â_t is called the context vector.
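The gate equations above can be made concrete with a scalar toy version of one LSTM time step. This is only an illustration of the update rules, not the model's implementation; the weight dictionary W and the scalar states are hypothetical stand-ins for the affine transforms T:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step with scalar states. W maps each gate name to
    (w_x, w_h, b), the affine transform of the current input x and the
    previous hidden state h_prev."""
    affine = lambda g: W[g][0] * x + W[g][1] * h_prev + W[g][2]
    i = sigmoid(affine("i"))        # input/update gate
    f = sigmoid(affine("f"))        # forget gate
    o = sigmoid(affine("o"))        # output gate
    g = math.tanh(affine("g"))      # candidate cell state
    c = f * c_prev + i * g          # new memory (cell) state
    h = o * math.tanh(c)            # new hidden state
    return h, c
```

In SAT, x would be the concatenation of the embedded previous word E z_{t−1} and the context vector â_t, and all quantities are vectors rather than scalars.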
Attention is a function φ that computes the context vector â_t from the encoded vectors a_i (2) produced by the encoder. The attention module generates a positive number α_i for each location i in the image. This number can be interpreted as the relative importance of location i among the others. The attention module is realized as a multi-layer perceptron (MLP) with a softmax activation function, conditioned on the previous hidden state h_{t−1} (5) of the LSTM. The attention module is depicted in Figure 1.
The set of linear layers in the MLP is denoted as a function f_att. The weights α_ti are computed with the following equations:

e_ti = f_att(a_i, h_{t−1}), (6)
α_ti = exp(e_ti) / Σ_{k=1}^{C} exp(e_tk). (7)

The sum of the weights α_ti (7) equals 1: Σ_{i=1}^{C} α_ti = 1. The context vector â_t is computed by the attention function φ from the set of encoded vectors a (2) and their corresponding weights α_ti (7): â_t = φ({a_i}, {α_ti}). According to the original paper, the function φ can implement either 'soft' or 'hard' attention. Due to the specifics of the medical image captioning task, φ was chosen to be 'soft' attention, as it allows the model to focus more on some specific parts of the X-ray than on others and to detect pathologies and major organs such as the heart, lungs, etc. It is named 'deterministic soft attention' and realized as a weighted sum, hence the context vector can be computed as:

â_t = φ({a_i}, {α_ti}) = Σ_{i=1}^{C} α_ti a_i. (8)

The initial memory state and hidden state of the LSTM are initialized with two separate multi-layer perceptrons (init-c and init-h) applied to the mean of the encoded vectors a_i (2) for faster convergence. To compute the output of the LSTM, representing a probability vector over the next word, a 'deep output layer' 40 is used. It looks at the LSTM state h_t (5), the context vector â_t (8) and the previous word z_{t−1} (2):

P(z_t | a, z_{t−1}) ∝ exp(L_o(E z_{t−1} + L_h h_t + L_a â_t)), (10)

where L_o, L_h and L_a are learned matrices and E ∈ R^{m×W} is the embedding matrix.
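Equations (7) and (8) can be sketched directly. The following minimal Python functions compute the softmax weights over per-location scores and the resulting soft-attention context vector (the scores e_ti would come from f_att in the real model):

```python
import math

def attention_weights(scores):
    """Softmax over per-location relevance scores e_ti (Eq. 7):
    each alpha_ti is positive and the weights sum to 1."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(features, alphas):
    """Deterministic soft attention (Eq. 8): weighted sum of the C encoded
    D-dimensional vectors a_i with their weights alpha_ti."""
    dim = len(features[0])
    return [sum(a * f[d] for a, f in zip(alphas, features)) for d in range(dim)]
```

Uniform scores give uniform weights, and the context vector then reduces to the mean of the encoded vectors.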
The authors of 11 suggest using 'doubly stochastic attention', where Σ_t α_ti ≈ 1. This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation. Yet this method is not relevant for X-rays, as each part of the chest is in almost the same position from image to image. If the model has learned, e.g., that the heart is in its specific position, it does not have to search for the heart elsewhere. The model is trained in an end-to-end manner by minimizing the cross-entropy loss L_CE between the softmaxed probability distribution of the next word and the true caption: L_CE = −log(P(z|a)).

Generative Pretrained Transformer
The Generative Pretrained Transformer (GPT-3) 41 is a large transformer-based language model with 1.75 × 10^11 parameters, trained on 570 GB of text. GPT-3 can be used to generate realistic text continuations in an arbitrary domain. Essentially, GPT-3 is a transformer that looks at a part of a sentence and predicts the next word, thus acting as a language model. The original transformer 42 is made up of an encoder stack and a decoder stack, in which encoders and decoders are stacked upon each other, whereas GPT-3 is built from decoder blocks only. One decoder block consists of a masked self-attention layer and a feed-forward neural network. The attention is called masked because it attends only to previous inputs. The input must be encoded prior to entering the decoder block. In transformers, and in GPT-3 in particular, there are two subsequent encodings: byte pair token encoding and positional encoding. Byte Pair Encoding (BPE) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. The algorithm compresses data by finding the most frequently occurring pairs of adjacent subtokens and replacing all instances of the pair with a single subword, repeating this process until no further compression is possible. Such tokenization avoids adding a special <unk> token to the vocabulary, as all words can be encoded as combinations of subwords from the vocabulary.

Proposed Architecture
We introduce two architectures for X-ray image captioning. The overall goal of our approach is to improve the quality of the clinical records generated by an encoder-decoder model by using the GPT-3 language model. The suggested model consists of two parts: the encoder-decoder (LSTM) with an attention module, and GPT-3. While the encoder with the LSTM detects pathologies and indicates zones demanding higher attention, GPT-3 takes this output as input and writes a comprehensive medical report.
There are two possible approaches to this task. The first one consists in forcing the models to learn a joint word distribution. Within this method (Fig. 2), both models A and B output scores for the next word in a sentence. By concatenating these scores and pushing them through a feed-forward neural network C, we get the final scores for the upcoming word. The disadvantage of this approach is the following: the GPT-3 model has its own vocabulary built by the byte pair tokenizer. This vocabulary differs from the one used by Show Attend and Tell. We need to take from the continuous GPT-3 distribution only the scores corresponding to the words present in the Show Attend and Tell vocabulary. This turns the continuous distribution from GPT-3 into a discrete one, and hence we do not use the full generation power of GPT-3. The second method, shown in Fig. 3, consists in fine-tuning both models on the MIMIC-CXR dataset and using them one after another. Show Attend and Tell (A) gets an image as input and generates a report based on the data found in the X-ray with the attention module. It learns where to focus and gives a seed for GPT-3 (B) to continue generating text. GPT-3 was fine-tuned on MIMIC-CXR in a self-supervised manner using the Huggingface framework 43, learning to predict the next word in the text. GPT-3 continues the report output by SAT and generates a detailed and complete clinical report based on the pathologies found by SAT. Such an approach is better for GPT-3, as it gets more context as input (from SAT) than in the first approach. Thus, the second approach performs better and was chosen by the authors of this paper as the main architecture.
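The data flow of the second approach can be sketched as a short orchestration function. The two callables below are hypothetical stubs standing in for the trained SAT and GPT models; only the hand-off of the seed report is illustrated:

```python
def generate_report(image, sat_model, gpt_model, start_token="<start>"):
    """Second approach (Fig. 3): SAT produces a seed report from the image,
    then the fine-tuned GPT model continues it into a full clinical report."""
    seed = sat_model(image)                          # stage 1: findings seed
    continuation = gpt_model(seed + " " + start_token)  # stage 2: continuation
    return (seed + " " + continuation).strip()

# stub models illustrating the data flow only
sat_stub = lambda img: "pleural effusion present."
gpt_stub = lambda prompt: "small bilateral pleural effusions with adjacent atelectasis."
report = generate_report(None, sat_stub, gpt_stub)
```

In the real pipeline the second stage is the Huggingface fine-tuned model rather than a stub, and generation stops at the <|endoftext|> token.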

First Language Model
The first part of the suggested model is realized as the Show Attend and Tell model (SAT): the encoder encodes the image, and the LSTM decodes it into a sequence. The encoder encodes the input image with 3 or 1 color channels into a smaller image with 'learned' channels. The resulting encoded images can be interpreted as a summary representation of the findings in the X-ray (Eq. 2). Encoders pretrained on ImageNet 44 are not suitable for the medical image captioning task, as chest X-rays do not contain objects and figures from everyday life. Thus, the DenseNet-121 from 45 pretrained on the MIMIC-CXR dataset was taken. It was trained for a classification task on 18 labels: Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural Thickening, Cardiomegaly, Nodule, Mass, Hernia, Lung Lesion, Fracture, Lung Opacity, and Enlarged Cardiomediastinum. Hence, the last classification layer was removed and features from the last convolutional layer were taken. These features were passed through an adaptive average pooling layer. As a result, the encoded image parts were obtained. They can be represented by a tensor with dimensions (batch size, C, D, D) (Eq. 2), where C stands for the number of channels, i.e., how many different image regions to consider, and D is the dimension of an encoded image region. Furthermore, a fine-tuning method for the encoder was added; it enables or disables the calculation of gradients for the encoder's parameters in the last layers. Then, at every time step, the decoder with the attention module observes the encoded small images with findings and generates a caption word by word. The encoder output is received and flattened to dimensions (batch size, C, D × D). Since captions are padded with a special <pad> token, captions are sorted by decreasing length, and at every time step of generating a word an effective batch size is computed in order not to process the <pad> tokens.
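The effective-batch-size trick can be shown with a small helper. Given caption lengths sorted in decreasing order, it returns how many captions are still active at each decoding time step, so the LSTM never processes a <pad> token:

```python
def effective_batch_sizes(caption_lengths):
    """caption_lengths is sorted in decreasing order; at decoding time step t
    only the captions still longer than t are processed, so no compute is
    spent on <pad> tokens."""
    max_len = max(caption_lengths)
    return [sum(1 for length in caption_lengths if length > t)
            for t in range(max_len)]
```

For a batch with caption lengths [5, 3, 2], the decoder runs on 3, 3, 2, 1 and 1 captions at the five successive time steps.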
The Show Attend and Tell model was trained using the teacher-forcing method, whereby at each step the input to the model is the ground-truth word for that step rather than the previously generated word. As a result, we can consider SAT as a language model A. It gets a tokenized text and an image as input and outputs a vector of probabilities for the next word at each time step t:

P_1(z_t | a, z_1, ..., z_{t−1}) ∈ R^W, t = 1, ..., L, (11)

where W is the SAT vocabulary size and L is the length of the generated report (Eq. 1).
During the training process, the LSTM outputs the word with the maximum probability after the softmax layer. This is a greedy approach, yet there is also the option to use K-beam search. The authors of 46 used K-beam search during training; however, this is not a common approach. In our experiments, the greedy approach was used during training, and we applied K-beam search at the inference stage.
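The difference between greedy decoding and K-beam search can be demonstrated with a minimal beam-search sketch over a toy next-word model (the transition table below is invented purely for illustration):

```python
import math

def beam_search(step_fn, start="<s>", k=3, max_len=10, eos="<end>"):
    """K-beam search: keep the k highest-scoring partial captions at each
    step instead of greedily committing to the single best next word.
    step_fn(seq) returns (token, log-probability) pairs for the next word."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                  # finished beams carry over
                candidates.append((seq, score))
                continue
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]

# toy model where greedy decoding would pick 'a' (p=0.6), but the full
# sequence through 'b' is more probable: 0.4 * 0.9 > 0.6 * 0.1
table = {
    "<s>": [("a", math.log(0.6)), ("b", math.log(0.4))],
    "a": [("<end>", math.log(0.1))],
    "b": [("<end>", math.log(0.9))],
}
best = beam_search(lambda seq: table[seq[-1]], k=2)
```

Greedy decoding commits to 'a' and ends with total probability 0.06, whereas the beam keeps 'b' alive and returns the sequence with probability 0.36.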

Second Language Model
The second part of the proposed architecture is GPT-3, acting as a language model. GPT-3 is built from decoder blocks of the transformer architecture, where each decoder block consists of masked self-attention and a feed-forward neural network (FFNN). The output yields the token probabilities, i.e., logits. GPT-3 was pretrained separately on the MIMIC-CXR dataset and was then fine-tuned together with SAT to enhance clinical reports.
We put a special token <start> at the end of the text generated by SAT, allowing GPT-3 to understand where to start the generation process. We also used K-beam search after the GPT-3 generation and took the second-best sentence from the output as a continuation. The pretrained GPT-3 performs as a separate language model B and generates good records based on the input text or tags. GPT-3 generates the report until it emits the special <|endoftext|> token. We denote the length of the GPT-3 generated text as l.

Combination of two language models
We use a combination of the two models, placing them sequentially: the SAT model extracts visual features from the image and allows us to focus on its specific parts, while GPT-3 provides good and comprehensive text based on what is found by the first model. Thus, the predictions of the first model improve those of the second language model.

Evaluation metrics
The common evaluation metrics used for image captioning are: bilingual evaluation understudy (BLEU) 47, recall-oriented understudy for gisting evaluation (ROUGE) 48, metric for evaluation of translation with explicit ordering (METEOR) 49, consensus-based image description evaluation (CIDEr) 50, and semantic propositional image caption evaluation (SPICE) 51. The Microsoft Common Objects in Context toolkit 52 provides an implementation of these metrics for the image captioning task.
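The word-overlap idea behind these metrics can be illustrated with the clipped unigram precision at the core of BLEU-1 (a simplified sketch; the full BLEU score combines several n-gram orders with a brevity penalty):

```python
from collections import Counter

def bleu1_precision(candidate, reference):
    """Modified unigram precision, the core of BLEU-1: candidate word counts
    are clipped by their counts in the reference before being credited."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    clipped = sum(min(count, ref_counts[word])
                  for word, count in Counter(cand).items())
    return clipped / len(cand)
```

Clipping prevents a degenerate caption that repeats one reference word from scoring perfectly.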

Datasets
For training and evaluation of medical image captioning, we use three publicly available datasets. Two of them are medical image datasets and the third is a general-purpose one.

MIMIC-CXR
The MIMIC Chest X-Ray (MIMIC-CXR) 53 dataset is a large publicly available dataset of chest radiographs in DICOM format with free-text radiology reports. It consists of 377,110 images corresponding to 227,835 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA.
Open-I
The Indiana University Chest X-Ray Collection (IU X-Ray) 20 contains radiology reports associated with X-ray images. This dataset contains 7,470 image-report pairs. All the reports enclose the following sections: impression, findings, tags, comparison, and indication. We use the concatenation of impression and findings as the target captions.

MS-COCO
The Microsoft Common Objects in Context dataset (MS COCO) 54 is a large-scale non-medical dataset for scene understanding. The dataset is commonly used for training and benchmarking object detection, segmentation, and captioning algorithms.

Image preprocessing
A Hierarchical Data Format (HDF5) 55 dataset was used to store all images. X-rays are gray-scale and have one channel.
To process them with the pretrained CNN DenseNet-121, we used 1-channel images. Each image was resized to 224×224 pixels, normalized to the range from 0 to 1, converted to the float32 type, and stored in the HDF5 dataset.

Image captions pre-processing
Following the logic in 56, a medical report is considered as the concatenation of the Impression and Findings sections; if both of these sections are empty, the report is excluded. This resulted in 360,666 DICOMs with reports for the MIMIC-CXR dataset.
The text records are pre-processed by converting all tokens to lowercase and removing all non-alphanumeric tokens. For our experiments we used 75% of the data for training, 24.75% for validation and 0.25% for testing. The MIMIC-CXR database was used to access metadata and labels derived from the free-text radiology reports. These labels were extracted using the NegBio tool 17,25, which outputs one of 14 pathologies along with their severity and/or absence. To generate more accurate reports, we added the extracted labels to the beginning of the report. This lets the language models know the summary of the report for more precise description generation.
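The caption pre-processing steps above can be sketched as follows. The exact cleaning rules are assumptions for illustration (here sentence-final periods are kept, since the example reports retain them), and the label strings stand in for the NegBio output:

```python
import re

def clean_report(text):
    """Lowercase the report and replace non-alphanumeric characters with
    spaces; periods are kept as sentence separators (an assumption)."""
    return re.sub(r"[^a-z0-9\s.]", " ", text.lower()).strip()

def prepend_labels(labels, report):
    """Prepend the extracted pathology labels as short sentences, so the
    language models see a summary before the full report text."""
    prefix = "".join(f"{label.lower()}. " for label in labels)
    return prefix + clean_report(report)
```

This produces captions of the form "pleural effusion present. ..." seen in the examples of Table 2.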
We additionally formed an abbreviations dictionary of 150+ words from the Unified Medical Language System (UMLS) 57. We also extended our dictionary with several commonly used medical terms from the Medical Concept Annotation Tool 58.

Training of the Neural Network
The pipeline is implemented using PyTorch. Experiments were conducted on a server running Ubuntu 16.04 (32 GB RAM). All models were trained on an NVIDIA Tesla V100 GPU (32 GB RAM). In all experiments, we used 5-fold cross-validation and reported the mean performance. The SAT was trained for 70 epochs with a batch size of 16, an embedding dimension of 100, attention and decoder dimensions of 512, and a dropout value of 0.1. The encoder and decoder learning rates were 4 × 10^−7 and 3 × 10^−7, respectively. The cross-entropy loss was used for training. The best model was chosen according to the highest geometric mean of the BLEU-n scores, as is done in other works 59. SAT was trained with the teacher-forcing technique, while the greedy approach was used for computing metrics. The GPT-3 small model was fine-tuned on the MIMIC-CXR dataset for 30 epochs with a batch size of 4, a learning rate of 5 × 10^−5, an Adam epsilon of 1 × 10^−8, a block size of 1024, and clipping of gradients larger than 1.0. It was fine-tuned in a self-supervised manner as a language model. No data augmentation was applied.
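The model-selection criterion, the geometric mean of the BLEU-n scores, is a one-liner worth making explicit:

```python
import math

def bleu_geometric_mean(bleu_scores):
    """Model selection criterion: geometric mean of the BLEU-1..BLEU-n
    scores measured on the validation fold (all scores must be positive)."""
    assert all(s > 0 for s in bleu_scores)
    return math.exp(sum(math.log(s) for s in bleu_scores) / len(bleu_scores))
```

Unlike the arithmetic mean, the geometric mean strongly penalizes a model whose higher-order n-gram scores collapse toward zero.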

Quantitative results
The quantitative results for the baseline models, preceding works and our models are presented in Table 1. Models were evaluated on the commonly used Open-I dataset as well as on the large and rarely reported MIMIC-CXR dataset with free-text radiology reports. We used the most common evaluation metrics: BLEU-n, CIDEr and ROUGE. The proposed approach outperforms the existing models in terms of these NLG metrics.
We additionally provide illustrations of our model's performance in Table 2, containing the original X-ray images from the MIMIC-CXR dataset, the ground-truth expert label and the model prediction (SAT + GPT-3). We manually underlined the similarities and identical diagnoses in the texts to guide the eye.

Discussion
The first language model (SAT) learned to generate a short summary at the beginning of the report, based on the findings in the X-ray, to provide the finding details. This offers a text-generation seed for the second model. The preprocessing performed on the medical reports allowed us to reach these high metrics. We also address the biased-data problem by applying domain-specific text preprocessing using the NegBio labeller. In a radiology database, the data is unbalanced because abnormal cases are rarer than normal ones. The NegBio labeller allowed us to obtain clinical records without a negative diagnosis bias, as it added short sentences at the beginning of the ground-truth report, making this task closer (in some ways) to a classification task, where state-of-the-art models have already achieved strong performance. SAT also provides 2D heatmaps of pathology localization, assisting and accelerating the diagnostic process followed by clinicians.
The second language model, the Generative Pretrained Transformer (GPT-3), showed promising results in the medical domain. It successfully continued texts from the first language model, taking into consideration all the findings provided. As GPT-3 is a large and capable transformer, it summarizes the findings and provides more details on them. The natural language generation metrics support using the two language models sequentially; such an approach can be considered a strong one for text generation.
SAT followed by GPT-3 outperformed the reported state-of-the-art (SOTA) models on all three datasets considered. Notably, the proposed approach beats the SOTA models on MIMIC-CXR, demonstrating the highest performance on all the metrics measured. The performance on the main evaluation dataset, Open-I, was also measured by the micro-averaged F1-score, which is 0.861 vs. 0.840 for the proposed (SAT + GPT-3) model and SAT alone, respectively.
Examples of reports generated jointly by the Show-Attend-Tell + GPT-3 architecture are shown in Table 2. One may notice that some generated sentences are identical to the ground truth. For example, in both the generated and true reports for the first X-ray there is "no acute cardiopulmonary abnormality". Some sentences are close in their meaning even if they differ in the chosen words and n-grams ("no pneumonia. no pleural effusion. no edema..." compared to "without pulmonary edema or pneumothorax").

Conclusions
The authors of the current paper introduced a new technique combining two language models for the medical image captioning task. In particular, new preprocessing and summarization approaches for clinical records were implemented, along with a combined language model in which the first component is based on an attention mechanism and the second is a generative pretrained transformer. The proposed combination of models generates a descriptive textual summary with essential information on the pathologies found, along with their location and severity. Besides, the 2D heatmaps localize each pathology on the original X-ray scans. The results measured with natural language generation metrics on both the MIMIC-CXR and Open-I datasets speak for an efficient applicability to the chest X-ray image captioning task. This approach also provides well-interpretable results and supports medical decision making.
We investigated various approaches to automatic X-ray captioning from the angle of text generation. We showed that Show-Attend-Tell is a strong baseline outperforming models with transformer-based decoders. With the help of the GPT-3 pretrained language model, we managed to improve this baseline. The simple method, in which the GPT-3 model finishes the report started by the Show-Attend-Tell model, yields significant improvements in the standard text-generation scores.

Figure 1. Attention module used in SAT.

Figure 2. The first approach: learn the joint distribution of the two models. The drawback is in sampling from the GPT-3 distribution.
surgical clips are seen in the right upper quadrant of the abdomen. aortic arch calcifications are noted. Compared to prior chest radiographs through . Previous mild pulmonary edema has improved; moderate cardiomegaly and mediastinal vascular engorgement have not. ET tube and right transjugular temporary pacer lead are in standard placements, and an esophageal drainage tube passes into the stomach and out of view. Pleural effusions are presumed but not substantial. No pneumothorax. support devices present. no pneumothorax. pleural effusion present. lung opacity present. uncertain enlarged cardiomediastinum. no edema. atelectasis present. right internal jugular central line has its tip in the distal superior vena cava. overall cardiac and mediastinal contours are likely stable given patient rotation on current study. lung volumes remain low with patchy opacities at both bases likely reflecting atelectasis. blunting of both costophrenic angles may reflect small effusions.

Table 1 .
Reported mean performance using word-overlap metrics for two medical radiology datasets and one non-medical, general-purpose dataset. Here SAT stands for the model implemented by us and trained on the preprocessed MIMIC-CXR data. BLEU-n denotes the BLEU score that uses up to n-grams.

Lungs remain well inflated without evidence of focal airspace consolidation, pleural effusions, pulmonary edema or pneumothorax. Irregularity in the right humeral neck is related to a known healing fracture secondary to recent fall. PA and lateral views of the chest at 09:55 are submitted. no findings. no pneumonia. no pleural effusion. no edema. there is little change and no evidence of acute cardiopulmonary disease. no pneumonia, vascular congestion, pleural effusion. incidental note is an azygos fissure, of no clinical significance. this raises the possibility of a normal variant. 1. Stable bilateral small pleural effusions and atelectasis. 2. Enlarged pulmonary artery, suggesting pulmonary hypertension. No significant interval change. Bilateral small pleural effusions and adjacent atelectasis are overall unchanged. The heart is top-normal in size, unchanged. The pulmonary artery is enlarged, suggesting pulmonary hypertension. No focal consolidation to suggest pneumonia, or pneumothorax. pleural effusion present. lung opacity present. no edema. cardiomegaly present. atelectasis present. as compared to previous radiograph, there is an increase in extent of a pre-existing small left pleural effusion with subsequent atelectasis at left lung bases. otherwise, radiograph is unchanged. moderate cardiomegaly. mild fluid overload, no overt pulmonary edema. no new focal parenchymal opacities suggesting pneumonia. unchanged position of right pectoral port-a-cath.
There is decrease in the now small right pleural effusion. There is no pneumothorax. There is a new right pacer pigtail catheter. Cardiomediastinal contours are unchanged. Lines and tubes are in standard position. Left lower lobe opacities, a combination of pleural effusion and atelectasis, are unchanged. uncertain pneumonia. pleural effusion present. lung opacity present. atelectasis present. bilateral pleural effusions, left greater than right. bibasilar opacities potentially atelectasis in setting of low lung volumes. infection cannot be excluded. frontal and lateral views of chest demonstrate low lung volumes, which accentuate bronchovascular markings. there are small bilateral pleural effusions, right greater than left, with adjacent atelectasis. there is no focal consolidation or pneumothorax. cardiomediastinal silhouette is within normal limits.

Table 2 .
Examples of reports generated by the proposed SAT + GPT-3 model.