ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset

Automated medical image analysis systems often require large amounts of training data with high-quality labels, which are difficult and time-consuming to generate. This paper introduces Radiology Objects in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018 and adds 35,705 new images that have been added to PMC since 2018. It further provides manually curated concepts for imaging modalities, with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models and for the evaluation of deep learning models for multi-task learning.


Background & Summary
Recent years have seen tremendous progress in medical imaging. The advent of deep learning techniques has enabled the development of sophisticated models for image analysis tasks. Multimodal image datasets play a crucial role in the development and validation of these models. One such dataset is Radiology Objects in COntext (ROCO) 1, which has enabled researchers to develop models for a wide range of tasks, including concept detection, caption generation, and image-text retrieval.
The first version of the ROCO dataset was introduced by Pelka et al. 1 in 2018. It includes image-caption pairs from two classes: Radiology images from multiple imaging modalities and Out-of-Class images, such as synthetic radiology figures, digital art, and portraits, from peer-reviewed publications in the open-access subset of the biomedical literature database PubMed Central (PMC) 2. The dataset contains 81,825 radiology images and 6127 out-of-class images. In addition to the images and their captions, the dataset provides keywords, Unified Medical Language System® (UMLS®) Semantic Types (SemTypes), and UMLS Concept Unique Identifiers (CUIs) for each image. This information makes the dataset suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using the UMLS concepts provided with each image, e.g., to develop systems supporting structured medical reporting. The dataset is available on GitHub (available at https://github.com/razorx89/roco-dataset, accessed 2024-03-12) in the form of links to the publications and scripts to download and extract the images from them.
The ROCO dataset has been used in the medical caption tasks [3][4][5][6] at the Image Retrieval and Classification Lab of the Conference and Labs of the Evaluation Forum (ImageCLEF) 7 .
ROCOv2 is the result of more than four years of updates and improvements to the original ROCO dataset. Due to the focus on radiological images, ROCOv2 does not include out-of-class images like ROCO, and to allow direct distribution of the images, only images from CC BY licensed articles (including CC BY-NC, but excluding CC BY-ND and CC BY-SA) are included.
Other changes include manually curated concepts, e.g., for the modality of all images and the anatomy and directionality of X-ray images, and improved concept extraction with the Medical Concept Annotation Toolkit v1.10.0 (MedCAT) 8, which is based on a newer version of the UMLS database and uses word embeddings, in contrast to QuickUMLS 9, which relies on direct dictionary matches. In addition, better concept filtering has been introduced.
The ROCOv2 dataset serves as a valuable resource for various applications and use cases in the medical domain, as it contains a vast amount of biomedical knowledge stored in the literature. One of the primary applications of this dataset is to train and evaluate models across different modalities for tasks such as image caption generation. By leveraging the multimodal nature of the dataset, researchers can develop models that accurately describe the content of radiological images, facilitating better understanding and communication of medical findings. Furthermore, the ROCOv2 dataset can be utilized to build and train an efficient image retrieval system specifically tailored for the medical domain. Such a system would allow healthcare professionals to quickly search for relevant radiological images based on specific queries or similar case studies, enhancing the decision-making process and enabling more informed patient care. The image-caption pairs available in the dataset provide a rich foundation for training these retrieval models, ensuring accurate and relevant results. This can also be further extended to build a multimodal retrieval-augmented generation (RAG) system that can be used in tasks such as generating detailed medical reports or answering complex clinical questions. Multimodal RAG also opens up additional possibilities beyond image retrieval. By combining the visual information from radiological images with the textual data from captions and associated medical literature, generative AI models can be trained or fine-tuned to produce more comprehensive and context-aware outputs. This approach can lead to the generation of synthetic data, which is particularly useful in cases where real-world medical data is scarce or difficult to obtain. The generated data can be used to augment existing datasets, improve model robustness, and support further research and development in the field of medical AI.
All sources of the dataset were openly available as part of the PMC Open Access Subset at the time of publication of the dataset.
Summarizing the main contributions of this work:
• Dataset of 79,789 radiological images with associated captions and medical concepts
• Possible use cases include training of image captioning, image retrieval, and pre-training models
• Concepts were automatically generated from the captions and combined with manually curated concepts for modality (all images), body region (X-ray only), and directionality (X-ray only)
• Images and captions were extracted from openly available publications with CC BY licenses in the PMC Open Access Subset

Related Work
Since its initial release, the ROCO dataset has been used as the foundation for generating training and test data for multiple iterations of the ImageCLEFmedical Caption tasks, up to and including the most recent edition in 2023. Beyond these tasks, the ROCO dataset has proven to be a valuable resource for medical imaging research, resulting in its inclusion in several studies over the years. For instance, Eslami et al. 10 investigated the effectiveness of Contrastive Language-Image Pre-training (CLIP) 11 for the task of Medical Visual Question Answering (MedVQA) by leveraging the ROCO dataset to fine-tune CLIP for the medical domain. They chose the ROCO dataset for training due to its comprehensive collection, which includes various imaging modalities such as ultrasound, X-ray, PET, CT, MRI, and angiography from different human body regions, such as the head and pelvis. Their research resulted in the creation of PubMedCLIP, a specialized vision encoder that outperformed the general CLIP on two MedVQA benchmarks.
In addition to the ROCO dataset, several other datasets containing both image and text data have been published and used in medical imaging research.
One of the most notable examples is the MIMIC-CXR 12 dataset. MIMIC-CXR is a large, publicly available collection of thorax radiology images paired with semi-structured free-text reports detailing radiological findings. The dataset includes 227,835 imaging studies from 65,379 patients, resulting in 377,110 images. Another dataset which also focuses on chest X-rays is the Open-I Indiana Chest X-ray collection 13. It consists of 3996 de-identified radiology reports and 8121 associated images. PADCHEST 14 is a large dataset consisting of more than 160,000 chest X-ray images and associated reports from 67,000 patients. The reports are annotated with 174 different radiographic findings, 19 differential diagnoses, and 104 anatomical locations, hierarchically organized and mapped to standard UMLS 15 terminology. Compared to this work, MIMIC-CXR, Open-I Indiana Chest X-ray, and PADCHEST focus only on chest X-rays, whereas ROCOv2 includes a wide range of anatomical regions, medical concepts, and modalities.
In another work, Subramanian et al. 16 introduced MedICaT, a dataset of medical images, captions, and textual references. The dataset contains 217,060 images sourced from 131,410 open-access biomedical papers, featuring captions and inline references for 74% of the figures. Additionally, manual annotations including bounding boxes for subfigures and their corresponding subcaptions are provided for a subset of 2069 figures, resulting in 7507 subfigure-subcaption pairs. Based on the dataset, the authors introduced the task of aligning subfigures with their corresponding subcaptions in compound figures and highlighted the valuable role of inline references in facilitating image-text matching. In contrast to the ROCOv2 dataset, MedICaT contains images distributed under the CC BY-ND and CC BY-SA licenses, which prohibit the distribution of processed images or require redistribution under the same license.
Recently, Lin et al. 17 proposed PMC-CLIP, a pre-trained model that uses biomedical documents for contrastive language-image pre-training based on PMC-OA, a biomedical dataset of 1.6 million image-caption pairs collected from the Open Access subset of PMC. The dataset covers various modalities and diseases, with the majority of the image annotation samples aligned at a fine-grained level, i.e., sub-figure and subcaption. The PMC-CLIP model achieves state-of-the-art results on several downstream tasks, including image-text retrieval on ROCO, MedMNIST 18 image classification, and medical VQA. To focus the PMC-OA dataset on biomedical images, filtering based on a keyword search and a deep learning-based classification model is used. Another pre-trained model, BiomedCLIP 19, performs well on various biomedical imaging tasks. It was pre-trained on PMC-15M, a large-scale dataset that includes a diverse range of 15 million biomedical image-text pairs. In contrast, ROCOv2 uses manual validation as a filtering step to achieve a high-quality radiological image dataset. Similar to the MedICaT dataset, PMC-OA contains images distributed under the CC BY-ND and CC BY-SA licenses, which are excluded from ROCOv2, so that the images of the dataset can be directly distributed.
Expanding on these datasets, the PMC-VQA 20 dataset has recently been introduced with a focus on the MedVQA task. The dataset contains 226,946 VQA pairs and 149,075 images, covering various medical modalities and diseases. It provides a comprehensive basis for the development and evaluation of MedVQA models. Data generation started with 381k image-caption pairs from the PMC-OA dataset. These captions were used with ChatGPT to generate five question-answer pairs per caption, which then underwent a filtering process. Experiments with models trained on the PMC-VQA dataset have demonstrated superior performance on established benchmarks such as the Visual Question Answering in Radiology (VQA-RAD) 21 and Semantically-Labeled Knowledge-Enhanced (SLAKE) 22 datasets. In addition, the authors proposed a manually verified test set that is more challenging and reflects the complexity of the real world. As the PMC-VQA dataset is based on the PMC-OA dataset, the same drawbacks apply, including no manual filtering and no filtering based on licenses.
Leveraging the integration of visual and linguistic elements in medical datasets, Moor et al. 23 proposed Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. It is based on OpenFlamingo-9B 24 and has been further pre-trained on paired and linked medical image-text data from publications and textbooks. Its unique strength lies in generative MedVQA, especially for open-ended questions similar to United States Medical Licensing Examination (USMLE) style problems. It has demonstrated its effectiveness by improving performance in generative MedVQA by up to 20% on clinician ratings. The model was fine-tuned using the PMC-OA dataset, whose drawbacks have already been discussed above.
In summary, while existing datasets such as MIMIC-CXR, PADCHEST, and PMC-OA have contributed to the field of medical imaging research, they have certain limitations. These datasets either focus on specific anatomical regions (e.g., chest X-rays), have license restrictions, or do not have manual validation. ROCOv2 aims to address some of these limitations by providing a diverse, manually validated dataset covering a wide range of anatomical regions, medical concepts, and modalities. In addition, by including only images with permissive licenses, ROCOv2 allows for the distribution of the dataset, facilitating its use in various research applications.

Dataset Creation
Figure 1 shows a schematic overview of the dataset creation workflow described below. The first step in creating the ROCOv2 dataset was to download the full PMC Open Access Subset via FTP, including all archives added until 2022-10-27. In this manner, 4,798,923 archives occupying 22 TB of disk space were downloaded.
After extracting the archives, which contain the PDF of the paper as well as any images contained in the paper and an XML representation of the paper, the 16,324,613 extracted images were run through two binary classification models. The first was used to filter for non-compound images, while the second was used to filter for radiological images. The models are part of the original ROCO workflow described in 1 and achieved accuracies of about 90% and 98.6%, respectively. After this filtering step, 188,537 images (102,807 articles) were left, which were further reduced to 119,140 images (67,357 articles) by filtering out images from papers not licensed under CC BY or CC BY-NC, as well as images subject to copyright of other commercial organizations or individuals, and by removing 2056 duplicates identified using AntiDupl v2.3.10 (available at https://github.com/ermig1979/AntiDupl/, accessed 2023-11-10).
From the remaining images, the new validation and test sets were created using 21,545 images from 2021 and 2022 that were not previously used in the ImageCLEFmedical Caption datasets. They were manually annotated for modality (angiography, CT, MRI, PET, ultrasound, X-ray, and combined modalities), Image Retrieval in Medical Applications (IRMA) 25 body region (X-ray only), and directionality (X-ray only), and combined with 34,900 images that were part of the original ROCO dataset and 28,086 images that had been used in previous ImageCLEFmedical Caption datasets. Stratified random sampling based on the manually curated concepts was used to divide the images into validation and test sets, and generated concepts that did not appear in the training set were removed from the validation and test sets.
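A minimal sketch of this split, assuming scikit-learn and a single curated label (e.g., modality) per image as the stratification key, is shown below; the actual procedure stratified on the full set of manually curated concepts:

```python
# Hedged sketch of the stratified validation/test split described above.
# The 50/50 ratio and the single stratification label are assumptions.
from sklearn.model_selection import train_test_split

def split_validation_test(image_ids, curated_labels, seed=42):
    """Split images into validation and test halves, preserving the
    distribution of manually curated labels in both splits."""
    valid_ids, test_ids = train_test_split(
        image_ids,
        test_size=0.5,            # assumed: the published splits are near-equal
        stratify=curated_labels,  # stratified random sampling on curated concepts
        random_state=seed,
    )
    return valid_ids, test_ids

def drop_unseen_concepts(split_concepts, train_concepts):
    """Remove generated concepts that do not appear in the training set."""
    seen = {cui for cuis in train_concepts for cui in cuis}
    return [[cui for cui in cuis if cui in seen] for cuis in split_concepts]
```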
The resulting 84,530 images (46,904 articles) were filtered for valid captions. 1528 images with non-English captions were removed. In addition, very short captions without relevant information (e.g., "Figure 1") were removed, resulting in a final dataset of 79,789 images (44,975 articles), with 59,958 images in the training set, 9904 images in the validation set, and 9927 images in the test set, and 1947 unique CUIs overall, with 1947 in the training set, 1760 in the validation set, and 1754 in the test set. The detailed labeling and concept generation workflow is described in the next section. Compared to the dataset used in the ImageCLEFmedical Caption 2023 task, approximately 1500 non-radiological images were removed, further improving the quality of the dataset. Of the 81,825 radiology images in the original ROCO dataset, 33,645 were incorporated into ROCOv2 26, with the rest being excluded due to their license. They were combined with 46,144 images which have been used in ImageCLEFmedical Caption challenge datasets from 2021 to 2023.

Caption Processing and Concept Extraction
To extract concepts from the captions, several pre-processing and filtering steps were performed. First, captions in languages other than English were excluded to focus the analysis on English-language concepts. This was done using the fastText 27 language identification model. Captions identified as non-English with a confidence greater than 45% were excluded from the dataset. To reduce the risk of erroneously removing English captions, any caption identified as non-English with a confidence of less than 45% was retained, under the assumption that it was likely written in English. Of all the captions, 1528 were identified as non-English. Table S13 in the Supplementary Information shows the frequency of non-English captions across the different languages in the dataset. French captions were the most common with a count of 1413, followed by Portuguese and Spanish with 55 and 48 captions, respectively. Next, URLs within the captions were identified and removed, as they often do not provide relevant information for concept extraction. In addition, some captions were identified as consisting entirely of LaTeX code, and these were also removed from the dataset. Empty captions and those containing minimal information, such as "xxx", were discarded during pre-processing.
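As an illustration, the language filter could be implemented as follows with the publicly available fastText language-identification model; the model file name and the exact handling of the 45% threshold are assumptions based on the description above:

```python
# Sketch of the caption language filter; lid.176.bin is the public fastText
# language-identification model, and the 45% cutoff follows the text above.
import fasttext

model = fasttext.load_model("lid.176.bin")

def keep_caption(caption: str, threshold: float = 0.45) -> bool:
    """Return True if a caption is kept (i.e., treated as English)."""
    labels, probs = model.predict(caption.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    if lang != "en" and probs[0] > threshold:
        return False  # confidently non-English: exclude
    return True       # English, or non-English below the confidence cutoff
```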
After the initial pre-processing, the remaining captions were further processed to extract relevant concepts using the Medical Concept Annotation Toolkit (MedCAT) 8 framework. MedCAT is a robust tool specifically designed for extracting and linking biomedical concepts from unstructured text. It was trained on the MIMIC-III 28 dataset (as of 2022-03-15) and links to Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) IDs. The SNOMED CT IDs were then mapped to UMLS 2022AB release CUIs and semantic types (TUIs), which were then used for concept extraction and filtering.
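A hedged sketch of this extraction step with the MedCAT Python API is shown below; the model-pack path is a placeholder, and the SNOMED CT to UMLS mapping table is assumed to be available separately, as described above:

```python
# Sketch of concept extraction with MedCAT; the model pack and the
# SNOMED-to-UMLS mapping are placeholders, not the authors' exact artifacts.
from medcat.cat import CAT

cat = CAT.load_model_pack("snomed_modelpack.zip")  # hypothetical model pack

def extract_umls_cuis(caption: str, snomed_to_umls: dict) -> set:
    """Link caption text to SNOMED CT concepts and map them to UMLS CUIs."""
    entities = cat.get_entities(caption)["entities"]
    snomed_ids = {ent["cui"] for ent in entities.values()}  # SNOMED IDs here
    return {snomed_to_umls[sid] for sid in snomed_ids if sid in snomed_to_umls}
```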
During concept extraction, a frequency cutoff was implemented that retained only concepts exceeding a frequency threshold of 10. In this way, low-frequency concepts, of which only a few examples were available, were effectively filtered out. By linking concepts to the UMLS, the associated semantic types were filtered to focus on concepts that are likely to be visually observable and interpretable in the images. For example, concepts with the associated UMLS semantic type T029 (Body Location or Region) or T060 (Diagnostic Procedure) are relevant, while concepts of semantic type T054 (Social Behavior) cannot be derived from the image by a model. Specific concept filters were then manually applied to exclude UMLS concepts that could not be directly associated with the image content, such as temporal or qualitative aspects of certain concepts. Blacklisted concepts often contain qualifiers that would distract from the content of actual interest, and they would also introduce bias, since qualifiers are used in highly individual and variable ways by the original authors. Entity linking systems sometimes tend to incorrectly link concepts with ambiguous synonyms, e.g., C0994894 (Patch Dosage Form) may be linked if the caption refers to a region described as patchy. Such concepts, when occurring with high frequency, were manually mapped to the correct CUI.
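The frequency cutoff and semantic-type filtering can be summarized in a few lines; the whitelist below is illustrative and much shorter than the set of semantic types actually retained:

```python
# Sketch of the concept filtering: frequency cutoff of 10 plus a whitelist
# of visually observable semantic types (illustrative subset only).
from collections import Counter

ALLOWED_TUIS = {"T029", "T060"}  # Body Location or Region, Diagnostic Procedure
MIN_FREQ = 10

def filter_concepts(per_image_cuis, cui_to_tuis):
    """per_image_cuis: list of CUI lists; cui_to_tuis: CUI -> set of TUIs."""
    counts = Counter(cui for cuis in per_image_cuis for cui in cuis)

    def keep(cui):
        return counts[cui] >= MIN_FREQ and bool(
            cui_to_tuis.get(cui, set()) & ALLOWED_TUIS
        )

    return [[cui for cui in cuis if keep(cui)] for cuis in per_image_cuis]
```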
In addition to the described automatic concept extraction, further manual creation and validation of basal concepts were performed. As in the original ROCO dataset 1, this mainly focused on the seven supported image modalities (angiography, combined modalities (e.g., PET/CT), CT, MRI, PET, ultrasound, and X-ray). However, for the ROCOv2 dataset, additional concepts were introduced for the X-ray modality, focusing on: (i) the displayed or described body region, and (ii) the directionality on which a projection was based. The body region classification was based on the IRMA classification 25, which distinguishes eight different regions (abdomen, breast, chest, cranium, lower extremity, pelvis, spine, and upper extremity). The directionality classification was based on a reduced set of the most commonly used directions (coronal anteroposterior, coronal posteroanterior, sagittal, and transversal), but was introduced for experimental purposes only.
The reason for the manual creation of concepts is that in many cases the captions do not explicitly state this information but only imply it (e.g., "T2 weighted" implies the MRI modality, "cholangiography" implies the abdominal region).
The manual concept creation pipeline involved an initial manual classification of subsets of tens of thousands (modality, directionality) or several thousand (body region) images. This was done by two annotators for image modalities and X-ray body regions, and by a single annotator for X-ray directionalities. The annotation guidelines followed by the annotators are provided in distilled form as supplementary material. Deep learning image classification models were then trained on these subsets to pseudo-label the remaining images as a preliminary sorting method. These were then manually curated again to resolve errors introduced by the classification models. Finally, the quality of the manually created concepts was validated by a radiologist on representative subsets for each category, with results described in the technical validation section.
The concepts from both manual and automatic extraction were combined, with priority given to the manually curated concepts. Automatically extracted modality concepts were included only for combined modalities (DRCO). The manually identified anatomy and directionality concepts for X-rays were integrated without checking for conflicting concepts from the automatic extraction. Figure 2 shows the concept extraction and caption pre-processing workflow.
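A minimal sketch of this merge logic, assuming a precomputed set of modality-related CUIs (the CUIs shown are placeholders):

```python
# Sketch of the concept merge: manual concepts take priority, and automatic
# modality concepts are kept only for combined-modality (DRCO) images.
MODALITY_CUIS = {"C0000001", "C0000002"}  # placeholder modality-related CUIs

def merge_concepts(manual_cuis, automatic_cuis, is_combined_modality):
    merged = set(manual_cuis)
    for cui in automatic_cuis:
        if cui in MODALITY_CUIS and not is_combined_modality:
            continue  # drop automatic modality concepts unless image is DRCO
        merged.add(cui)
    return sorted(merged)
```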

Data Records
The ROCOv2 dataset files are available on Zenodo 26. The dataset contains images, captions, and concepts for the training, validation, and test splits, as well as image license information.
• "{train,valid,test}_captions.csv":Two-column comma-separated value (CSV) files with image filenames and corresponding captions (escaped with double quotes if necessary).• "{train,valid,test}_concepts.csv":Two-column CSV files with image filenames and corresponding medical concept CUIs separated by semicolons.Includes manually curated and automatically generated concepts.
A dataset analysis is performed as part of the technical validation in the following section.

Technical Validation
The ROCOv2 26 dataset is based on the dataset used in the medical caption task 30 at ImageCLEF 2023 31, where participants had access to the training and validation sets after signing a user agreement. ImageCLEF 2023 consists of the ImageCLEFmedical, ImageCLEFfusion, and ImageCLEFaware labs, where the ImageCLEFmedical lab is divided into the subtasks MEDIQA-Sum (natural language semantic retrieval), Caption, GANs (medical image generation), and MedVQA-GI (gastrointestinal visual question answering). The ImageCLEFmedical Caption task consists of two subtasks: concept detection and caption prediction. All results are also described in detail in the overview paper 30. Since several improvements were made to the dataset compared to the one used in ImageCLEFmedical Caption 2023, such as the removal of approximately 1500 non-radiological images and the addition of approximately 3000 missing manually curated concepts, baseline results are reported separately for both datasets.
The results for both subtasks show that the baseline models achieve results on the ImageCLEF dataset similar to those of the challenge participants, while performing better on the ROCOv2 26 dataset, demonstrating the improved dataset quality. The baseline results, along with several years of competitive and improving scores for both subtasks in the context of the ImageCLEFmedical Caption challenges, show the suitability of the dataset for training models for concept detection and caption prediction.

Concept Detection
For concept detection, participants are asked to predict a set of concepts defined by UMLS CUIs 15 based on the visual information provided by the radiology images, which can help in the development of systems supporting structured medical reporting. The balanced precision and recall trade-off was measured in terms of sample-averaged F1-scores, with a separate F1-score being calculated for the manually curated concepts.
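The primary metric corresponds to the sample-averaged F1-score over multi-hot concept vectors, as provided, e.g., by scikit-learn; the toy example below only illustrates the averaging mode, not the challenge's exact evaluation code:

```python
# Illustration of the sample-averaged F1-score on multi-hot concept vectors.
import numpy as np
from sklearn.metrics import f1_score

# toy example: 3 images, 4 candidate concepts
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 1, 1]])

# F1 is computed per image (sample) and then averaged over all images
print(f1_score(y_true, y_pred, average="samples"))
```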
In the ImageCLEFmedical 2023 Caption challenge, the best team achieved an F1-score of 0.5223 using an ensemble of three multi-label classification models with different architectures 32 .Additional results are shown in Table 1.
Table 1. Performance of the participating teams in the ImageCLEFmedical 2023 Concept Detection subtask. Only the best run based on the achieved sample-averaged F1-score is listed for each team, together with the corresponding secondary sample-averaged F1-score based on manual annotations, as well as the team rankings based on the primary and secondary F1-score. The full results are shown in the overview paper 30. The best results are highlighted in bold. For comparison, the baseline results on the ImageCLEF dataset and on the ROCOv2 26 dataset are included at the bottom.

As previously mentioned, the ROCOv2 26 dataset is an improved version of the dataset provided in the ImageCLEFmedical Caption task. In order to compare the challenge results to the results on the ROCOv2 dataset, two baseline models, namely an EfficientNet-B0 40 and an EfficientNetv2-s 41, were additionally trained, and results for both datasets are given.
For both models, a grid search was performed on the validation dataset for hyperparameter tuning to identify the best combination of optimizer and learning rate. The values 1e-1, 1e-2, ..., 1e-5 were used as candidates for the learning rate. In addition, the Adam 44, Stochastic Gradient Descent (SGD), and Root Mean Square Propagation (RMSProp) optimizers were tested. After hyperparameter tuning, the final models were trained on the entire training and validation dataset. The training augmentation pipeline includes loading the images at 1.25 times the model image size, random horizontal and vertical flipping with a probability of 0.5 each, random cropping to the image size of the model, and image normalization. The validation and test augmentation pipelines include loading the images at 1.25 times the model image size, center cropping to the image size of the model, and image normalization. The loss function used is a multi-label soft margin loss. A sigmoid activation function is applied to all model outputs with a threshold of 0.5. All models are trained using mixed precision 45 for 20 epochs. For the remaining hyperparameters, the default values in PyTorch are used.
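A condensed sketch of this setup in PyTorch/torchvision is shown below; dataset wiring, the training loop, and anything not named in the text (e.g., the normalization statistics) are assumptions:

```python
# Sketch of the baseline setup: described augmentations, EfficientNet-B0,
# multi-label soft margin loss, sigmoid outputs with a 0.5 threshold.
import torch
import torch.nn as nn
from torchvision import models, transforms

IMG_SIZE = 224       # EfficientNet-B0 input size from the text
NUM_CONCEPTS = 1947  # unique CUIs in the dataset

train_tf = transforms.Compose([
    transforms.Resize(int(IMG_SIZE * 1.25)),  # load at 1.25x model image size
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomCrop(IMG_SIZE),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

model = models.efficientnet_b0(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CONCEPTS)

criterion = nn.MultiLabelSoftMarginLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # best grid-search result

def predict(logits, threshold=0.5):
    """Sigmoid activation with a 0.5 decision threshold, as described."""
    return (torch.sigmoid(logits) > threshold).int()
```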
The model based on the EfficientNet-B0 architecture was pre-trained on the ImageNet-1k dataset 46. This model was trained with a batch size of 256, a drop rate of 0.2, and an image size of 224. During hyperparameter tuning, the Adam optimizer with a learning rate of 1e-3 achieved the best sample-averaged F1-score on the validation dataset for both datasets (ImageCLEFmedical Caption and ROCOv2 26). The final model achieved a sample-averaged F1-score of 0.5099 (secondary sample-averaged F1-score: 0.9309) on the ImageCLEFmedical Caption test set and a sample-averaged F1-score of 0.5811 (secondary sample-averaged F1-score: 0.9353) on the ROCOv2 test set. The EfficientNetv2-s model was pre-trained on the ImageNet-21k dataset 47. This model was trained with a batch size of 92, a drop rate of 0.2, and an image size of 300. During hyperparameter tuning, the RMSProp optimizer with a learning rate of 1e-4 achieved the best F1-score on the validation dataset for both datasets (ImageCLEFmedical Caption and ROCOv2 26). The final model achieved a sample-averaged F1-score of 0.5215 (secondary sample-averaged F1-score: 0.9407) on the ImageCLEFmedical Caption test set and a sample-averaged F1-score of 0.5925 (secondary sample-averaged F1-score: 0.9430) on the ROCOv2 test set.

Caption Prediction
The caption prediction subtask aims to automatically generate captions for the provided radiology images. In ImageCLEFmedical Caption, the performance of caption prediction is evaluated based on BERTScore 48, a metric that computes a similarity score for each token in the generated text with each token in the reference text. Several other metrics were also calculated and published to illustrate how difficult the evaluation of caption similarity is: First, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) 49 score was adopted as a secondary metric; it counts the number of overlapping units such as n-grams, word sequences, and word pairs between the generated text and the reference. In addition to ROUGE, the Metric for Evaluation of Translation with Explicit ORdering (METEOR) 50 was explored, which evaluates the generated text by aligning it to the reference and calculating a sentence-level similarity score. Furthermore, the Consensus-based Image Description Evaluation (CIDEr) 51 metric was adopted. CIDEr is an automatic evaluation metric that weights the n-grams of the generated and reference texts by term frequency-inverse document frequency (TF-IDF) and then compares them using cosine similarity. Another metric used is the BiLingual Evaluation Understudy (BLEU) score 52, which is a geometric mean of n-gram scores from 1 to 4. For this task, the focus was on the BLEU-1 score, which takes into account unigram precision. Bilingual Evaluation Understudy with Representations from Transformers (BLEURT) 53 is specifically designed to evaluate natural language generation in English. It uses a pre-trained model that has been fine-tuned to emulate human judgments about the quality of the generated text. CLIPScore 54 diverges from traditional reference-based evaluations of image captions. Instead, it aligns with the human approach of judging caption quality without references by evaluating the alignment between the text and the image content.
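Several of these metrics are available off the shelf, e.g., via the Hugging Face evaluate library; the sketch below illustrates their use, although the challenge's exact metric configurations may differ:

```python
# Sketch of caption scoring with the Hugging Face `evaluate` library; the
# metric configurations shown are assumptions, not the challenge setup.
import evaluate

preds = ["chest x-ray showing right lower lobe consolidation"]
refs = ["chest radiograph demonstrating consolidation in the right lower lobe"]

bertscore = evaluate.load("bertscore")
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

print(bertscore.compute(predictions=preds, references=refs, lang="en")["f1"])
print(rouge.compute(predictions=preds, references=refs)["rouge1"])
# BLEU-1: unigram precision only (max_order=1); one reference per prediction
print(bleu.compute(predictions=preds,
                   references=[[r] for r in refs],
                   max_order=1)["bleu"])
```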
For the caption prediction subtask at ImageCLEFmedical 2023, the best team achieved a BERTScore of 0.6413 with an encoder-decoder framework with subsequent reinforcement learning 55 .Additional results are shown in Tables 2 and 3.
Table 2. Performance of the participating teams in the ImageCLEFmedical 2023 caption prediction subtask. Only the best run based on the achieved BERTScore is listed for each team, together with the corresponding secondary ROUGE score, as well as the team rankings based on the primary BERTScore and secondary ROUGE score. Additional scores are shown in Table 3. The full results are shown in the overview paper 30. The best results are highlighted in bold. For comparison, the baseline results on the ImageCLEF dataset and on the ROCOv2 dataset are included at the bottom.

Table 3. Performance of the participating teams in the ImageCLEFmedical 2023 caption prediction subtask for the additional metrics BLEURT, BLEU, METEOR, CIDEr, and CLIPScore. These correspond to the best BERTScore-based runs of each team, listed in Table 2. The full results are shown in the overview paper 30. The best results are highlighted in bold. For comparison, the baseline results on the ImageCLEF dataset and on the ROCOv2 dataset are included at the bottom.

As a baseline for the caption prediction task, a model leveraging a vision encoder-decoder architecture 58 was employed. The encoder component was instantiated with the base-sized (available at https://huggingface.co/google/vit-base-patch16-224-in21k, accessed 2023-11-10) Vision Transformer (ViT) model 59, which is based on the transformer architecture. The decoder component was instantiated with BioMedLM, a language model trained on biomedical text from PubMed. This training corpus provides BioMedLM with a strong understanding of biomedical language, making it uniquely suited for tasks in the biomedical domain. The model was trained using the Huggingface Transformers library for two epochs with a batch size of two and a gradient accumulation of two. The maximum input sequence length was set to 128 tokens. The Adafactor optimizer was chosen due to its efficiency in memory usage and its adaptability in adjusting learning rates. In addition, fp16 mixed precision was used during training.
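A hedged sketch of how such an encoder-decoder can be assembled with the Transformers library is given below; the BioMedLM checkpoint name and the special-token wiring are assumptions:

```python
# Sketch of the baseline caption model: ViT encoder plus a biomedical
# language-model decoder; "stanford-crfm/BioMedLM" is an assumed checkpoint.
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # encoder named in the text
    "stanford-crfm/BioMedLM",             # assumed decoder checkpoint
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/BioMedLM")

# wire up the special tokens needed for sequence generation
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.max_length = 128  # maximum sequence length from the text
```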

The baseline model was first trained and evaluated on the ImageCLEFmedical Caption dataset. During this initial evaluation, the model demonstrated good performance, achieving a BERTScore of 0.6217 and a ROUGE score of 0.2318. It also obtained a CLIPScore of 0.8172, a BLEURT score of 0.3093, a BLEU score of 0.1821, a METEOR score of 0.0813, and a CIDEr score of 0.1968. The same model architecture was then trained and evaluated on the ROCOv2 26 dataset. The model achieved a BERTScore of 0.6241 and a ROUGE score of 0.2325, along with a CLIPScore of 0.8212, a BLEURT score of 0.3131, a BLEU score of 0.1836, a METEOR score of 0.0830, and a CIDEr score of 0.2026.

Dataset Analysis
Table 4 shows the descriptive statistics derived from the dataset. On average, each caption contains about 21 words but can range from one word to 848 words. Each article contains about 1.76 image-caption pairs on average, with some articles having up to 28 image-caption pairs. In addition, captions are annotated with about 3.36 UMLS concepts on average, with some captions having as many as 28 concepts. The histogram in Figure 3 shows the distribution of the number of concepts per image-caption pair. All captions have at least one concept. Table 5 lists the ten most common UMLS concepts found in the dataset, excluding modality-related concepts that were set manually. The ten most common semantic types (TUIs) are outlined in Table 6. These types provide insights into various medical concepts, ranging from diagnostic procedures to neoplastic processes.

Annotator-Radiologist Evaluation of Manual Concepts
To assess the overall quality and validity of the manually created concepts, an evaluation of the agreement between annotators and a radiologist was performed. To determine this, a representative subset of the dataset was created for each manual concept category: (i) modality of all images, (ii) displayed body region of X-ray modality images, and (iii) directionality of X-ray modality images. Representativeness was ensured by stratifying the corresponding labels. A radiologist then manually labeled each subset, independently going through the same process as the annotators and following the same labeling guidelines for each category.
Subsets were extracted from an internal, raw version of the original ImageCLEFmedical Caption 2023 dataset that additionally included the labels OTHER and UNKNOWN for later filtering and refinement purposes (e.g., out-of-class images, mixed-class images, uncertainty regarding distinct label assignment). Their exact meaning for each category is described in the annotation guidelines in the supplementary material. This was done so as not to bias the evaluation through premature filtering that might have excluded images a radiologist would have considered valid.
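The paper reports agreement in ranges such as "moderate" and "substantial", terminology commonly associated with Cohen's kappa; a minimal sketch of such an agreement computation, on hypothetical labels, is shown below:

```python
# Sketch of an annotator-radiologist agreement computation using Cohen's
# kappa; the labels below are toy examples, not dataset annotations.
from sklearn.metrics import cohen_kappa_score

annotator   = ["DRCT", "DRMR", "DRXR", "DRUS", "DRCT"]
radiologist = ["DRCT", "DRCT", "DRXR", "DRUS", "DRCT"]

kappa = cohen_kappa_score(annotator, radiologist)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```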

Limitations
The entire dataset is sourced from the Open Access Subset of the PMC database. This naturally introduces a bias in terms of the selected images on the one hand, and quality issues inherent to PMC on the other. One example of such quality issues is a lower image quality in the PMC archive compared to the published article. Another issue, very rare but often impossible to correct manually, is the occasional mix-up of images, where images are reproduced with the wrong captions, sometimes taken from a different publication.
A fundamental limitation is represented by faulty or fuzzy original captions that serve as ground truth. This became apparent during the process of manual concept creation and evaluation, where the annotators and the radiologist involved reported various discrepancies. Affected captions involved, e.g., modality confusion, where obvious CT images were labeled as MRI images, or ambiguous statements, where pure CT images were labeled as PET/CT images because they were captured with a combined scanner unit. Another common issue was a lack of detail and context in the original captions. This included, for example, not specifying the modality or, in the case of follow-up images in a series, providing no context on the depicted region. Thus, samples may lack a sufficient set of concepts needed to comprehend the contents and context of an image.
Due to the inherent imbalance in modality distribution, certain modalities such as positron emission tomography or combined modalities as per the ROCO definition are relatively rare. Although this distribution reflects the rarity of these modalities within publications in general, it should be taken into account if the dataset is to be used in the context of rare modalities.
Semantic types representing concepts identifiable from images were selected based on consensus and best effort, but not by dedicated medical personnel, due to the lack of a well-defined process and resources.
Additionally, to maximize sample size and variety, the dataset includes all images with at least one concept, i.e., also images annotated with only a single concept. However, this may be a limiting factor, as single-concept images may lack the complexity required for comprehensive model training. Therefore, users are advised to consider the impact of single-concept images on the effectiveness of their models and adjust their selection criteria accordingly.
Several limitations apply to the manually created concepts for image modalities, body regions (X-ray only), and projection directionalities (X-ray only), which are meant to validate automatically generated concepts and to substitute missing ones. The validation performed by a radiologist showed a generally high agreement for modality- and (X-ray only) body region-related concepts. Regarding modality, only the positron emission (DRPE) and angiography (DRAN) modalities showed agreement in the moderate and substantial range, respectively. For the positron emission modality, this can be explained by the low sample count (n = 2) in the evaluation, which strongly amplifies occasional disagreement. For the angiography modality, images labeled as X-ray modality by the radiologist can be ascribed to a conservative stance, e.g., barely visible traces of contrast agent without explicit mention of a performed angiography within the caption may have been labeled as X-ray modality. Regarding the (X-ray only) body region, only the pelvis and spine regions showed agreement in the moderate and substantial range, respectively. For both, this can be explained by limitations in the determination of anatomical regions. For example, diseases such as osteoarthritis or advanced osteonecrosis of the hip might affect both the acetabulum and the femoral head and thus might be categorized as either pelvis or lower extremity, leading to potential labeling inconsistencies. These will be resolved as part of future work through the implementation of multi-label assignment. Additionally, some anatomical regions, such as those focused on the soft tissue of the neck, could not be assigned a class at all due to minor limitations of the IRMA classification system. For the (X-ray only) directionality-related concepts, however, the achieved moderate agreement cannot be explained by the radiologist's conservative stance alone. It further indicates the complexity of the task, as even for experienced professionals, distinguishing between anteroposterior and posteroanterior directionalities is non-trivial without additional context. A general problem lies in the very common lack of such context, as well as in the deliberate reduction of directional complexity to only four classes, which occasionally do not allow sufficient differentiation. For instance, while labeling the directionality of a dorsopalmar hand radiograph as coronal posteroanterior is not entirely accurate, this was done here so as not to leave a notable number of X-ray samples without a directionality label. Due to these limitations, the manual concepts have been documented separately, so that dataset users can decide for themselves whether to use or exclude them.

Usage Notes
The images are provided exactly as they appear in the PMC Open Access Subset archive and must be resized or cropped before being used in a machine learning workflow. Also note that the images provided in the PMC Open Access Subset may be of different quality than the images included in the journal. Some possible use cases for the ROCOv2 26 dataset include pre-training models for handling radiological images, building systems to support structured medical reporting, and building multi-label medical concept classification and caption prediction models as done in the ImageCLEFmedical Caption tasks. Another use case is the evaluation of deep learning models for multi-task learning.

Label Matching criteria
General information: For each image, the caption is to be checked whenever the modality cannot be determined with very high certainty from its visual features. If the caption does not provide sufficient information or is ambiguous, checking the image's context within its original publication is required. Images may comprise annotations, e.g., arrows, markers, or sketches, up to the limit where these dominate the visual content. Images with a veterinary focus are to be included.

DRUS
Classic 2D/3D ultrasound images. These may also represent user interface screenshots or comprise additional data on, e.g., flow series, Doppler visualization, or location.

DRXR

Classic 2D X-ray projections, but also X-ray panorama shots of the jaw region. The latter need to be differentiated from Cone Beam CT (CBCT) shots, which share a certain resemblance. Methods using contrast media, e.g., in barium swallow, urethrography, arthrogram, or cholangiography, are considered regular X-ray projections. However, methods using contrast media to specifically highlight blood vessels have to be considered as angiography modality (DRAN).

OTHER
Label that pools diverse non-radiological and out-of-class images, later to be excluded from the raw dataset.This comprises compound images that incorrectly passed the compound filtering mechanism, vast 3D rendered images, synthetically generated images (e.g., by generative adversarial networks, or diffusion networks), photographs/sketches/collages (e.g., scenes, schematics, workflows, or multiple images of different contexts), non-medical modalities (e.g., transmission electron microscopy, or satellite images), non-supported modalities (e.g., optical coherence tomography (OCT), endoscopy, or histological slices), images representing a supported modality but not showing a specific medical context (e.g., assessment of materials, phantoms, or artifacts), images that contained captions as image parts and thus could leak information when using optical character recognition (OCR) techniques, and several others.

UNKNOWN
Label that pools images which cannot be labeled with certainty due to insufficient information from visual features, captions, and publication context, later to be excluded from the raw dataset.

Figure 1.
Figure 1. Schematic overview of the dataset creation workflow. Based on Pelka et al. 1

Figure 2.
Figure 2. Concept extraction and caption pre-processing workflow. Radiology image taken from Ali et al. 29

Figure 4.
Figure 4. Annotator-Physician evaluation on manually labeled image modality for a representative subset of n = 400 samples. Confusion matrix (normalized).

Figure 5.
Figure 5. Annotator-Physician evaluation on manually labeled image modality for a representative subset of n = 400 samples. Confusion matrix (absolute).


Figure 6.
Figure 6. Annotator-Physician evaluation on manually labeled body regions in X-ray modality images for a representative subset of n = 400 samples. Confusion matrix (normalized).

Figure 7.
Figure 7. Annotator-Physician evaluation on manually labeled body regions in X-ray modality images for a representative subset of n = 400 samples. Confusion matrix (absolute).

Figure 8.
Figure 8. Annotator-Physician evaluation on manually labeled directionalities in X-ray modality images for a representative subset of n = 399 samples. Confusion matrix (normalized).

Figure 9.
Figure 9. Annotator-Physician evaluation on manually labeled directionalities in X-ray modality images for a representative subset of n = 399 samples. Confusion matrix (absolute).

Figure S2.
Figure S2. The relative and absolute distribution of CC BY-ND or CC BY-SA licensed PubMed articles from the years 1987 to 2022. Images from these articles were not considered for the dataset based on their license. When comparing the numbers with Figure S1, it can be seen that these licenses became increasingly popular from 2011, with a peak in 2017.
Image width and height distributions of the train, validation, and test sets, separated by modality.
") were removed, resulting in a final dataset of 79,789 images (44,975 articles), with 59,958 images in the training set, 9904 images in the validation set, and 9927 images in the test set with 1947 unique CUIs overall, 1947 in the training set, 1760 in the validation set and 1754 in the test set.The detailed labeling and concept generation workflow is described in the next section.Compared to the dataset used in the ImageCLEFmedical Caption 2023 task, approximately 1500 non-radiological images were removed, further improving the quality of the dataset.Of the 81,825 radiology images in the original ROCO dataset, 33,645 were incorporated into ROCOv2 26 , with the rest being excluded due to their license.They were combined with 46,144 images which have been used in ImageCLEFmedical Caption challenge datasets from 2021 to 2023.

Table 4.
Descriptive statistics of the ROCOv2 dataset.


Table 5.
The ten most common UMLS concepts.

Table 6.
The ten most common semantic types (TUIs). Note that each caption can contain several CUIs of the same semantic type.

Table S13.
Distribution of non-English captions.
DRAN

2D X-ray projections that show the application of contrast media for the specific purpose of highlighting blood vessels, e.g., ventriculography, aortography, or arterio-/venography. Angiographic methods performed using CT/MRI are not to be considered as part of this class, but as CT modality (DRCT) or MRI modality (DRMR), respectively.

DRCO

Combination of supported modalities, e.g., PET-CT/MRI, or SPECT-CT/MRI. However, CT/MRI scans with angiographic components are not considered.

DRCT

Classic CT scan images, e.g., single slices, or accumulations of few slices. Images involving contrast media are considered regular CT scans, including those with angiographic components. Vast 3D representations rendered from CT scan volume information are not considered. Combined with other imaging techniques, e.g., PET, SPECT, or likewise, images need to be considered as combined modality (DRCO).

DRMR

Classic MRI scan images, e.g., single slices, or accumulations of few slices. Images involving contrast media are considered regular MRI scans, including those with angiographic components. Vast 3D representations rendered from MRI scan volume information are not considered. Combined with other imaging techniques, e.g., PET, SPECT, or likewise, images need to be considered as combined modality (DRCO).

DRPE

Solely classic PET scans. Other similar but not positron emission-based methods, e.g., SPECT or scintigraphy, are not considered. PET scans combined with CT/MRI have to be considered as combined modality (DRCO).