A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification

Foundation models, often pre-trained with large-scale data, have achieved remarkable success in jump-starting various vision and language applications. Recent advances further enable adapting foundation models to downstream tasks efficiently using only a few training samples, e.g., in-context learning. Yet, the application of such learning paradigms in medical image analysis remains scarce due to the shortage of publicly accessible data and benchmarks. In this paper, we focus on approaches for adapting foundation models to medical image classification and present a novel dataset and benchmark for their evaluation, i.e., examining the overall performance of adapting large-scale foundation models to a set of diverse real-world clinical tasks. We collect five sets of medical imaging data from multiple institutes targeting a variety of real-world clinical tasks (22,349 images in total), i.e., thoracic disease screening in X-rays, pathological lesion tissue screening, lesion detection in endoscopy images, neonatal jaundice evaluation, and diabetic retinopathy grading. Results of multiple baseline methods are reported on the proposed dataset from both accuracy and cost-effectiveness perspectives.


Background & Summary
In the new trend of training ever larger and more universal foundation models (e.g., Vision Transformers 1 , GPTs 2 , PubmedBERT 3 , and CLIP 4 ) using thousands of millions of data samples (sometimes in multiple modalities), developing cost-effective model adaptation methods for specific applications becomes the new gold, especially when it demands only very few data samples. On the other side, the shortage of publicly accessible datasets in medical imaging has largely blocked the development and application of large-scale deep learning models (trained from scratch) in many clinical downstream tasks. This is because obtaining quality annotations remains a tedious task for medical professionals, e.g., hand-labeling volumetric data repeatedly. Providing a few textbook sample cases is more feasible and mirrors the training process of medical residents. In the domain of medical image analysis, it is even more valuable to promote such learning paradigms, since diseased cases are often rare in comparison to the large normal population.
The common fine-tuning scheme 5 with ImageNet 6 pre-trained models can diminish the need for the large-scale data required by the train-from-scratch scheme. However, it still requires a fair amount of data for faster fine-tuning while avoiding overfitting. Alternatively, few-shot methods can draw more on the distinctive representations produced by the foundation models, which have succeeded in a considerable number of language modeling 7 and vision 8,9 tasks. The existing techniques for adapting foundation models in medical image analysis 10,11 demand dedicated medical pre-trained models, which are hard to produce even if self-supervised learning is utilized. Recently, cutting-edge techniques, e.g., prompt-based learning 12,13 , leverage foundation models pre-trained (via self-supervised learning, e.g., DINO 14 and MAE 15 ) using vast amounts of data from multiple modalities and domains and transfer these universal representations to tasks with very limited data 16,17 . This fundamental difference in technical route has started reshaping the landscape of medical image analysis. Therefore, there is an urgent demand to set up datasets and benchmarks to promote innovation in this fast-moving research field and to properly evaluate the performance gain and other cost-effectiveness aspects. There are benchmarks 18,19 for few-shot learning tasks. Nonetheless, they focus more on each individual data modality and task. Here, we instead promote the generalizability of few-shot learning methods, i.e., strengthening their overall performance on various data modalities and tasks.
Table 1. Data summary of MedFMC.
In this paper, we propose a novel dataset, MedFMC, with 22,349 images in total, which encapsulates five representative medical image classification tasks from real-world daily clinical routines. Fig. 1 presents sample images from each subset, and Table 1 shows the summary of the data, including modality, number of samples, image size, classification tasks, and number of classes. Different from many existing public datasets in the medical domain, e.g., Chest X-rays [20][21][22] , MSD 23 , and HAM10000 24 , the proposed dataset and benchmark do not target advancing and evaluating the performance of each individual task with the conventional fully supervised training paradigm, which may require a larger amount of data for each task. Instead, we believe that this new dataset (as a union) provides valuable support for developing and evaluating generalizable solutions for adapting foundation models to a variety of medical downstream applications, e.g., using a few samples as prompts and the rest for testing in a standardized way across all five tasks. In this study, we focus on 2D medical image classification as a start and cover the most common 2D medical imaging modalities. 3D data and other tasks, e.g., detection and segmentation, will be added and investigated in future work.
The proposed datasets target promoting the following aspects of foundation model adaptation approaches:
• Generalizability: The proposed dataset can examine the generalizability of the evaluated method from multiple perspectives. First, the benchmarked approach should achieve superior performance on all five prediction tasks, which vary widely in data modality and image characteristics. Additionally, the five subsets of data are diverse in image sizes, numbers of samples, and classification tasks (e.g., multi-class, multi-label, and regression ones), as shown in Fig. 2a.
• Performance on Rare Diseases (Tail Classes): The few-shot learning scheme fits naturally with the long-tailed classification scenario, in which only a few cases of rare diseases are often available for training. Data scarcity also arises in the testing phase, so separate evaluation metrics need to be adopted. The performance of algorithms on these tail classes can better reveal the power of pre-trained models and their adaptation techniques.
• Prediction Accuracy and Adaptation Efficiency: Besides evaluating the prediction accuracy of algorithms, we also pay attention to the efficiency of training (with fewer samples) in terms of the cost of both data and computation. By combining both the accuracy and cost aspects in the evaluation metrics, we expect that advanced methods can further ease the effort of obtaining quality annotations while lowering the demand for computational resources.
Illustratively, we present the benchmarking results of several common learning paradigms, e.g., fine-tuning and few-shot approaches. During the training phase, a small amount of randomly picked data (a few samples, i.e., 1, 5, and 10) is utilized for the initial training, and the rest of the dataset is employed for the validation. Approaches with advanced cross-domain knowledge transfer techniques are expected to achieve higher performance scores in such a setting. The final metrics are computed as an average over ten individual runs of the same testing process.

IRB Ethics Review and Exemption
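The presented retrospective research study has been reviewed by each involved institute individually, and patients consented to data sharing and the open publication of the data (otherwise waived as detailed below). The ChestDR is approved by Fengcheng People's Hospital Ethics Committee (Ref. 2020 YiYanLunShen No.016) and Huanggang Hospital of Traditional Chinese Medicine Medical Research Ethics Committee (Ref. 2020 LunShen No.003). The NeoJaundice is approved by Xuzhou Central Hospital Ethics Committee (Ref. XZXYLQ-20180517-008), and patients' consent was signed by the guardians of the children. The Retino is approved by Shanghai Tenth People's Hospital Ethics Committee (Ref. SHSY-IEC-4.1/20-154/01). The Endo is approved by Renji Hospital Ethics Committee. The committee reviewed and waived consent since the research was a retrospective study, and the risk of disclosing patient privacy via the studied snapshot images was minimized. The ColonPath is derived from part of the DigestPath 2019 challenge data, accessible via https://digestpath2019.grand-challenge.org/Dataset/, which was originally approved by the Histo Pathology Diagnostic Center Ethics Committee with patients' consent waived.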

Shared Pipeline for Data Collection and Annotation
Fig. 2b illustrates the general data sample collection and annotation pipeline. MedFMC is composed of data from five different modalities in medical imaging, i.e., chest radiography, pathological images, endoscopy photos, dermatological images, and retinal images. The entire process consists of three major steps. First, the original data are listed and fetched from various systems, e.g., X-rays in the picture archiving and communication system (PACS), blood test results in the Health Information System (HIS), endoscopy photos in the workstations, etc. The detailed processes vary from modality to modality and are introduced individually below. Then, standardized anonymization of patient information (mainly in the DICOM images) is performed before the data leave the hospitals, using the DICOM Anonymizer tool provided by the RSNA MIRC 25 . All image data are converted into 12-bit PNG images while the original image sizes are preserved. All image samples are manually examined to redact any privacy-related text or objects recorded in the images. Finally, a two-stage annotation process is conducted. Initial labels are generated first, e.g., annotations by medical trainees, blood test results extracted from the HIS, and grading predictions from a model pre-trained on public datasets. Senior professionals with over ten years of experience in their specialty, e.g., radiologist, pathologist, gastroenterologist, ophthalmologist, and pediatrician, then verify the annotation for each image. In the following sections, we discuss the specific settings for each subset.

ChestDR: Thoracic Diseases Screening in Chest Radiography
Chest X-ray is a regularly adopted imaging technique in daily clinical routine. Many thoracic diseases are reported, and further examinations are recommended for differential diagnoses. Due to the large volume of exams and fast reporting requirements in certain emergency facilities, swift screening and reporting of common thoracic diseases could largely improve the efficiency of the clinical process. Although a few chest X-ray datasets [20][21][22] are now publicly available, images with quality annotations (preferably verified by radiologists) are still a desired resource for training and evaluating the models.
A total of 4,848 frontal radiography images (from 4,848 patients) are provided in ChestDR, collected from two regional hospitals in Hubei and Jiangxi Province, China. A detailed distribution of 19 common thoracic diseases is presented in Fig. 3, sorted by the number of samples. Tail classes are highlighted in red. Each PNG image is converted from the original DICOM files using the default window level and width (stored in the DICOM tags). The original image sizes are preserved. The initial disease labels are provided by a radiological resident (with the support of previously signed radiology reports) and then confirmed by a senior radiologist.
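The window-based conversion mentioned above can be illustrated with a minimal sketch. It assumes a simplified linear window (plain clip and scale) and a 12-bit output range; the function name `apply_window` and the example values are hypothetical, and the exact formula used by the authors' conversion tool is not stated in the paper. In DICOM files the window level and width are typically stored in the Window Center (0028,1050) and Window Width (0028,1051) attributes.

```python
def apply_window(value, center, width, out_max=4095):
    """Map one raw intensity to [0, out_max] with a simplified linear window.

    Assumption: plain clip-and-scale. The DICOM standard's exact formula
    includes small half-pixel offsets that are omitted here.
    """
    low = center - width / 2.0   # intensities at or below this map to 0
    high = center + width / 2.0  # intensities at or above this map to out_max
    if value <= low:
        return 0
    if value >= high:
        return out_max
    return round((value - low) / (high - low) * out_max)


# Hypothetical window: center 2048, width 4096, i.e. the range [0, 4096].
dark = apply_window(-100, center=2048, width=4096)
mid = apply_window(1024, center=2048, width=4096)
bright = apply_window(5000, center=2048, width=4096)
```

Applying the function to every pixel of a DICOM frame would yield the windowed image; a real pipeline would vectorize this step rather than loop over pixels.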

ColonPath: Lesion Tissue Screening in Pathology Patches
Pathology examination can support detecting early-stage cancer cells in small tissue slices. In their daily routine, pathologists are required to look over several dozen tissue slides, a tiresome and tedious job. In clinical diagnosis, quantifying cancer cells and regions is the primary goal for pathologists. Approaches for classifying pathological tissue patches are desired to ease this process. They can help screen, in a sliding-window manner, whether regions of malignant cells exist in the entire slide.
The pathology whole slide images (WSIs) were originally collected from the Histo Pathology Diagnostic Center and were also published and utilized in the DigestPath Challenge 2019 26 . Only the data for the lesion segmentation task are employed in this study. All WSIs were acquired during 2017-2019 with hematoxylin and eosin (HE) stains and scanned using the KF-BIO FK-Pro-120 slide scanner. Subsequently, the WSIs were re-scaled to ×20 magnification with a pixel resolution of 0.475 µm. Tissue patches are extracted from each WSI in a sliding-window fashion with a fixed size of 1024 × 1024 and a stride of 768. A total of 10,009 large tissue patches (with a uniform size of 1024 × 1024) from the colonoscopy pathology examinations of 396 patients are available in ColonPath. Positive and negative patch samples (with and without lesion tissue, computed based on the existing lesion region labels) are illustrated in Fig. 4a along with the number of samples in each category. The initial labels (whether a patch contains lesion tissue) are provided by a trainee in the pathology specialty (with the support of the computed labels) and then confirmed by a senior pathologist.
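The sliding-window extraction with a 1024 × 1024 patch and a stride of 768 can be sketched as follows. This is only an illustration: the helper names are hypothetical, the slide size in the example is made up, and the handling of slide borders (here, leftover margins are simply skipped) is an assumption that the paper does not specify.

```python
def sliding_window_origins(length, patch=1024, stride=768):
    """Return the start offsets of patches of size `patch` along one axis.

    Assumption: no extra window is added to cover a leftover margin.
    """
    if length < patch:
        return []
    return list(range(0, length - patch + 1, stride))


def patch_boxes(width, height, patch=1024, stride=768):
    """List (left, top, right, bottom) boxes for a slide of width x height pixels."""
    return [
        (x, y, x + patch, y + patch)
        for y in sliding_window_origins(height, patch, stride)
        for x in sliding_window_origins(width, patch, stride)
    ]


# Hypothetical slide region of 4096 x 2560 pixels.
boxes = patch_boxes(width=4096, height=2560)
overlap = 1024 - 768  # neighbouring patches share this many pixels per axis
```

With these settings, adjacent patches overlap by 256 pixels, so a lesion cut by one patch border is usually seen whole in a neighbouring patch.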

NeoJaundice: Neonatal Jaundice Evaluation in Skin Photos
Jaundice commonly occurs in newborn infants. However, most jaundice is benign and does not require any intervention. Conventionally, newborns must be monitored by taking blood tests to examine the bilirubin level. The potential toxicity of bilirubin might lead to severe hyperbilirubinemia and, in rare cases, acute bilirubin encephalopathy or kernicterus. Recent techniques utilize skin photos of three different parts of the infant, i.e., head, face, and chest, to estimate the total serum bilirubin in the blood so as to avoid repeated invasive blood tests for infants.
A total of 2,235 images of 745 infants (with an average size of 567 × 567) are collected in the NeoJaundice dataset from Xuzhou Central Hospital. The initial binary labels are generated using the total serum bilirubin readings extracted from the hospital's health information system with a threshold of 12.9 mg/dL and then confirmed by a senior pediatrician. Samples of both low and high bilirubin levels are illustrated in Fig. 4b along with the number of samples in each category. Three images are acquired for each infant, on the skin of the head, face, and chest, using digital cameras. The skin regions are surrounded by a standardized color card for color calibration purposes.

Endo: Lesion Classification in Colonoscopy Images
Colorectal cancer is one of the most common and fatal cancers among men and women around the world. Abnormalities like polyps and ulcers are precursors to colorectal cancer and are often found in colonoscopy screening of people aged above 50. The risks increase substantially with age. Colonoscopy is the gold standard for the detection and early diagnosis of such abnormalities, with biopsy on site when necessary, which could significantly affect the survival rate from colorectal cancer. Automatic detection of such lesions during the colonoscopy procedure could prevent missed lesions and ease the workload of gastroenterologists.
A total of 3,865 images of 80 patients (with an average size of 1280 × 1024), recorded during colonoscopy examinations on the workstations in Renji Hospital, are included in the Endo dataset. Four types of lesions, i.e., ulcer, erosion, polyp, and tumor, are included, which are illustrated in Fig. 5a along with the number of samples in each category. Non-relevant images are already excluded, while some noisy and degraded recordings remain to reflect the real-world data distribution. These noisy data are mainly caused by motion during the operation; they make up only a small portion (<5%) of the images and are often labeled as containing none of the target lesions. The initial lesion labels are provided by a junior gastroenterologist (with the support of health records and reports) and then confirmed by a senior gastroenterologist.

Retino: Diabetic Retinopathy Grading in Retina Images
Diabetic retinopathy (DR) can lead to vision loss and blindness in patients with diabetes, mainly affecting the blood vessels in the retina. Therefore, it is important to have an exam of the retina each year for the early detection of DR. Currently, DR grading requires a trained ophthalmologist to manually evaluate color fundus photos of the retina, which is time-consuming and may delay the treatment of patients. The need for automated screening of DR has long been recognized.
A total of 1,392 patients' fundus images (one from each patient, with an average size of 2736 × 1824) from Shanghai Tenth People's Hospital are included in the Retino dataset, extracted from the retinal imaging workstations after the examination. Images are captured by Canon nonmydriatic fundus cameras that mainly adopt the 45° macula-centered imaging protocol. Samples of retina images in each of the five grades are illustrated in Fig. 5b along with the number of samples in each grade. A DenseNet-121 (with ImageNet pre-trained model weights) is first fine-tuned using the dataset from Kaggle's "Diabetic Retinopathy Detection" challenge and used to produce a prediction for each image. Then, an ophthalmologist with over ten years of experience re-examines each image based on the automatically generated prediction, i.e., the presence of diabetic retinopathy on a scale of 0 to 4 (0: No DR; 1: Mild; 2: Moderate; 3: Severe; 4: Proliferative DR).

Data Records
Each dataset in MedFMC consists of all image data in an "images" folder and the associated image-level labels for each image in a CSV file. Multi-label tasks (i.e., ChestDR and Endo) have multiple columns with either 1 or 0 that represent the existence of the corresponding disease patterns. Binary and multi-class classification tasks (i.e., ColonPath, NeoJaundice, and Retino) have only a single label with the individual class number. The images are named differently across institutes, i.e., named with a random ID (ChestDR and NeoJaundice) or with a random ID together with the date of collection, not of the examination (ColonPath, Endo, and Retino).
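As an illustration of the two label layouts described above, the following sketch reads both into a common structure using only the Python standard library. The column names (`img_id`, `label`, and the lesion names as headers), the file contents, and the helper `read_labels` are hypothetical; the actual CSV headers in the released files may differ.

```python
import csv
import io

# Hypothetical file contents; real column names may differ.
MULTI_LABEL_CSV = """img_id,ulcer,erosion,polyp,tumor
a001.png,0,1,0,0
a002.png,1,0,1,0
"""
SINGLE_LABEL_CSV = """img_id,label
r001.png,0
r002.png,3
"""


def read_labels(text, id_column="img_id"):
    """Return {image_id: [int labels]} for either label layout.

    Multi-label files give one 0/1 entry per disease column;
    single-label files give a one-element list with the class number.
    """
    reader = csv.DictReader(io.StringIO(text))
    label_columns = [c for c in reader.fieldnames if c != id_column]
    return {row[id_column]: [int(row[c]) for c in label_columns] for row in reader}


endo_like = read_labels(MULTI_LABEL_CSV)
retino_like = read_labels(SINGLE_LABEL_CSV)
```

To read a file from disk instead of a string, the same `csv.DictReader` can be given an open file handle.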

Technical Validation
Dataset Partition: Each image subset is divided into two parts: the few-shot pool and the testing subset. The few-shot pool consists of the samples from about 20% of the patients, randomly selected, and the count of each class in the pool must be larger than 10. The remaining samples are used for testing. In transfer learning, we use all the images in the few-shot pool for training and validate the deep-learning-based classifier models on the testing subset. In the few-shot setting, we randomly pick images of 1, 5, and 10 patients for each class from the few-shot pool to build the support set, and the testing subset is reserved for the model evaluation.
We provide the data list of the few-shot pool and testing set together with sample lists of few-shot images in the repository (see the Code availability section).
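The partition described above can be sketched as a patient-level split followed by per-class sampling of support patients. This is a simplified illustration rather than the released partition code: the function names and the toy records are hypothetical, each record is assumed to carry a single class label, and the requirement that every class has more than 10 samples in the pool is not enforced here.

```python
import random
from collections import defaultdict


def split_by_patient(patient_ids, pool_fraction=0.2, seed=0):
    """Split patients (not images) into a few-shot pool and a testing set."""
    rng = random.Random(seed)
    unique = sorted(set(patient_ids))
    rng.shuffle(unique)
    n_pool = max(1, round(len(unique) * pool_fraction))
    return set(unique[:n_pool]), set(unique[n_pool:])


def sample_support(records, pool_patients, k, seed=0):
    """Pick up to k pool patients per class and return all of their records.

    records: list of (image_id, patient_id, class_label) tuples.
    """
    rng = random.Random(seed)
    patients_by_class = defaultdict(set)
    for _, patient, label in records:
        if patient in pool_patients:
            patients_by_class[label].add(patient)
    chosen = set()
    for label in sorted(patients_by_class):
        candidates = sorted(patients_by_class[label])
        chosen.update(rng.sample(candidates, min(k, len(candidates))))
    return [r for r in records if r[1] in chosen]


# Toy data: 50 patients, two images each, two classes.
records = [(f"img{p}_{i}", p, p % 2) for p in range(50) for i in range(2)]
pool, test = split_by_patient([r[1] for r in records])
support = sample_support(records, pool, k=1)
```

Splitting by patient rather than by image keeps all images of one patient on the same side of the partition, which avoids leakage between the support set and the testing subset.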

Few-shot Learning Baseline
In the experiment, we employ two few-shot baseline methods, i.e., Meta-Baseline 27 and Visual Prompt Tuning (VPT) 16 . Meta-Baseline 27 is chosen here as a classic few-shot method to evaluate across all five datasets. The input images are converted to embedding features via three settings of backbone network and pre-trained model: a 121-layer DenseNet (Dense121) with ImageNet pre-trained weights from supervised learning (SL), and a Swin Transformer (Swin-base) with pre-trained weights from both fully supervised and self-supervised learning (SSL) schemes (SimMIM 28 , a form of Masked Auto-Encoder 15 ). The setting is specified when reporting the performance, as shown in the left column of Table 2. We compute the class centers in the support set using the extracted features and compute the cosine similarities between each image in the testing set and the class centers to determine its category. Additionally, we include VPT as an advanced method that trains visual prompts for the few-shot classification tasks. In this case, a vanilla pre-trained model from the Swin Transformer repository (pre-trained on ImageNet21K and fine-tuned on ImageNet1k) is utilized to initialize the VPT-based few-shot tuning. We repeat the experiment 10 times (randomly picking few-shot samples) on the five medical image datasets and report the averaged testing results.
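The class-center classification step can be sketched as a nearest-center rule under cosine similarity. The sketch assumes that each class center is the mean of the support features of that class and that the task is single-label; the feature vectors, class names, and function names are made up for illustration, and the multi-label subsets (ChestDR and Endo) would need a per-class decision instead of a single arg-max.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_centers(support):
    """support: {label: [feature vectors]} -> {label: center vector}."""
    return {label: mean_vector(features) for label, features in support.items()}


def predict(query, centers):
    """Return the label whose center is most similar to the query feature."""
    return max(centers, key=lambda label: cosine(query, centers[label]))


# Toy 2-D features standing in for backbone embeddings.
support = {
    "normal": [[1.0, 0.1], [0.9, 0.0]],
    "lesion": [[0.1, 1.0], [0.0, 0.8]],
}
centers = class_centers(support)
prediction = predict([0.8, 0.2], centers)
```

Because the backbone stays frozen in this setting, adapting to a new task only requires recomputing the class centers from the few support images.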

Transfer Learning Baseline
We run the fine-tuning experiments using three representative networks, including DenseNet, EfficientNet, and Swin Transformer, on the five medical image datasets. The Swin Transformer model is pre-trained on ImageNet21K with self-supervised learning and then fine-tuned on ImageNet1k with labels. The others are also pre-trained using ImageNet but with supervised learning. In our experiments, the fine-tuning is performed as linear probing, i.e., only tuning the classifier (fully connected) layers, since the parameters in the representation layers are also frozen for the few-shot baseline methods. We also experimented with fine-tuning the entire network, which could generally improve the performance by 1-2% in accuracy. During the training and inference stages, all the input images are padded and rescaled to 384 × 384 pixels. Common data augmentation tricks, i.e., random crop, resize, and horizontal flip, are adopted. The cross-entropy loss is employed as the loss function for the multi-class classification of three datasets, i.e., ColonPath, NeoJaundice, and Retino, while the binary cross-entropy loss is computed for the multi-label classification of the remaining two datasets, i.e., ChestDR and Endo. The model parameters (except the fully connected classifier layer) are initialized by the ImageNet pre-trained model weights and frozen during the tuning. SGD optimizers with initial learning rates of 0.002 and 0.01 are applied for the model training of DenseNet and EfficientNet, respectively. The Swin Transformer model is optimized by AdamW with an initial learning rate of 0.001. We trained these classification models on a single NVIDIA A100 for 20 epochs at a batch size of 8, using the framework of MMClassification 29 .
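The difference between the two loss functions mentioned above can be made concrete with a small standard-library sketch. It assumes the usual definitions (softmax cross-entropy over mutually exclusive classes, and sigmoid binary cross-entropy averaged over independent labels); the function names and example logits are hypothetical, and the actual experiments rely on the implementations in the training framework.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample with a single class index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def binary_cross_entropy(logits, targets):
    """Sigmoid binary cross-entropy averaged over independent 0/1 labels."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Stable form of -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Multi-class example (e.g., five grades): exactly one class is correct.
multi_class_loss = cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], target=2)

# Multi-label example: several findings may be present at once.
multi_label_loss = binary_cross_entropy([0.0, 0.0], targets=[1, 0])
```

The multi-class loss couples all class scores through the softmax, whereas the multi-label loss treats each disease column as its own binary decision, which is why it suits images that can show several findings at once.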

Figure 1. Overview of the proposed learning-to-vote learning framework.
Figure 2. (a) Diversified tasks and samples. (b) Data collection and annotation.

Figure 5. Data samples and case summary of (a) Endo and (b) Retino.

Table 2. Results of few-shot learning baseline on MedFMC.

Table 3. Results of sub-classes with Meta-Baseline and 10-shot patient data.

Table 4. Results of transfer learning baseline on MedFMC with 10-shot and 20% patient data.