MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification

We introduce MedMNIST v2, a large-scale MNIST-like dataset collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into a small size of 28 × 28 (2D) or 28 × 28 × 28 (3D) with the corresponding classification labels so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various dataset scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression, and multi-label). The resulting dataset, consisting of 708,069 2D images and 9,998 3D images in total, could support numerous research/educational purposes in biomedical image analysis, computer vision, and machine learning. We benchmark several baseline methods on MedMNIST v2, including 2D/3D neural networks and open-source/commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.


Background & Summary
Deep learning based biomedical image analysis plays an important role in the intersection of artificial intelligence and healthcare [1][2][3] .Is deep learning a panacea in this area?Because of the inherent complexity in biomedicine, data modalities, dataset scales and tasks in biomedical image analysis could be highly diverse.Numerous biomedical imaging modalities are designed for specific purposes by adjusting sensors and imaging protocols.The biomedical image dataset scales in biomedical image analysis could range from 100 to 100,000.Moreover, even only considering medical image classification, there are binary/multi-class classification, multi-label classification, and ordinal regression.As a result, it needs large amounts of engineering effort to tune the deep learning models in real practice.On the other hand, it is not easy to identify whether a specific model design could be generalizable if it is only evaluated on a few datasets.Large and diverse datasets are urged by the research communities to fairly evaluate generalization performance of models.
Benchmarking data-driven approaches on various domains has been addressed by researchers.Visual Domain Decathlon (VDD) 4 develops an evaluation protocol on 10 existing natural image datasets to assess the model generalizability on different domains.In medical imaging area, Medical Segmentation Decathlon (MSD) 5 introduces 10 3D medical image segmentation datasets to evaluate end-to-end segmentation performance: from whole 3D volumes to targets.It is particularly important to understand the end-to-end performance of the current state of the art with MSD.However, the contribution of each part in the end-to-end systems could be particularly hard to analyze.As reported in the winning solutions 6,7 , hyperparameter tuning, pre/post-processing, model ensemble strategies and training/test-time augmentation could be more important than the machine learning part (e.g., model architectures, learning scheme).Therefore, a large but simple dataset focusing on the machine learning part like VDD, rather than the end-to-end system like MSD, will serve as a better benchmark to evaluate the generalization performance of the machine learning algorithms on the medical image analysis tasks.
In this study, we aim at a new "decathlon" dataset for biomedical image analysis, named MedMNIST v2.As illustrated in Figure 1, MedMNIST v2 is a large-scale benchmark for 2D and 3D biomedical image classification, covering 12 2D datasets with 708,069 images and 6 3D datasets with 9,998 images.It is designed to be: • Diverse: It covers diverse data modalities, dataset scales (from 100 to 100,000), and tasks (binary/multi-class, multi-label, and ordinal regression).It is as diverse as the VDD 4 and MSD 5 to fairly evaluate the generalizable performance of machine learning algorithms in different settings, but both 2D and 3D biomedical images are provided.
• Standardized: Each sub-dataset is pre-processed into the same format (see details in Methods), which requires no background knowledge for users.As an MNIST-like 8 dataset collection to perform classification tasks on small images, it primarily focuses on the machine learning part rather than the end-to-end system.Furthermore, we provide standard train-validation-test splits for all datasets in MedMNIST v2, therefore algorithms could be easily compared.
• Educational: As an interdisciplinary research area, biomedical image analysis is difficult to hand on for researchers from other communities, as it requires background knowledge from computer vision, machine learning, biomedical imaging, and clinical science.Our data with the Creative Commons (CC) License is easy to use for educational purposes.
MedMNIST v2 is extended from our preliminary version, MedMNIST v1 9 , with 10 2D datasets for medical image classification.As MedMNIST v1 is more medical-oriented, we additionally provide 2 2D bioimage datasets.Considering the popularity of 3D imaging in biomedical area, we carefully develop 6 3D datasets following the same design principle as 2D ones.A comparison of the "decathlon" datasets could be found in Table 1.We benchmark several standard deep learning methods and AutoML tools with MedMNIST v2 on both 2D and 3D datasets, including ResNets 10 with early-stopping strategies on validation set, open-source AutoML tools (auto-sklearn 11 and AutoKeras 12 ) and a commercial AutoML tool, Google AutoML Vision (for 2D only).All benchmark experiments are repeated at least 3 times for more stable results than in MedMNIST v1.Besides, the code for MedMNIST has been refactored to make it more friendly to use.
As a large-scale benchmark in biomedical image analysis, MedMNIST has been particularly useful for machine learning and computer vision research [13][14][15] , e.g., AutoML, trustworthy machine learning, domain adaptive learning.Moreover, considering the scarcity of 3D image classification datasets, the MedMNIST3D in MedMNIST v2 from diverse backgrounds could benefit research in 3D computer vision.

Design Principles
The MedMNIST v2 dataset consists of 12 2D and 6 3D standardized datasets from carefully selected sources covering primary data modalities (e.g., X-ray, OCT, ultrasound, CT, electron microscope), diverse classification tasks (binary/multi-class, ordinal regression, and multi-label) and dataset scales (from 100 to 100,000).We illustrate the landscape of MedMNIST v2 in Figure 2.
As it is hard to categorize the data modalities, we use the imaging resolution instead to represent the modality.The diverse dataset design could lead to diverse task difficulty, which is desirable as a biomedical image classification benchmark.
Although it is fair to compare performance on the test set only, it could be expensive to compare the impact of the train-validation split.Therefore, we provide an official train-validation-test split for each subset.We use the official data split from source dataset (if provided) to avoid data leakage.If the source dataset has only a split of training and validation set, we use the official validation set as test set and split the official training set with a ratio of 9:1 into training-validation.For the dataset without an official split, we split the dataset randomly at the patient level with a ratio of 7:1:2 into training-validation-test.All images are pre-processed into a MNIST-like format, i.e., 28 × 28 (2D) or 28 × 28 × 28 (3D), with cubic spline interpolation operation for image resizing.The MedMNIST uses the classification labels from the source datasets directly in most cases, but the labels could be simplified (merged or deleted classes) if the classification tasks on the small images are too difficult.All source datasets are either associated with the Creative Commons (CC) Licenses or developed by us, which allows us to develop derivative datasets based on them.Some datasets are under CC-BY-NC license; we have contacted the authors and obtained the permission to re-distribute the datasets.
We list the details of all datasets in Table 2.For simplicity, we call the collection of all 2D datasets as MedMNIST2D, and that of 3D as MedMNIST3D.In the next sections, we will describe how each dataset is created.

PathMNIST
The PathMNIST is based on a prior study 16,17 for predicting survival from colorectal cancer histology slides, providing a dataset

ChestMNIST
The ChestMNIST is based on the NIH-ChestXray14 dataset 18 , a dataset comprising 112, 120 frontal-view X-Ray images of 30, 805 unique patients with the text-mined 14 disease labels, which could be formulized as a multi-label binary-class classification task.We use the official data split, and resize the source images of 1 × 1, 024 × 1, 024 into 1 × 28 × 28.

DermaMNIST
The DermaMNIST is based on the HAM10000 [19][20][21] , a large collection of multi-source dermatoscopic images of common pigmented skin lesions.The dataset consists of 10, 015 dermatoscopic images categorized as 7 different diseases, formulized as a multi-class classification task.We split the images into training, validation and test set with a ratio of 7 : 1 : 2. The source images of 3 × 600 × 450 are resized into 3 × 28 × 28.

OCTMNIST
The OCTMNIST is based on a prior dataset 22,23 of 109, 309 valid optical coherence tomography (OCT) images for retinal diseases.The dataset is comprised of 4 diagnosis categories, leading to a multi-class classification task.We split the source training set with a ratio of 9 : 1 into training and validation set, and use its source validation set as the test set.The source images are gray-scale, and their sizes are (384 − 1, 536) × (277 − 512).We center-crop the images with a window size of length of the short edge and resize them into 1 × 28 × 28.

PneumoniaMNIST
The PneumoniaMNIST is based on a prior dataset 22,23 of 5, 856 pediatric chest X-Ray images.The task is binary-class classification of pneumonia against normal.We split the source training set with a ratio of 9 : 1 into training and validation set, and use its source validation set as the test set.The source images are gray-scale, and their sizes are (384 − 2, 916) × (127 − 2, 713).We center-crop the images with a window size of length of the short edge and resize them into 1 × 28 × 28.

RetinaMNIST
The RetinaMNIST is based on the DeepDRiD 24 challenge, which provides a dataset of 1, 600 retina fundus images.The task is ordinal regression for 5-level grading of diabetic retinopathy severity.We split the source training set with a ratio of 9 : 1 into training and validation set, and use the source validation set as the test set.The source images of 3 × 1, 736 × 1, 824 are center-cropped with a window size of length of the short edge and resized into 3 × 28 × 28.

BreastMNIST
The BreastMNIST is based on a dataset 25 of 780 breast ultrasound images.It is categorized into 3 classes: normal, benign, and malignant.As we use low-resolution images, we simplify the task into binary classification by combining normal and benign as positive and classifying them against malignant as negative.We split the source dataset with a ratio of 7 : 1 : 2 into training, validation and test set.The source images of 1 × 500 × 500 are resized into 1 × 28 × 28.

BloodMNIST
The BloodMNIST is based on a dataset 26,27

TissueMNIST
We use the BBBC051 34 , available from the Broad Bioimage Benchmark Collection 28 .The dataset contains 236, 386 human kidney cortex cells, segmented from 3 reference tissue specimens and organized into 8 categories.We split the source dataset with a ratio of 7 : 1 : 2 into training, validation and test set.Each gray-scale image is 32 × 32 × 7 pixels, where 7 denotes 7 slices.We obtain 2D maximum projections by taking the maximum pixel value along the axial-axis of each pixel, and resize them into 28 × 28 gray-scale images.

Organ{A,C,S}MNIST
The Organ{A,C,S}MNIST is based on 3D computed tomography (CT) images from Liver Tumor Segmentation Benchmark (LiTS) 29 .They are renamed from OrganMNIST_{Axial,Coronal,Sagittal} (in MedMNIST v1 9 ) for simplicity.We use boundingbox annotations of 11 body organs from another study 30 to obtain the organ labels.Hounsfield-Unit (HU) of the 3D images are transformed into gray-scale with an abdominal window.We crop 2D images from the center slices of the 3D bounding boxes in axial / coronal / sagittal views (planes).The

OrganMNIST3D
The source of the OrganMNIST3D is the same as that of the Organ{A,C,S}MNIST.Instead of 2D images, we directly use the 3D bounding boxes and process the images into 28

NoduleMNIST3D
The NoduleMNIST3D is based on the LIDC-IDRI 31 , a large public lung nodule dataset, containing images from thoracic CT scans.The dataset is designed for both lung nodule segmentation and 5-level malignancy classification task.To perform binary classification, we categorize cases with malignancy level 1/2 into negative class and 4/5 into positive class, ignoring the cases with malignancy level 3. We split the source dataset with a ratio of 7 : 1 : 2 into training, validation and test set, and center-crop the spatially normalized images (with a spacing of 1mm × 1mm × 1mm) into 28 × 28 × 28.

AdrenalMNIST3D
The AdrenalMNIST3D is a new 3D shape classification dataset, consisting of shape masks from 1,584 left and right adrenal glands (i.e., 792 patients).Collected from Zhongshan Hospital Affiliated to Fudan University, each 3D shape of adrenal gland is annotated by an expert endocrinologist using abdominal computed tomography (CT), together with a binary classification label of normal adrenal gland or adrenal mass.Considering patient privacy, we do not provide the source CT scans, but the real 3D shapes of adrenal glands and their classification labels.We calculate the center of adrenal and resize the center-cropped 64mm × 64mm × 64mm volume into 28 × 28 × 28.The dataset is randomly split into training / validation / test set of 1,188 / 98 / 298 on a patient level.

FractureMNIST3D
The FractureMNIST3D is based on the RibFrac Dataset 32 , containing around 5, 000 rib fractures from 660 computed tomography (CT) scans.The dataset organizes detected rib fractures into 4 clinical categories (i.e., buckle, nondisplaced, displaced, and segmental rib fractures).As we use low-resolution images, we disregard segmental rib fractures and classify 3 types of rib fractures (i.e., buckle, nondisplaced, and displaced).For each annotated fracture area, we calculate its center and resize the center-cropped 64mm × 64mm × 64mm image into 28 × 28 × 28.The official split of training, validation and test set is used.

VesselMNIST3D
The VesselMNIST3D is based on an open-access 3D intracranial aneurysm dataset, IntrA 33 , containing 103 3D models (meshes) of entire brain vessels collected by reconstructing MRA images.1, 694 healthy vessel segments and 215 aneurysm segments are generated automatically from the complete models.We fix the non-watertight mesh with PyMeshFix 35 and voxelize the watertight mesh with trimesh 36 into 28 × 28 × 28 voxels.We split the source dataset with a ratio of 7 : 1 : 2 into training, validation and test set.

SynapseMNIST3D
The SynapseMNIST3D is a new 3D volume dataset to classify whether a synapse is excitatory or inhibitory.It uses a 3D image volume of an adult rat acquired by a multi-beam scanning electron microscope.The original data is of the size 100 × 100 × 100um 3 and the resolution 8 × 8 × 30nm 3 , where a (30um) 3 sub-volume was used in the MitoEM dataset 37 with dense 3D mitochondria instance segmentation labels.Three neuroscience experts segment a pyramidal neuron within the whole volume and proofread all the synapses on this neuron with excitatory / inhibitory labels.For each labeled synaptic location, we crop a 3D volume of 1024 × 1024 × 1024nm 3 and resize it into 28 × 28 × 28 voxels.Finally, the dataset is randomly split with a ratio of 7 : 1 : 2 into training, validation and test set.
• "{train,val,test}_labels": an array containing ground-truth labels, with a shape of N × 1 for multi-class / binary-class / ordinal regression datasets, of N × L for multi-lable binary-class datasets.N denotes the number of samples in training / validation / test set and L denotes the number of task labels in the multi-label dataset (i.e., 14 for the ChestMNIST).

Technical Validation Baseline Methods
For MedMNIST2D, we first implement ResNets 10 with a simple early-stopping strategy on validation set as baseline methods.
The ResNet model contains 4 residual layers and each layer has several blocks, which is a stack of convolutional layers, batch normalization and ReLU activation.The input channel is always 3 since we convert gray-scale images into RGB images.To fairly compare with other methods, the input resolutions are 28 or 224 (resized from 28) for the ResNet-18 and ResNet-50.For all model training, we use cross entropy-loss and set the batch size as 128.We utilize an Adam optimizer 40 with an initial learning rate of 0.001 and train the model for 100 epochs, delaying the learning rate by 0.1 after 50 and 75 epochs.For MedMNIST3D, we implement ResNet-18 / ResNet-50 10 with 2.5D / 3D / ACS 41 convolutions with a simple earlystopping strategy on validation set as baseline methods, using the one-line 2D neural network converters provided in the official 6/11 ACS code repository (https://github.com/M3DV/ACSConv).When loading the datasets, we copy the single channel into 3 channels to make it compatible.For all model training, we use cross-entropy loss and set the batch size as 32.We utilize an Adam optimizer 40 with an initial learning rate of 0.001 and train the model for 100 epochs, delaying the learning rate by 0.1 after 50 and 75 epochs.Additionally, as a regularization for the two datasets of shape modality (i.e., AdrenalMNIST3D / VesselMNIST3D), we multiply the training set by a random value in [0, 1] during training and multiply the images by a fixed coefficient of 0.5 during evaluation.
The details of model implementation and training scheme can be found in our code.

AutoML Methods
We have also selected several AutoML methods: auto-sklearn 11 as the representative of open-source AutoML tools for statistical machine learning, AutoKeras 12 as the representative of open-source AutoML tools for deep learning, and Google AutoML Vision as the representative of commercial black-box AutoML tools, with deep learning empowered.We run auto-sklearn 11 and AutoKeras 12 on both MedMNIST2D and MedMNIST3D, and Google AutoML Vision on MedMNIST2D only.auto-sklearn 11 automatically searches the algorithms and hyper-parameters in scikit-learn 42 package.We set time limit for search of appropriate models according to the dataset scale.The time limit is 2 hours for 2D datasets with scale < 10, 000, 4 hours for those of [10, 000, 50, 000], and 6 hours for those > 50, 000.For 3D datasets, we set time limit as 4 hours.We flatten the images into one dimension, and provide reshaped one-dimensional data with the corresponding labels for auto-sklearn to fit.
AutoKeras 12 based on Keras package 43 searches deep neural networks and hyper-parameters.For each dataset, we set number of max_trials as 20 and number of epochs as 20.It tries 20 different Keras models and trains each model for 20 epochs.We choose the best model based on the highest AUC score on validation set.
Google AutoML Vision (https://cloud.google.com/vision/automl/docs,experimented in July, 2021) is a commercial AutoML tool offered as a service from Google Cloud.We train Edge exportable models of MedMNIST2D on Google AutoML Vision and export trained quantized models into TensorFlow Lite format to do offline inference.We set number of node hours of each dataset according to the data scale.We allocate 1 node hour for dataset with scale around 1, 000, 2 node hours for scale around 10, 000, 3 node hours for scale around 100, 000, and 4 node hours for scale around 200, 000.

Evaluation
Area under ROC curve (AUC) 44 and Accuracy (ACC) are used as the evaluation metrics.AUC is a threshold-free metric to evaluate the continuous prediction scores, while ACC evaluates the discrete prediction labels given threshold (or arg max).AUC is less sensitive to class imbalance than ACC.Since there is no severe class imbalance on our datasets, ACC could also serve as a good metric.Although there are many other metrics, we simply select AUC and ACC for the sake of simplicity and standardization of evaluation.We report the AUC and ACC for each dataset.Data users are also encouraged to analyze the average performance over the 12 2D datasets and 6 3D datasets to benchmark their methods.Thereby, we report average AUC and ACC score over MedMNIST2D and MedMNIST3D respectively to easily compare the performance of different methods.

Benchmark on Each Dataset
The performance on each dataset of MedMNIST2D and MedMNIST3D is reported in Table 3 and Table 4, respectively.We calculate the mean value of at least 3 trials for each method on each dataset.
For 2D datasets, Google AutoML Vision is well-performing in general, however it could not always win, even compared with the baseline ResNet-18 and ResNet-50.Auto-sklearn performs poorly on most datasets, indicating that the typical statistical machine learning algorithms do not work well on our 2D medical image datasets.AutoKeras performs well on datasets with large scales, however relatively worse on datasets with small scale.With the same depth of ResNet backbone, datasets of resolution 224 outperform resolution 28 in general.For datasets of resolution 28, ResNet-18 wins higher scores than ResNet-50 on most datasets.
For 3D datasets, AutoKeras does not work well, while auto-sklearn performs better than on MedMNIST2D.Auto-sklearn is superior to ResNet-18+2.5Dand ResNet-50+2.5D in general, and even outperforms all the other methods in ACC score on AdrenalMNIST3D.2.5D models have poorer performance compared with 3D and ACS models, while 3D and ACS models are comparable to each other.With 3D convolution, ResNet-50 backbone surpasses ResNet-18.

Average Performance of Each Method
To compare the performance of various methods, we report the average AUC and average ACC of each method over all datasets.The average performance of methods on MedMNIST2D and MedMNIST3D are reported in Table 5 and Table 6, respectively.Despite the great gap among the metrics of different sub-datasets, the average AUC and ACC could still manifest the performance of each method.
For MedMNIST2D, Google AutoML Vision outperforms all the other methods in average AUC, however, it is very close to the performance of baseline ResNets.The ResNets surpass auto-sklearn and AutoKeras, and outperform Google AutoML Vision in average ACC.Under the same backbone, the datasets with resolution of 224 win higher AUC and ACC score than resolution of 28.While under the same resolution, ResNet-18 is superior to ResNet-50.For MedMNIST3D, AutoKeras does not perform well, performing worse than auto-sklearn.Under the same ResNet backbone, 2.5D models are inferior to 3D and ACS models and perform worse than auto-sklearn and AutoKeras.Surprisingly, the ResNet-50 with standard 3D convolution outperforms all the other methods on average.

Difference between Organ{A,C,S}MNIST and OrganMNIST3D
Organ{A,C,S}MNIST and OrganMNIST3D are generated from the same source dataset, and share the same task and the same data split.However, samples in the 2D and 3D datasets are different.Organ{A,C,S}MNIST are sampled slices of 3D bounding boxes of 3D CT images in axial / coronal / sagittal views (planes), respectively.They are sliced before being resized into 1 × 28 × 28.On the other hand, OrganMNIST3D is resized into 28 × 28 × 28 directly.Therefore, the Organ{A,C,S}MNIST metrics in Table 3 and the OrganMNIST3D metrics in Table 4 should not be compared.
We perform experiments to clarify the difference between Organ{A,C,S}MNIST and OrganMNIST3D.We slice the OrganMNIST3D dataset in the axial / coronal / sagittal views (planes) respectively to generate the central slices.For each view, we take the 60% central slices when slicing and discard the other 40% slices.We evaluate the model performance on the OrganMNIST3D, with 2D-input ResNet-18 trained with Organ{A,C,S}MNIST and the axial / coronal / sagittal central slices of OrganMNIST3D, as well as 3D-input ResNet-18.The results are reported in Table 7.The performance of 3D-input models is comparable to that of 2D-input models with axial view in general.In other words, with an appropriate setting, the 2D inputs and 3D inputs are comparable on the OrganMNIST3D dataset.

Usage Notes
The MedMNIST can be freely available at https://medmnist.com/.We would be grateful if the users of MedMNIST dataset could cite MedMNIST v1 9 and v2 (this paper), as well as the corresponding source dataset in the publications.Please note that this dataset is NOT intended for clinical use, as substantially reducing the resolution of medical images might result in images that are insufficient to represent and capture different disease pathologies.

Figure 1 .
Figure 1.An overview of MedMNIST v2.MedMNIST is a large-scale MNIST-like collection of standardized 2D and 3D biomedical images with classification labels.It is designed to be diverse, standardized, educational, and lightweight, which could support numerous research / educational purposes.

Figure 2 .
Figure 2. The landscape of MedMNIST v2.The horizontal axis denotes the base-10 logarithm of the dataset scale, and the vertical axis denotes base-10 logarithm of imaging resolution.The upward and downward triangles are used to distinguish between 2D datasets and 3D datasets, and the 4 different colors represent different tasks.
(NCT-CRC-HE-100K) of 100, 000 non-overlapping image patches from hematoxylin & eosin stained histological images, and a test dataset (CRC-VAL-HE-7K) of 7, 180 image patches from a different clinical center.The dataset is comprised of 9 types of tissues, resulting in a multi-class classification task.We resize the source images of 3 × 224 × 224 into 3 × 28 × 28, and split NCT-CRC-HE-100K into training and validation set with a ratio of 9 : 1.The CRC-VAL-HE-7K is treated as the test set.
only differences of Organ{A,C,S}MNIST are the views.The images are resized into 1 × 28 × 28 to perform multi-class classification of 11 body organs.115 and 16 CT scans from the source training set are used as training and validation set, respectively.The 70 CT scans from the source test set are treated as the test set.

× 28 ×
28 to perform multi-class classification of 11 body organs.The same 115 and 16 CT scans as the Organ{A,C,S}MNIST from the source training set are used as training and validation set, respectively, and the same 70 CT scans as the Organ{A,C,S}MNIST from the source test set are treated as the test set.

Table 2 . Data summary of MedMNIST v2 dataset, including data source, data modality, type of the classification task together with the number of classes for multi-class or that of labels for multi-label, number of samples in total and in each data split (training/validation/test). Upper: MedMNIST2D, 12 datasets of 2D images. Lower: MedMNIST3D, 6 datasets of 3D images
. MC: Multi-Class.BC: Binary-Class.ML: Multi-Label.OR: Ordinal Regression.
of individual normal cells, captured from individuals without infection, hematologic or oncologic disease and free of any pharmacologic treatment at the moment of blood collection.It contains a total of 17,092 images and is organized into 8 classes.We split the source dataset with a ratio of 7 : 1 : 2 into training, validation and test set.The source images with resolution 3 × 360 × 363 pixels are center-cropped into 3 × 200 × 200, and then resized into 3 × 28 × 28.

Table 3 .
Benchmark on each dataset of MedMNIST2D in metrics of AUC and ACC.

Table 4 .
Benchmark on each dataset of MedMNIST3D in metrics of AUC and ACC.

Table 5 .
Average performance of MedMNIST2D in metrics of average AUC and average ACC over all 2D datasets.

Table 6 .
Average performance of MedMNIST3D in metrics of average AUC and average ACC over all 3D datasets.