Background & Summary

Deep learning based biomedical image analysis plays an important role at the intersection of artificial intelligence and healthcare1,2,3. Is deep learning a panacea in this area? Because of the inherent complexity of biomedicine, the data modalities, dataset scales and tasks in biomedical image analysis can be highly diverse. Numerous biomedical imaging modalities are designed for specific purposes by adjusting sensors and imaging protocols. Dataset scales in biomedical image analysis range from 100 to 100,000 samples. Moreover, even within medical image classification alone, there are binary/multi-class classification, multi-label classification, and ordinal regression. As a result, tuning deep learning models in real practice requires a large amount of engineering effort. On the other hand, it is not easy to tell whether a specific model design generalizes when it is evaluated on only a few datasets. The research community therefore needs large and diverse datasets to fairly evaluate the generalization performance of models.

Benchmarking data-driven approaches across various domains has been addressed by researchers. The Visual Domain Decathlon (VDD)4 develops an evaluation protocol on 10 existing natural image datasets to assess model generalizability across domains. In the medical imaging area, the Medical Segmentation Decathlon (MSD)5 introduces 10 3D medical image segmentation datasets to evaluate end-to-end segmentation performance: from whole 3D volumes to targets. MSD is particularly valuable for understanding the end-to-end performance of the current state of the art. However, the contribution of each part of an end-to-end system can be particularly hard to analyze. As reported in the winning solutions6,7, hyperparameter tuning, pre/post-processing, model ensemble strategies and training/test-time augmentation could matter more than the machine learning part (e.g., model architectures, learning schemes). Therefore, a large but simple dataset focusing on the machine learning part, like VDD, rather than on the end-to-end system, like MSD, serves as a better benchmark to evaluate the generalization performance of machine learning algorithms on medical image analysis tasks.

In this study, we present a new “decathlon” dataset for biomedical image analysis, named MedMNIST v2. As illustrated in Fig. 1, MedMNIST v2 is a large-scale benchmark for 2D and 3D biomedical image classification, covering 12 2D datasets with 708,069 images and 6 3D datasets with 9,998 images. It is designed to be:

  • Diverse: It covers diverse data modalities, dataset scales (from 100 to 100,000), and tasks (binary/multi-class, multi-label, and ordinal regression). It is as diverse as VDD4 and MSD5, allowing a fair evaluation of the generalization performance of machine learning algorithms in different settings, while additionally providing both 2D and 3D biomedical images.

  • Standardized: Each sub-dataset is pre-processed into the same format (see details in Methods), which requires no background knowledge from users. As an MNIST-like8 dataset collection for classification tasks on small images, it primarily focuses on the machine learning part rather than the end-to-end system. Furthermore, we provide standard train-validation-test splits for all datasets in MedMNIST v2, so that algorithms can be compared easily.

  • Lightweight: The small size of 28 × 28 (2D) or 28 × 28 × 28 (3D) makes it cheap and convenient to evaluate machine learning algorithms.

  • Educational: As an interdisciplinary research area, biomedical image analysis is difficult for researchers from other communities to get hands-on experience with, as it requires background knowledge in computer vision, machine learning, biomedical imaging, and clinical science. Our data, released under the Creative Commons (CC) License, is easy to use for educational purposes.

Fig. 1

An overview of MedMNIST v2. MedMNIST is a large-scale MNIST-like collection of standardized 2D and 3D biomedical images with classification labels. It is designed to be diverse, standardized, educational, and lightweight, which could support numerous research/educational purposes.

MedMNIST v2 extends our preliminary version, MedMNIST v19, which contains 10 2D datasets for medical image classification. As MedMNIST v1 is more medically oriented, we additionally provide 2 2D bioimage datasets. Considering the popularity of 3D imaging in the biomedical area, we carefully develop 6 3D datasets following the same design principles as the 2D ones. A comparison of the “decathlon” datasets can be found in Table 1. We benchmark several standard deep learning methods and AutoML tools with MedMNIST v2 on both 2D and 3D datasets, including ResNets10 with an early-stopping strategy on the validation set, open-source AutoML tools (auto-sklearn11 and AutoKeras12) and a commercial AutoML tool, Google AutoML Vision (for 2D only). All benchmark experiments are repeated at least 3 times for more stable results than in MedMNIST v1. Besides, the code for MedMNIST has been refactored to make it more user-friendly.

Table 1 A comparison of MedMNIST v2 and other “decathlon” datasets.

As a large-scale benchmark in biomedical image analysis, MedMNIST has been particularly useful for machine learning and computer vision research13,14,15, e.g., AutoML, trustworthy machine learning, and domain adaptive learning. Moreover, considering the scarcity of 3D image classification datasets, MedMNIST3D in MedMNIST v2, drawn from diverse backgrounds, could benefit research in 3D computer vision.

Methods

Design principles

The MedMNIST v2 dataset consists of 12 2D and 6 3D standardized datasets from carefully selected sources covering primary data modalities (e.g., X-ray, OCT, ultrasound, CT, electron microscopy), diverse classification tasks (binary/multi-class, ordinal regression, and multi-label) and dataset scales (from 100 to 100,000). We illustrate the landscape of MedMNIST v2 in Fig. 2. As data modalities are hard to order on a single axis, we use the imaging resolution to represent the modality instead. The diverse dataset design leads to diverse task difficulty, which is desirable for a biomedical image classification benchmark.

Fig. 2

The landscape of MedMNIST v2. The horizontal axis denotes the base-10 logarithm of the dataset scale, and the vertical axis denotes the base-10 logarithm of the imaging resolution. Upward and downward triangles distinguish 2D datasets from 3D datasets, and 4 different colors represent the different tasks.

Although comparing performance on a hold-out test set alone is fair, it could be expensive to analyze the impact of different train-validation splits. Therefore, we provide an official train-validation-test split for each subset. We use the official data split from the source dataset (if provided) to avoid data leakage. If the source dataset has only a split into training and validation sets, we use the official validation set as the test set and split the official training set into training and validation sets with a ratio of 9:1. For datasets without an official split, we split the data randomly at the patient level into training, validation and test sets with a ratio of 7:1:2. All images are pre-processed into an MNIST-like format, i.e., 28 × 28 (2D) or 28 × 28 × 28 (3D), using cubic spline interpolation for image resizing. MedMNIST uses the classification labels from the source datasets directly in most cases, but the labels may be simplified (classes merged or deleted) if the classification task on the small images would otherwise be too difficult. All source datasets are either associated with Creative Commons (CC) Licenses or developed by us, which allows us to develop derivative datasets based on them. Some datasets are under the CC-BY-NC license; we have contacted the authors and obtained permission to re-distribute these datasets.
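
As a concrete illustration, the resizing step could be sketched as follows. This is a minimal example assuming SciPy's cubic spline interpolation (order=3) on a NumPy array, not the exact pre-processing code.

```python
# Minimal resizing sketch, assuming SciPy; order=3 is cubic spline interpolation.
import numpy as np
from scipy.ndimage import zoom

def resize_to_medmnist(img: np.ndarray, target: int = 28) -> np.ndarray:
    """Resize a 2D (H, W) or 3D (D, H, W) gray-scale image to target^2 or target^3."""
    factors = [target / s for s in img.shape]
    resized = zoom(img.astype(np.float32), factors, order=3)
    return np.clip(resized, 0, 255).astype(np.uint8)

# Example: a 512 x 512 slice and a 64 x 64 x 64 volume both become MNIST-like.
print(resize_to_medmnist(np.random.randint(0, 256, (512, 512))).shape)    # (28, 28)
print(resize_to_medmnist(np.random.randint(0, 256, (64, 64, 64))).shape)  # (28, 28, 28)
```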

We list the details of all datasets in Table 2. For simplicity, we refer to the collection of all 2D datasets as MedMNIST2D, and that of all 3D datasets as MedMNIST3D. In the following sections, we describe how each dataset is created.

Table 2 Data summary of the MedMNIST v2 dataset, including data source, data modality, type of classification task (with the number of classes for multi-class tasks or the number of labels for multi-label tasks), and the number of samples in total and in each data split (training/validation/test).

Details for MedMNIST2D

PathMNIST

The PathMNIST is based on a prior study16,17 on predicting survival from colorectal cancer histology slides, which provides a dataset (NCT-CRC-HE-100K) of 100,000 non-overlapping image patches from hematoxylin & eosin stained histological images, and a test dataset (CRC-VAL-HE-7K) of 7,180 image patches from a different clinical center. The dataset comprises 9 types of tissues, resulting in a multi-class classification task. We resize the source images of 3 × 224 × 224 into 3 × 28 × 28, and split NCT-CRC-HE-100K into training and validation sets with a ratio of 9:1. CRC-VAL-HE-7K is treated as the test set.

ChestMNIST

The ChestMNIST is based on the NIH-ChestXray14 dataset18, which comprises 112,120 frontal-view X-ray images of 30,805 unique patients with 14 text-mined disease labels, formulated as a multi-label binary-class classification task. We use the official data split, and resize the source images of 1 × 1,024 × 1,024 into 1 × 28 × 28.

DermaMNIST

The DermaMNIST is based on HAM1000019,20,21, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. The dataset consists of 10,015 dermatoscopic images categorized into 7 different diseases, formulated as a multi-class classification task. We split the images into training, validation and test sets with a ratio of 7:1:2. The source images of 3 × 600 × 450 are resized into 3 × 28 × 28.

OCTMNIST

The OCTMNIST is based on a prior dataset22,23 of 109,309 valid optical coherence tomography (OCT) images for retinal diseases. The dataset comprises 4 diagnosis categories, leading to a multi-class classification task. We split the source training set into training and validation sets with a ratio of 9:1, and use its source validation set as the test set. The source images are gray-scale, and their sizes are (384–1,536) × (277–512). We center-crop the images with a window whose side length equals the short edge, and resize them into 1 × 28 × 28.
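
For illustration, the short-edge center crop could be sketched as below; this is a hedged example assuming a 2D gray-scale NumPy array, and the subsequent resizing to 28 × 28 follows the cubic spline interpolation described in the design principles.

```python
# Hedged sketch of the short-edge center crop; illustrative, not the exact code.
import numpy as np

def center_crop_short_edge(img: np.ndarray) -> np.ndarray:
    """Crop a (H, W) image to a centered square whose side equals the short edge."""
    h, w = img.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return img[top:top + side, left:left + side]

# Example: a 496 x 512 OCT-like image becomes 496 x 496 before resizing to 28 x 28.
print(center_crop_short_edge(np.zeros((496, 512), dtype=np.uint8)).shape)  # (496, 496)
```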

PneumoniaMNIST

The PneumoniaMNIST is based on a prior dataset22,23 of 5,856 pediatric chest X-ray images. The task is binary-class classification of pneumonia against normal. We split the source training set into training and validation sets with a ratio of 9:1, and use its source validation set as the test set. The source images are gray-scale, and their sizes are (384–2,916) × (127–2,713). We center-crop the images with a window whose side length equals the short edge, and resize them into 1 × 28 × 28.

RetinaMNIST

The RetinaMNIST is based on the DeepDRiD24 challenge, which provides a dataset of 1,600 retina fundus images. The task is ordinal regression for 5-level grading of diabetic retinopathy severity. We split the source training set into training and validation sets with a ratio of 9:1, and use the source validation set as the test set. The source images of 3 × 1,736 × 1,824 are center-cropped with a window whose side length equals the short edge and resized into 3 × 28 × 28.

BreastMNIST

The BreastMNIST is based on a dataset25 of 780 breast ultrasound images categorized into 3 classes: normal, benign, and malignant. As we use low-resolution images, we simplify the task into binary classification by combining normal and benign as positive and classifying them against malignant as negative. We split the source dataset into training, validation and test sets with a ratio of 7:1:2. The source images of 1 × 500 × 500 are resized into 1 × 28 × 28.

BloodMNIST

The BloodMNIST is based on a dataset26,27 of individual normal blood cells, captured from individuals without infection, hematologic or oncologic disease and free of any pharmacologic treatment at the moment of blood collection. It contains a total of 17,092 images organized into 8 classes. We split the source dataset into training, validation and test sets with a ratio of 7:1:2. The source images of 3 × 360 × 363 pixels are center-cropped into 3 × 200 × 200 and then resized into 3 × 28 × 28.

TissueMNIST

The TissueMNIST is based on BBBC05128, available from the Broad Bioimage Benchmark Collection29. The dataset contains 236,386 human kidney cortex cells, segmented from 3 reference tissue specimens and organized into 8 categories. We split the source dataset into training, validation and test sets with a ratio of 7:1:2. Each gray-scale image is 32 × 32 × 7 pixels, where 7 denotes the number of slices. We obtain 2D maximum projections by taking the maximum pixel value along the slice axis, and resize the projections into 28 × 28 gray-scale images.
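
The maximum projection could be sketched as below, assuming each cell is stored as a (32, 32, 7) array with the slice index on the last axis; this is an illustrative example rather than the exact pipeline.

```python
# Illustrative 2D maximum projection along the slice axis (assumed to be the last axis).
import numpy as np

def max_projection(cell: np.ndarray) -> np.ndarray:
    """Collapse a (32, 32, 7) stack of slices into a (32, 32) projection."""
    return cell.max(axis=-1)

print(max_projection(np.zeros((32, 32, 7), dtype=np.uint8)).shape)  # (32, 32), later resized to 28 x 28
```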

Organ{A,C,S}MNIST

The Organ{A,C,S}MNIST is based on 3D computed tomography (CT) images from the Liver Tumor Segmentation Benchmark (LiTS)30. The datasets are renamed from OrganMNIST_{Axial,Coronal,Sagittal} (in MedMNIST v19) for simplicity. We use bounding-box annotations of 11 body organs from another study31 to obtain the organ labels. The Hounsfield Units (HU) of the 3D images are transformed into gray-scale with an abdominal window. We crop 2D images from the center slices of the 3D bounding boxes in the axial/coronal/sagittal views (planes); the only difference among Organ{A,C,S}MNIST is the view. The images are resized into 1 × 28 × 28 to perform multi-class classification of 11 body organs. 115 and 16 CT scans from the source training set are used as the training and validation sets, respectively. The 70 CT scans from the source test set are treated as the test set.
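
The HU-to-gray-scale transform could be sketched as below; the abdominal window center and width shown here (50 HU and 400 HU) are assumptions for illustration and may differ from the exact values used.

```python
# Hedged sketch of HU windowing; window center/width are illustrative assumptions.
import numpy as np

def hu_to_grayscale(volume_hu: np.ndarray, center: float = 50.0, width: float = 400.0) -> np.ndarray:
    """Clip HU values to an abdominal window and rescale linearly to [0, 255]."""
    low, high = center - width / 2, center + width / 2
    clipped = np.clip(volume_hu, low, high)
    return ((clipped - low) / (high - low) * 255).astype(np.uint8)

print(hu_to_grayscale(np.array([[-1000.0, 0.0, 300.0]])))  # air maps to 0, soft tissue to mid-range
```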

Details for MedMNIST3D

OrganMNIST3D

The source of OrganMNIST3D is the same as that of Organ{A,C,S}MNIST. Instead of 2D slices, we directly use the 3D bounding boxes and process the images into 28 × 28 × 28 to perform multi-class classification of 11 body organs. The same 115 and 16 CT scans as in Organ{A,C,S}MNIST from the source training set are used as the training and validation sets, respectively, and the same 70 CT scans from the source test set are treated as the test set.

NoduleMNIST3D

The NoduleMNIST3D is based on LIDC-IDRI32, a large public lung nodule dataset containing images from thoracic CT scans. The dataset is designed for both lung nodule segmentation and 5-level malignancy classification. To perform binary classification, we categorize cases with malignancy levels 1/2 as the negative class and levels 4/5 as the positive class, ignoring cases with malignancy level 3. We split the source dataset into training, validation and test sets with a ratio of 7:1:2, and center-crop the spatially normalized images (with a spacing of 1 mm × 1 mm × 1 mm) into 28 × 28 × 28.

AdrenalMNIST3D

The AdrenalMNIST3D is a new 3D shape classification dataset consisting of shape masks of 1,584 left and right adrenal glands (i.e., from 792 patients). Collected from Zhongshan Hospital Affiliated to Fudan University, each 3D shape of an adrenal gland is annotated by an expert endocrinologist using abdominal computed tomography (CT), together with a binary classification label of normal adrenal gland or adrenal mass. Considering patient privacy, we do not provide the source CT scans, but only the real 3D shapes of the adrenal glands and their classification labels. We calculate the center of each adrenal gland and resize the center-cropped 64 mm × 64 mm × 64 mm volume into 28 × 28 × 28. The dataset is randomly split into training/validation/test sets of 1,188/98/298 at the patient level.

FractureMNIST3D

The FractureMNIST3D is based on the RibFrac dataset33, containing around 5,000 rib fractures from 660 computed tomography (CT) scans. The dataset organizes detected rib fractures into 4 clinical categories (i.e., buckle, nondisplaced, displaced, and segmental rib fractures). As we use low-resolution images, we disregard segmental rib fractures and classify the 3 remaining types (i.e., buckle, nondisplaced, and displaced). For each annotated fracture area, we calculate its center and resize the center-cropped 64 mm × 64 mm × 64 mm image into 28 × 28 × 28. The official split into training, validation and test sets is used.

VesselMNIST3D

The VesselMNIST3D is based on an open-access 3D intracranial aneurysm dataset, IntrA34, containing 103 3D models (meshes) of entire brain vessels reconstructed from MRA images. 1,694 healthy vessel segments and 215 aneurysm segments are generated automatically from the complete models. We fix the non-watertight meshes with PyMeshFix35 and voxelize the watertight meshes with trimesh36 into 28 × 28 × 28 voxels. We split the source dataset into training, validation and test sets with a ratio of 7:1:2.
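
The repair-and-voxelize step could be sketched as below; this is a hedged example assuming the public PyMeshFix and trimesh APIs, with the pitch computation and padding handling simplified for illustration.

```python
# Hedged sketch of mesh repair (PyMeshFix) and voxelization (trimesh); simplified.
import numpy as np
import pymeshfix
import trimesh

def voxelize_vessel(vertices: np.ndarray, faces: np.ndarray, size: int = 28) -> np.ndarray:
    """Fix a possibly non-watertight mesh and voxelize it into a binary volume."""
    v_clean, f_clean = pymeshfix.clean_from_arrays(vertices, faces)  # watertight repair
    mesh = trimesh.Trimesh(vertices=v_clean, faces=f_clean)
    pitch = mesh.extents.max() / size                                # voxel edge length
    occupancy = mesh.voxelized(pitch=pitch).fill().matrix            # boolean occupancy grid
    vol = occupancy.astype(np.uint8) * 255
    # padding/cropping to exactly (28, 28, 28) is omitted here for brevity
    return vol
```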

SynapseMNIST3D

The SynapseMNIST3D is a new 3D volume dataset for classifying whether a synapse is excitatory or inhibitory. It uses a 3D image volume of an adult rat acquired with a multi-beam scanning electron microscope. The original data is of size 100 × 100 × 100 μm³ with a voxel resolution of 8 × 8 × 30 nm³; a (30 μm)³ sub-volume was used in the MitoEM dataset37 with dense 3D mitochondria instance segmentation labels. Three neuroscience experts segment a pyramidal neuron within the whole volume and proofread all the synapses on this neuron with excitatory/inhibitory labels. For each labeled synaptic location, we crop a 3D volume of 1,024 × 1,024 × 1,024 nm³ and resize it into 28 × 28 × 28 voxels. Finally, the dataset is randomly split into training, validation and test sets with a ratio of 7:1:2.

Data Records

The data files of the MedMNIST v2 dataset can be accessed at Zenodo38. It contains 12 pre-processed 2D datasets (MedMNIST2D) and 6 pre-processed 3D datasets (MedMNIST3D). Each subset is saved in the NumPy39 npz format, named <data>mnist.npz for MedMNIST2D and <data>mnist3d.npz for MedMNIST3D, and is comprised of 6 keys (“train_images”, “train_labels”, “val_images”, “val_labels”, “test_images”, “test_labels”). The data type is uint8. A minimal loading example is given after the key descriptions below.

  • “{train,val,test}_images”: an array containing the images, with a shape of N × 28 × 28 for 2D gray-scale datasets, N × 28 × 28 × 3 for 2D RGB datasets, and N × 28 × 28 × 28 for 3D datasets, where N denotes the number of samples in the training/validation/test set.

  • “{train,val,test}_labels”: an array containing the ground-truth labels, with a shape of N × 1 for multi-class/binary-class/ordinal regression datasets and N × L for multi-label binary-class datasets, where N denotes the number of samples in the training/validation/test set and L denotes the number of task labels in the multi-label dataset (i.e., 14 for ChestMNIST).
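
Below is a minimal loading example for a downloaded subset; the PathMNIST file name is used purely for illustration.

```python
# Minimal loading example for a downloaded MedMNIST subset (PathMNIST as illustration).
import numpy as np

data = np.load("pathmnist.npz")  # use e.g. "organmnist3d.npz" for a MedMNIST3D subset
x_train, y_train = data["train_images"], data["train_labels"]
x_val, y_val = data["val_images"], data["val_labels"]
x_test, y_test = data["test_images"], data["test_labels"]
print(x_train.shape, y_train.shape)  # (N, 28, 28, 3) and (N, 1) for a 2D RGB multi-class subset
```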

Technical Validation

Baseline methods

For MedMNIST2D, we first implement ResNets10 with a simple early-stopping strategy on the validation set as baseline methods. The ResNet model contains 4 residual layers, and each layer has several blocks, each a stack of convolutional layers, batch normalization and ReLU activation. The input channel is always 3 since we convert gray-scale images into RGB. To fairly compare with other methods, the input resolutions are 28 or 224 (resized from 28) for ResNet-18 and ResNet-50. For all model training, we use cross-entropy loss and set the batch size to 128. We utilize an Adam optimizer40 with an initial learning rate of 0.001 and train the model for 100 epochs, decaying the learning rate by a factor of 0.1 after 50 and 75 epochs.
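
A hedged sketch of this 2D training setup is shown below, assuming the torchvision ResNet-18 at resolution 28; a dummy random-data loader stands in for the real MedMNIST loader, and the early stopping on validation AUC is omitted.

```python
# Hedged sketch of the 2D baseline training setup; dummy random data replaces
# the real MedMNIST loader, and early stopping on validation AUC is omitted.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

model = resnet18(num_classes=9)  # e.g. 9 tissue classes for PathMNIST
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)

# Dummy stand-in loader: 3-channel 28 x 28 inputs, batch size 128.
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 3, 28, 28), torch.randint(0, 9, (256,))),
    batch_size=128, shuffle=True)

for epoch in range(100):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decays the learning rate by 0.1 after epochs 50 and 75
```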

For MedMNIST3D, we implement ResNet-18/ResNet-5010 with 2.5D/3D/ACS41 convolutions and a simple early-stopping strategy on the validation set as baseline methods, using the one-line 2D neural network converters provided in the official ACS code repository (https://github.com/M3DV/ACSConv). When loading the datasets, we copy the single channel into 3 channels to make them compatible. For all model training, we use cross-entropy loss and set the batch size to 32. We utilize an Adam optimizer40 with an initial learning rate of 0.001 and train the model for 100 epochs, decaying the learning rate by a factor of 0.1 after 50 and 75 epochs. Additionally, as a regularization for the two datasets of shape modality (i.e., AdrenalMNIST3D/VesselMNIST3D), we multiply the training images by a random value in [0, 1] during training and multiply the images by a fixed coefficient of 0.5 during evaluation.
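
A hedged sketch of how a 2D backbone could be turned into a 3D-input baseline with the one-line converters is shown below; the converter names follow the official ACSConv repository, and the class count is set for OrganMNIST3D as an example.

```python
# Hedged sketch of the one-line 2D-to-3D conversion with ACSConv (ACS/2.5D/3D).
import torch
from torchvision.models import resnet18
from acsconv.converters import ACSConverter  # Conv2_5dConverter / Conv3dConverter also available

model_2d = resnet18(num_classes=11)   # 11 organ classes for OrganMNIST3D
model_acs = ACSConverter(model_2d)    # converts the 2D convolutions into ACS convolutions

x = torch.randn(2, 3, 28, 28, 28)     # the single channel is copied into 3 channels
logits = model_acs(x)                 # -> shape (2, 11)
```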

The details of model implementation and training scheme can be found in our code.

AutoML Methods

We have also selected several AutoML methods: auto-sklearn11 as the representative of open-source AutoML tools for statistical machine learning, AutoKeras12 as the representative of open-source AutoML tools for deep learning, and Google AutoML Vision as the representative of commercial, deep-learning-empowered black-box AutoML tools. We run auto-sklearn11 and AutoKeras12 on both MedMNIST2D and MedMNIST3D, and Google AutoML Vision on MedMNIST2D only.

auto-sklearn11 automatically searches the algorithms and hyper-parameters of the scikit-learn42 package. We set the time limit for model search according to the dataset scale: 2 hours for 2D datasets with fewer than 10,000 samples, 4 hours for those with 10,000 to 50,000 samples, and 6 hours for those with more than 50,000 samples. For 3D datasets, we set the time limit to 4 hours. We flatten the images into one dimension and provide the reshaped one-dimensional data with the corresponding labels for auto-sklearn to fit.
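
A hedged sketch of the auto-sklearn setup on flattened images is shown below; BreastMNIST is used for illustration, with the 2-hour budget that applies to 2D datasets below 10,000 samples.

```python
# Hedged sketch of fitting auto-sklearn on flattened 1D image vectors.
import numpy as np
import autosklearn.classification

data = np.load("breastmnist.npz")
x_train = data["train_images"].reshape(len(data["train_images"]), -1)  # flatten to 1D
y_train = data["train_labels"].ravel()

clf = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=2 * 3600)
clf.fit(x_train, y_train)

x_test = data["test_images"].reshape(len(data["test_images"]), -1)
y_pred = clf.predict(x_test)
```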

AutoKeras12, based on the Keras package43, searches for deep neural networks and hyper-parameters. For each dataset, we set max_trials to 20 and the number of epochs to 20, i.e., it tries 20 different Keras models and trains each model for 20 epochs. We choose the best model based on the highest AUC score on the validation set.
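
A hedged sketch of the AutoKeras setup is shown below; PneumoniaMNIST is used for illustration, and the selection of the best trial by validation AUC is not shown.

```python
# Hedged sketch of the AutoKeras setup: 20 trials, 20 epochs per trial.
import numpy as np
import autokeras as ak

data = np.load("pneumoniamnist.npz")
clf = ak.ImageClassifier(max_trials=20)
clf.fit(data["train_images"], data["train_labels"],
        validation_data=(data["val_images"], data["val_labels"]), epochs=20)
y_pred = clf.predict(data["test_images"])
```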

Google AutoML Vision (https://cloud.google.com/vision/automl/docs, experimented with in July 2021) is a commercial AutoML tool offered as a service on Google Cloud. We train Edge-exportable models of MedMNIST2D on Google AutoML Vision and export the trained, quantized models into TensorFlow Lite format for offline inference. We set the number of node hours for each dataset according to the data scale: 1 node hour for datasets with a scale of around 1,000, 2 node hours for around 10,000, 3 node hours for around 100,000, and 4 node hours for around 200,000.

Evaluation

Area under the ROC curve (AUC)44 and accuracy (ACC) are used as the evaluation metrics. AUC is a threshold-free metric that evaluates the continuous prediction scores, while ACC evaluates the discrete prediction labels given a threshold (or argmax). AUC is less sensitive to class imbalance than ACC; since there is no severe class imbalance in our datasets, ACC also serves as a good metric. Although there are many other metrics, we select AUC and ACC for the sake of simplicity and standardization of evaluation. We report the AUC and ACC for each dataset. Data users are also encouraged to analyze the average performance over the 12 2D datasets and the 6 3D datasets to benchmark their methods; we therefore report the average AUC and ACC scores over MedMNIST2D and MedMNIST3D, respectively, to easily compare the performance of different methods.
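
For clarity, the two metrics could be computed as in the minimal scikit-learn sketch below for a multi-class case, where y_score holds per-class probability predictions.

```python
# Minimal sketch of AUC (threshold-free) and ACC (argmax-based) with scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 2, 1, 2])                 # ground-truth class labels
y_score = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6],
                    [0.2, 0.5, 0.3],
                    [0.1, 0.1, 0.8]])           # predicted per-class probabilities
auc = roc_auc_score(y_true, y_score, multi_class="ovr")
acc = accuracy_score(y_true, y_score.argmax(axis=1))
print(auc, acc)
```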

Benchmark on each dataset

The performance on each dataset of MedMNIST2D and MedMNIST3D is reported in Tables 3 and 4, respectively. We calculate the mean value of at least 3 trials for each method on each dataset.

Table 3 Benchmark on each dataset of MedMNIST2D in metrics of AUC and ACC.
Table 4 Benchmark on each dataset of MedMNIST3D in metrics of AUC and ACC.

For 2D datasets, Google AutoML Vision performs well in general; however, it does not always win, even compared with the baseline ResNet-18 and ResNet-50. Auto-sklearn performs poorly on most datasets, indicating that typical statistical machine learning algorithms do not work well on our 2D medical image datasets. AutoKeras performs well on large-scale datasets, but relatively worse on small-scale ones. With the same ResNet backbone depth, models at resolution 224 generally outperform those at resolution 28. At resolution 28, ResNet-18 achieves higher scores than ResNet-50 on most datasets.

For 3D datasets, AutoKeras does not work well, while auto-sklearn performs better than on MedMNIST2D. Auto-sklearn is superior to ResNet-18 + 2.5D and ResNet-50 + 2.5D in general, and even outperforms all the other methods in ACC score on AdrenalMNIST3D. 2.5D models perform worse than 3D and ACS models, while 3D and ACS models are comparable to each other. With 3D convolutions, the ResNet-50 backbone surpasses ResNet-18.

Average performance of each method

To compare the performance of various methods, we report the average AUC and average ACC of each method over all datasets. The average performance of the methods on MedMNIST2D and MedMNIST3D is reported in Tables 5 and 6, respectively. Despite the large gaps among the metrics of the different sub-datasets, the average AUC and ACC still reflect the overall performance of each method.

Table 5 Average performance of MedMNIST2D in metrics of average AUC and average ACC over all 2D datasets.
Table 6 Average performance of MedMNIST3D in metrics of average AUC and average ACC over all 3D datasets.

For MedMNIST2D, Google AutoML Vision outperforms all the other methods in average AUC; however, it is very close to the baseline ResNets. The ResNets surpass auto-sklearn and AutoKeras, and outperform Google AutoML Vision in average ACC. With the same backbone, models at resolution 224 achieve higher AUC and ACC scores than those at resolution 28, while at the same resolution, ResNet-18 is superior to ResNet-50.

For MedMNIST3D, AutoKeras does not perform well, even worse than auto-sklearn. With the same ResNet backbone, 2.5D models are inferior to 3D and ACS models and also perform worse than auto-sklearn and AutoKeras. Surprisingly, ResNet-50 with standard 3D convolutions outperforms all the other methods on average.

Difference between Organ{A,C,S}MNIST and OrganMNIST3D

Organ{A,C,S}MNIST and OrganMNIST3D are generated from the same source dataset, and share the same task and data split. However, the samples in the 2D and 3D datasets are different. Organ{A,C,S}MNIST are sampled from slices of the 3D bounding boxes of the 3D CT images in the axial/coronal/sagittal views (planes), respectively; they are sliced before being resized into 1 × 28 × 28. On the other hand, OrganMNIST3D is resized into 28 × 28 × 28 directly. Therefore, the Organ{A,C,S}MNIST metrics in Table 3 and the OrganMNIST3D metrics in Table 4 should not be compared.

We perform experiments to clarify the difference between Organ{A,C,S}MNIST and OrganMNIST3D. We slice the OrganMNIST3D volumes in the axial/coronal/sagittal views (planes), respectively, to generate central slices; for each view, we keep the central 60% of slices and discard the remaining 40%. We then evaluate model performance on OrganMNIST3D with 2D-input ResNet-18 trained on Organ{A,C,S}MNIST and on the axial/coronal/sagittal central slices of OrganMNIST3D, as well as with 3D-input ResNet-18. The results are reported in Table 7. The performance of the 3D-input models is generally comparable to that of the 2D-input models with the axial view. In other words, with an appropriate setting, 2D inputs and 3D inputs are comparable on the OrganMNIST3D dataset.
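
The central-slice extraction in this experiment could be sketched as below; the 60% ratio is taken from the description above, and the choice of axis 0 for the axial view is an assumption for illustration.

```python
# Illustrative sketch of keeping the central 60% of slices of a 28^3 volume along one view axis.
import numpy as np

def central_slices(volume: np.ndarray, axis: int = 0, keep: float = 0.6) -> np.ndarray:
    """Keep the central `keep` fraction of slices along `axis`, discarding the rest."""
    n = volume.shape[axis]
    margin = int(round(n * (1 - keep) / 2))
    return np.take(volume, range(margin, n - margin), axis=axis)

print(central_slices(np.zeros((28, 28, 28))).shape)  # (16, 28, 28)
```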

Table 7 Model performance on OrganMNIST3D test set in various settings, including (upper) 2D-input ResNet-1810 trained with Organ{A,C,S}MNIST and axial/coronal/sagittal central slices of OrganMNIST3D, and (lower) 3D-input ResNet-18 with 2.5D/3D/ACS41 convolutions, trained with OrganMNIST3D (same as Table 4).

Usage Notes

MedMNIST is freely available at https://medmnist.com/. We would be grateful if users of the MedMNIST dataset could cite MedMNIST v19 and v2 (this paper), as well as the corresponding source datasets, in their publications.

Please note that this dataset is NOT intended for clinical use, as substantially reducing the resolution of medical images might result in images that are insufficient to represent and capture different disease pathologies.