BCN20000: Dermoscopic Lesions in the Wild

Advances in dermatological artificial intelligence research require high-quality, comprehensive datasets that mirror real-world clinical scenarios. We introduce a collection of 18,946 dermoscopic images spanning from 2010 to 2016, collated at the Hospital Clínic in Barcelona, Spain. The BCN20000 dataset aims to address the problem of unconstrained classification of dermoscopic images of skin cancer, including lesions in hard-to-diagnose locations such as nails and mucosa, large lesions that do not fit in the aperture of the dermoscopy device, and hypo-pigmented lesions. Our dataset covers eight key diagnostic categories in dermoscopy, providing a diverse range of lesions for artificial intelligence model training. Furthermore, a ninth out-of-distribution (OOD) class is present in the test set, comprising lesions that could not be distinctly classified as any of the others. By providing a comprehensive collection of varied images, BCN20000 helps bridge the gap between the training data for machine learning models and the day-to-day practice of medical practitioners. Additionally, we present a set of baseline classifiers based on state-of-the-art neural networks, which other researchers can extend for further experimentation.


Background and Summary
Skin cancer is one of the most frequent types of cancer and manifests mainly in the areas of the skin most exposed to the sun. Since skin cancer occurs on the surface of the skin, its lesions can be evaluated by visual inspection. Dermoscopy is a non-invasive method that removes the surface reflection of the skin, making its deeper layers visible. Prior research has found that this technique improves visualization of the lesion structures, enhancing the accuracy of dermatologists [1,9].
The increased availability of dermoscopic images has motivated the development of more sophisticated algorithms based on deep learning, mainly convolutional neural networks [5,13,2]. A significant player in the adoption of these algorithms in the community has been the International Skin Imaging Collaboration (ISIC), which has organized yearly challenges since 2016 in which participants are asked to develop computer vision algorithms to segment and classify skin lesions in dermoscopic images [10,6,4,3]. Tschandl et al. showed that the top-scoring algorithms of the ISIC 2018 Challenge already surpassed the performance of expert dermatologists [11,4]. However, as the authors pointed out, the algorithms tended to perform worse on images from other dermoscopic data sources that were not represented in the HAM10000 dataset [12].
In BCN20000, we aim to study the problem of unconstrained classification of dermoscopic images of skin cancer, including lesions found in hard-to-diagnose locations (nails and mucosa), lesions that cannot be segmented, and hypopigmented lesions: dermoscopic lesions in the wild. Most of the lesions would be considered hard to diagnose and had to be excised and histopathologically diagnosed. Together with the images, we provide information on the anatomic location of the lesion and the age and sex of the patient. Our aim is to create a challenge that more closely resembles what dermatologists face when seeing a patient in clinical practice.
The images were captured from 2010 until 2016 using a set of dermoscopic attachments on three high-resolution cameras and were stored in a directory structure on a hospital server. To create the BCN20000 database, these images were retrieved, organized and filtered using various computer vision algorithms. They were then linked with their corresponding diagnoses using a reference database.

Usage Notes
The images from the BCN20000 database can be divided into the following categories: nevus, melanoma, basal cell carcinoma, seborrheic keratosis, actinic keratosis, squamous cell carcinoma, dermatofibroma, vascular lesion and 'other' (lesions not contained in any of the other categories). To make the task more similar to clinical routine, each image is coupled with metadata on the anatomic location of the lesion and the age and sex of the patient.
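As an illustration of how the categories and metadata listed above could be represented in code, the following minimal Python sketch defines the label set and a per-image record. The names `Diagnosis`, `LesionRecord`, `TRAINABLE_CLASSES` and all field names are hypothetical choices made for this example; they are not the column names of the released files.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Diagnosis(Enum):
    """The nine categories named in the text (identifiers are illustrative)."""
    NEVUS = "nevus"
    MELANOMA = "melanoma"
    BASAL_CELL_CARCINOMA = "basal cell carcinoma"
    SEBORRHEIC_KERATOSIS = "seborrheic keratosis"
    ACTINIC_KERATOSIS = "actinic keratosis"
    SQUAMOUS_CELL_CARCINOMA = "squamous cell carcinoma"
    DERMATOFIBROMA = "dermatofibroma"
    VASCULAR_LESION = "vascular lesion"
    OTHER = "other"  # lesions not contained in any of the other categories


@dataclass(frozen=True)
class LesionRecord:
    """One image with its accompanying metadata (field names are assumptions)."""
    image_id: str
    diagnosis: Diagnosis
    anatomic_location: Optional[str] = None  # may be missing for some images
    age: Optional[int] = None
    sex: Optional[str] = None


# The eight categories a classifier would be trained on, excluding 'other'.
TRAINABLE_CLASSES = [d for d in Diagnosis if d is not Diagnosis.OTHER]

if __name__ == "__main__":
    example = LesionRecord("img_0001", Diagnosis.MELANOMA, "nail", 63, "female")
    print(example.diagnosis.value, len(TRAINABLE_CLASSES))
```

Keeping 'other' as a separate member makes it straightforward to exclude it from training while still scoring it at evaluation time.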
The dataset will be part of the ISIC 2019 Challenge [8], where participants will be asked to classify images among various diagnostic categories and to identify out-of-distribution situations, in which the algorithm sees a skin lesion it has not been trained to deal with. We will also make the dataset available through the ISIC Archive [7].
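One simple way to handle the out-of-distribution requirement is to reject predictions whose confidence is low. The sketch below thresholds the maximum softmax probability, a common baseline that is used here purely for illustration; it is not the method prescribed by the challenge or used for the baselines in this work. The function names, the `threshold` value of 0.5 and the `unknown_label` string are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_rejection(
    logits: Sequence[float],
    class_names: Sequence[str],
    threshold: float = 0.5,        # assumed value; tune on validation data
    unknown_label: str = "other",  # label returned for rejected inputs
) -> Tuple[str, float]:
    """Return the predicted class, or `unknown_label` if confidence is low."""
    if len(logits) != len(class_names):
        raise ValueError("logits and class_names must have the same length")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return unknown_label, probs[best]
    return class_names[best], probs[best]


if __name__ == "__main__":
    names = ["nevus", "melanoma", "basal cell carcinoma", "seborrheic keratosis",
             "actinic keratosis", "squamous cell carcinoma", "dermatofibroma",
             "vascular lesion"]
    # A confident prediction keeps its class; a flat one is rejected.
    print(classify_with_rejection([10.0] + [0.0] * 7, names))
    print(classify_with_rejection([0.0] * 8, names))
```

In practice the threshold would need to be chosen on held-out data, since softmax confidence is often poorly calibrated for inputs far from the training distribution.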
Finally, the images were manually reviewed by several readers to ensure the plausibility of the diagnosis. The resulting database includes 19424 high-quality dermoscopic images corresponding to 5583 skin lesions captured between 2010 and 2016. All the data contained in the BCN20000 database have received the necessary institutional ethics approval (HCB/2019/0413).

Figure 2: Image count for each diagnosis confirm type (siec: single image expert consensus).