The first annotated set of scanning electron microscopy images for nanoscience

In this paper, we present the first publicly available human-annotated dataset of images obtained by the Scanning Electron Microscopy (SEM). A total of roughly 26,000 SEM images at the nanoscale are classified into 10 categories to form 4 labeled training sets, suited for image recognition tasks. The selected categories span the range of 0D objects such as particles, 1D nanowires and fibres, 2D films and coated surfaces as well as patterned surfaces, and 3D structures such as microelectromechanical system (MEMS) devices and pillars. Additional categories such as tips and biological are also included to expand the spectrum of possible images. A preliminary degree of hierarchy is introduced, by creating a subtree structure for the categories and populating them with the available images, wherever possible.


Background & Summary
Access to data is playing an increasingly central role in research. Indeed, many research fields rely almost entirely on the availability, curation, and sharing of global data sources provided through data repositories, which have nowadays become a vital part of research infrastructures.
In this context the NFFA-EUROPE project (www.nffa.eu), an advanced European distributed research infrastructure dedicated to nanoscience, has, as one of its key tasks, the objective to design effective means of data sharing at the nanoscale in Europe. The project involves twenty European research partners, and more than 150 different experimental/computational instruments and techniques; the volume of scientific data produced by the project activities is aimed to be hosted in the NFFA-EUROPE Information and Data Repository Platform (IDRP) 1 being open and fully available to the scientific community well defined data policy. The IDRP is complemented by a set of data analysis services that has been steadily growing since the beginning of the project.
Among the available instruments of the NFFA-EUROPE catalogue, Scanning Electron Microscopes (SEMs) are among the mostly requested ones from users, and are available at ten NFFA-EUROPE sites. SEM is indeed a routinely used characterization technique able to provide information about the samples' topography and composition down to a few nanometers resolution by scanning a focused electron beam onto the sample surface.
The first data analysis achievement is a service offered by the NFFA-EUROPE IDRP consisting of an automatic tool for image recognition to store, classify and label SEM images: we employed a supervised machine learning algorithm using deep convolutional neural networks for recognition of SEM images 2 . In order to train the network, it was necessary to provide a labeled training set, i.e. a set of SEM images properly classified by humans into categories.
In this paper, we describe the datasets presented in ref.
2, together with the methods employed to classify and validate them: the primary dataset used for neural network training (SEM dataset, Data Citation 1), a detached smaller dataset in which a hierarchical structure has been introduced (Hierarchical dataset, Data Citation 2), and two datasets which contain~7,000 newly classified images in addition to the SEM dataset, according to different validation criteria (Majority dataset, Data Citation 3, and 100% dataset, Data Citation 4), as described in the Technical Validation section.
The images were generated by the SEM located at the CNR-IOM (Istituto di Officina dei Materiali, www.iom.cnr.it) in Trieste, Italy. In total, close to 100 users have generated roughly 150,000 images over five years at the facility.
From this sample, an initial batch of~20,000 images was manually classified by nanoscientists into 10 categories which spanned a broad range of nano-objects, from 0D particles, 1D nanowires, 2D films and patterned surfaces, and 3D pillars and microelectromechanical devices.
This classified dataset (Data Citation 1) is intended to be functional to a number of possible implementations. Primarily, it can set the basis for a more complete labeled dataset for supervised machine learning which can be further expanded with more SEM images, thus allowing the development of an even more accurate and precise SEM image recognition software. Some categories can be split into multiple subcategories, as further discussed in the text.
On the other hand, image processing techniques are being increasingly applied to microscope images 3,4 . This dataset allows such programs to be tested on a large set of images belonging to a particular category for specific purposes.
In general, we are confident that the accuracy, diversity, and hierarchical structure of our SEM datasets can offer additional opportunities to researchers, connecting the image recognition and the nanoscience communities.
The paper is organized as follows: The Methods section describes how we retrieved and classified the SEM images; the Data Records section contains an overview of each dataset associated with this work, including the repository where they are stored, the contained data files, and their formats; in the Technical Validation section, we present the adopted validation procedure and the obtained results. Finally, the Usage Notes section offers a summary of the datasets and of the aims they were designed at.

Methods
Most of the methods described in detail in this section have been presented in our previous publication 2 .

SEM imaging
All the images collected in the datasets were acquired by nearly 100 scientists and researchers over five years using a ZEISS SUPRA 40 high resolution Field Emission Gun (FEG) SEM, located in the TASC laboratories of the CNR-IOM in Trieste. The microscope can be operated with 0.1-30 kV acceleration voltage and a 4 pA-10 nA probe current with a nominal resolution of 1.3 nm at 15 kV. It is equipped with an Everhart-Thornley Secondary Electrons Detector, a Back-scattered Electrons Detector, and a High efficiency In-lens detector, the latter providing an increased signal-to-noise ratio in image acquisition. The microscope is also equipped with a 4-Channel High-Angle Annular Scanning Transmission ElectronMicroscopy (STEM) detector allowing detection of transmitted electrons and with an Oxford Aztec Energy Dispersive X-ray Spectroscopy (EDS) system and a X-act 10 mm Silicon Drift Detector (SDD) for compositional analysis.

SEM image classification
The SEM images were classified based mainly on the dominant shape and structure of the imaged object. The dimensionality of the nanostructure (0D, 1D, 2D or 3D) was used as a first criterion, being the visual benchmark of the nanostructure itself. In this classification, 0D objects refer to particles which can be either dispersed and individually identified over the image area, or packed together into a clustered structure. 1D objects are referred to as nanowires, rods, or fibres. These structures can be either entangled in bundles or mutually aligned with respect to each other. 2D structures refer to films and coatings on a surface, and include a wide range of cases differing from each other for the surface roughness. 3D structures can refer to pillars or other micro-devices such as microelectromechanical systems (MEMS), typically fabricated by lithography techniques. A schematic view of this classification is illustrated in Fig. 1.
The dimensionality discussed above provided the main basis for categorization, but was not exhaustive. After investigation of the SEM images database and discussions with SEM scientists, additional categories such as biological samples and tips (these last mostly related to atomic force microscopes) were introduced to enlarge the spectrum of the possible categories. The full set of selected categories is illustrated in Fig. 2, with a representative image for each of them, while the breakdown of the number of images in each category is shown in Table 1. These ten categories cover a broad range of SEM images, and they are applicable to scientists working in different areas of nanoscience. At the same time, they were chosen to be as more distinct from one another as possible.

Preliminary degree of hierarchy
A second dataset of 1,038 images (Data Citation 2) was selected to explore a subtree for each category, when appropriate. We warn that this must be considered a preliminary work, due to the very small number of images in~40% of the subcategory nodes. A summary of this initial work is shown in Table 2. The table must be read from the left column to the right one as root-to-leaf branches. Images in some categories did not present a sufficient variety to be split into subclasses. In these cases, the columns of Level 1 and Level 2 were left empty. However, we do not exclude the possibility to enrich the dataset with new subcategories in the future, when more images will be analyzed.

Code availability
The Python script used to generate the Majority dataset (Data Citation 3) and the 100% dataset (Data Citation 4) includes the following steps:

Data Records
In this section, we provide detailed information on the published datasets, including a breakdown of each of them. In addition to the original SEM dataset, we present a preliminary hierarchical subset and two increased datasets obtained by classifying and validating other~7,000 images. A schematic description to summarize the main details is reported in Table 3, while validation results are described in the Technical Validation section and summarized in Table 4.

SEM dataset
To guarantee the reproducibility of the convolutional neural network training results results presented in ref.
2, we provide the reference to the original SEM dataset (Data Citation 1) used in the paper. The SEM dataset consists of 18,577 SEM images produced at the CNR-IOM (Trieste, Italy). The images are 1,024 × 728 pixels jpg files, classified into 10 categories in a folder structure ( Table 1). The folders are individually stored as tar archives, for a total size of~10 GB, and are publicly available for download (Data Citation 1).

Hierarchical dataset
The Hierarchical dataset is composed of 1,038 SEM images produced at CNR-IOM (Trieste, Italy). The images are 1,024 × 728 pixels jpg files, classified into 10 categories and 27 subcategories (as sub-trees) in a folder-subfolder structure ( Table 2). The root folders are individually stored as tar archives, including the subfolders, for a total size of~1 GB, and are publicly available for download (Data Citation 2). The preliminary results obtained from this dataset have been published in ref. 2.

Majority dataset
The Majority dataset consists of 25,537 SEM images produced at CNR-IOM (Trieste, Italy). The images are 1,024 × 728 pixels jpg files, classified into 10 categories in a folder structure according to the majority criterion (Table 4): classification labels have been checked by a group of nanoscientists by means of our SEM web classifier (sem-classifier.nffa.eu), and only those images which have been validated by the absolute majority of the group have been included in the dataset. The folders are individually stored as tar archives, for a total size of 11 GB, and are publicly available for download (Data Citation 3).

Technical Validation
In this section, the procedure applied to validate the two increased SEM datasets and the statistical results obtained are presented. Since October 2017, we have labeled other~7,000 images, thus increasing the dataset size by~37%. As explained later, the method we applied ensures the same conclusions to be true also on the original SEM dataset (Data Citation 1).
To provide a reliable validation, 200 images (20 for each category) were randomly selected out to form a set (which we call human set) aimed to be judged by a group of 7 scientists, expert in SEM image interpretation. The human set was incremented by picking the same number of images (which we call image recognition set) classified by a neural network we had previously trained (see ref. 2 for details). We chose only image recognition results with a score ≥ 99%. We developed a web site (sem-classifier.nffa.eu), where the scientists were asked to access and validate the label assigned to each of the images or to suggest a different category, being unaware of which set (human or image recognition) they were coming from.
The dataset was validated at two levels: if a label of the human set reached the absolute majority of agreement from scientists, the image was validated (Majority dataset, Data Citation 3). This is a commonly adopted method, used e.g. to validate ImageNet 5 ; alternatively, a more stringent criterion was to validate only the images for which the label reached the agreement of all the scientists in the group (100% dataset, Data Citation 4). By definition, the majority dataset contains the 100% dataset.
The results of the validation analysis are reported in Table 4. Looking at the last two rows, it can be noted that the images validated by the absolute majority of the scientists are almost 80% of the total (311 over 400), 168 (54%) of which come from the human set and 143 (45%) from the image recognition set. On the other side, roughly 50% of images (204 over 400) were validated by all the scientists; also in this case, more than 50% of validated images come from the human set (115 over 204).
It is worth noting that during the validation procedure 48 out of the 200 human labeled images were tagged by nanoscientists as "Not clear". They should not be considered as misclassified images, rather they can be used to estimate the human uncertainty contribution to the validation error.
Based on the above results, the appropriate error on each of the two validated datasets can be expressed as: