A large dataset of white blood cells containing cell locations and types, along with segmented nuclei and cytoplasm

Accurate and early detection of anomalies in peripheral white blood cells plays a crucial role in evaluating an individual's well-being and in the diagnosis and prognosis of hematologic diseases. For example, some blood disorders and immune system-related diseases are diagnosed by the differential count of white blood cells, which is one of the common laboratory tests. Data is one of the most important ingredients in the development and testing of many commercial and successful automatic or semi-automatic systems. To this end, this study introduces a freely accessible dataset of normal peripheral white blood cells called Raabin-WBC, containing about 40,000 images of white blood cells and color spots. To ensure the validity of the data, a significant number of cells were labeled by two experts. In addition, the ground truths of the nuclei and cytoplasm were extracted for 1145 selected cells. To provide the necessary diversity, various smears were imaged, and two different cameras and two different microscopes were used. We ran some preliminary deep learning experiments on Raabin-WBC to demonstrate how the generalization power of machine learning methods, especially deep neural networks, can be affected by this diversity. As a public dataset in the field of health, Raabin-WBC can be used for model development and testing in different machine learning tasks including classification, detection, segmentation, and localization.


Scientific Reports
| (2022) 12:1123 | https://doi.org/10.1038/s41598-021-04426-x www.nature.com/scientificreports/
decrease in the number of white blood cells. For example, in allergic diseases, the number of basophils increases, and in blood malignancies, we can see an increase in the number of precursors of blood cells and changes in their shape and size. Therefore, determining the correct type and number of white blood cells is very important for diagnosing various diseases. At present, manual methods (microscopic evaluation) and automated methods (using automatic hematology devices) are used to evaluate blood cells. Automated methods include devices that evaluate blood cells based on light scattering or electrical impedance, such as the Sysmex XP-300, the Nihon Kohden Blood Cell Counter, and the DH36 3-Part Auto Hematology Analyzer.
In electro-optical analyzers, a light-sensing detector measures the optical scattering, and the size of the detected pulses corresponds to the size of the blood cells. In electrical impedance (Coulter principle) cell counters, the passage of cells through an aperture across which an electric current is applied changes the electrical resistance. The resulting pulses, whose height corresponds to the volume of the cell, are counted; this is the basis of the Coulter principle [4]. Besides these methods, microscopic hyperspectral imaging, an emerging modality that combines spectroscopy and 2D imaging, is currently being used [6][7][8].
Apart from their high cost, a serious drawback of these devices is that they simply count cells without evaluating them qualitatively from a structural and morphological point of view. As a result, after a blood sample has been evaluated by such cell counters, laboratory staff still need to prepare a smear and examine it microscopically to reach an accurate diagnosis.
On the other hand, issues such as the lack of specialists and laboratory equipment, heavy workload, inexperience, and incorrect diagnosis affect the test results. Misdiagnosis affects the treatment regime and, consequently, can result in malpractice and increased costs. However, new technologies such as artificial intelligence and image processing allow quantitative and qualitative evaluations that improve the quality of diagnosis [9].
Over the past 20 years, techniques for automated imaging of stained blood slides have been introduced, based on computer-connected microscopes capable of assessing blood cell morphology. With the development

Table 1. Characteristics of white blood cells [4].

| WBCs | % in blood | Nucleus | Cytoplasm | Size (μm) |
|---|---|---|---|---|
| Neutrophils | 60% | Divided into 2 to 5 segments (multilobed); stains dark purple | Pale pink to tan with pink-purple granules | 12-16 |
| Eosinophils | 3% | Blue; divided into 2 segments | Pale pink-tan, filled with large orange and red granules | 14-16 |
| Basophils | 1% | Has 2 lobes that each stain purple; difficult to see | Pale pink-tan, but contains large purple/blue-black granules that obscure the cell nucleus | 14-16 |
| Monocytes | 6% | Single; kidney shaped (convoluted shape), bean shaped, or horseshoe shaped with deep indentation | Stains blue-gray with a "ground glass" appearance and tiny granules; vacuoles are sometimes present | 14-20 |
| Lymphocytes | 30% | Large, round or oval; dark staining | Absent or very small; pale blue, occasionally with purple-reddish granules | 8-15 |

[10]. In fact, today, deep neural networks are one of the most widely used machine learning methods for the classification and segmentation of medical images. Shahin et al. [11] used DNNs to classify white blood cells. In addition, these networks are used for the classification of red blood cells to detect sickle cell anemia [12]. Deep neural networks are also used for the segmentation of the pancreas in CT scan images [13,14] and the segmentation of MRI images [10]. Data play the most important role in the development of machine learning models. In order to train deep neural networks and increase their generalizability, we need a large amount of diverse, precise data with reliable labels. The labeling of medical data should be carried out by professionals and is, therefore, a time-consuming and challenging procedure. As a result, medical databases are of high significance in making medical diagnosis smarter. Unfortunately, researchers today have limited access to a variety of medical data for various reasons.
Examples of available medical image databases are [15] and [16]. The database in [15] contains 82 3D CT scans in which the ground truths of the pancreas for all slices were manually extracted by medical students and finalized by a specialist radiologist. Camelyon [16] is another dataset, with 1399 whole-slide images of lymph node smear samples with and without metastases, for which the labels were checked twice.
The morphological diversity of white blood cells is very high, and in some cases it is very challenging, even for an expert, to distinguish some classes from each other. Moreover, artificial intelligence articles on the segmentation and classification of white blood cells have evaluated their proposed methods in one of two ways: they have either collected small databases to the best of their ability [17][18][19][20] or used the small databases available [21][22][23]. Therefore, a database with a large amount of diverse data and reliable labeling is truly necessary to evaluate and compare different methods with each other. Such a reference database will allow more artificial intelligence scientists to enter the field and will help advance the intelligent differentiation of white blood cells. The most important characteristics of the Raabin-WBC dataset that distinguish it from similar datasets are as follows:
• Large number of data: We tried to collect as much data as possible for each class so that the dataset is appropriate for all machine learning techniques, especially deep learning (approximately 40,000 white blood cell images).
• Precise labels: We considered more detailed labels than the five types of white blood cells. In fact, the labels contain the most important subgroups of each type. For example, we considered the meta and band forms, which are subgroups of neutrophils and are valuable in diagnosis. More information about the labels is presented in the next section.
• Double labeling: For more assurance, most of the cells are labeled by two experts.
• Free public access: Since we aim to help the development of artificial intelligence in hematology, the Raabin-WBC dataset is freely available to all.
• Data cleaning: In the process of data collection, the existence of duplicate cell images is inevitable. The first problem is that duplicate cell images are not exactly the same; for example, the cell may have moved somewhat. The second issue is that more than two versions of one cell image may exist. Hence, we developed a fast graph-based image processing method that removes as many duplicate cells as possible. Despite this, some duplicate images may still remain, although they differ significantly from each other.
• The ground truths of nuclei and cytoplasm: The ground truths of the nuclei and cytoplasm are extracted for 1145 selected cells. To extract the ground truth of nuclei, we developed a toolkit that uses image processing techniques to make the ground truth extraction process much easier.
• Diversity of the microscope and camera: Although most of the data were collected with a fixed type of microscope and camera, we also collected some data with another type of camera and microscope. The experiments section shows how this new test data helps to evaluate the generalization power of our trained models. In other words, the diversity of the dataset assists us in selecting a model that has correctly learned the manifold of cell images.
The rest of the paper is organized as follows. "The characteristics of Raabin-WBC" section elaborates on the details of the dataset. "Data collection" section explains the data collection process in full. "Experiments" section reports some machine learning experiments and discusses the generalization power of the models.

The characteristics of Raabin-WBC
In this section, more information is provided about the Raabin-WBC dataset. About 73 peripheral blood films were used for collecting this dataset. After imaging stained blood films, we tried to mine the most possible useful information from the raw data. For instance, the bounding box of all white blood cells and artifacts were extracted, cropped and labeled, successively. It is worth noting that a significant number of WBCs and artifacts were labeled by two experts. Furthermore, we provided the ground truth of the nucleus and cytoplasm for some of the cropped cells. The full details of the data collection steps are explained in Sect. 3. In Table 3, some general and useful information of the Raabin-WBC dataset is provided. Note that these numbers have been computed after the cleaning phase.

Labels.
In the Raabin-WBC dataset, more detailed labels are considered than just the five general types of white blood cells. For example, besides the mature neutrophil, we have included two precursors of this white blood cell: metamyelocytes and bands. An increase in the number of band forms and metamyelocytes is one of the features of reactive neutrophilia (an increase in the number of circulating neutrophils to levels greater than 7.5 × 10^9/L) [5]. In addition, lymphocytes are divided into small lymphocytes (the main agents of the acquired immune system, including B and T cells) and large lymphocytes (activated small lymphocytes, also referred to as lymphoblasts). Bursts refer to smudge cells, which are leukocyte remnants formed during blood smear preparation. Besides the leukocytes, we considered drying artifacts as an additional label, because artifacts are commonly seen after staining the samples. The diagram of the labels is presented in Fig. 2.
Table 4 shows the number of labels assigned by the two experts. The rows and columns of the table belong to the first and second experts, respectively; note that 9015 cells have not been labeled yet. We asked our experts to label a cell as unrecognizable if they had any doubts. Indeed, 1099 cells were labeled as not recognized by both experts. The non-diagonal elements of Table 4 give the amount of disagreement for each pair of different labels. For example, large and small lymphocytes are often confused. Bands have also often been mistaken for mature neutrophils. Other examples of confusing pairs are artifact and burst, large lymphocyte and monocyte, and small lymphocyte and burst. The high numbers in the rows and columns labeled as not recognized indicate how challenging it is to identify the type of a white blood cell.
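As a rough illustration of how such a two-expert label matrix can be summarized, the following sketch computes the overall agreement rate and ranks the most frequently confused label pairs. The function name and the small 3 × 3 matrix are hypothetical toy values chosen for the example, not the counts from Table 4, and the "not labeled" column is assumed to have been dropped beforehand.

```python
from typing import List, Tuple


def agreement_summary(
    labels: List[str], matrix: List[List[int]]
) -> Tuple[float, List[Tuple[int, str, str]]]:
    """Summarize a square label matrix (rows: expert 1, columns: expert 2).

    Returns the fraction of cells on which both experts agree and the
    label pairs sorted by how often they were confused (both directions).
    """
    total = sum(sum(row) for row in matrix)
    agreed = sum(matrix[i][i] for i in range(len(labels)))
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            # Off-diagonal counts in both directions are disagreements.
            confused = matrix[i][j] + matrix[j][i]
            pairs.append((confused, labels[i], labels[j]))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return agreed / total, pairs


# Toy values for illustration only (NOT taken from Table 4).
toy_labels = ["Band", "Neutrophil", "Monocyte"]
toy_matrix = [
    [30, 8, 0],
    [4, 90, 1],
    [0, 2, 40],
]
rate, confused_pairs = agreement_summary(toy_labels, toy_matrix)
print(f"agreement rate: {rate:.3f}")
print("most confused pair:", confused_pairs[0])
```

In this toy example the band/neutrophil pair dominates the disagreements, which mirrors the kind of confusion described above.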
Data structure. The Raabin-WBC dataset consists of images that were taken from blood films (similar to Fig. 5). For each microscopic image, a dictionary file (.json format) is provided that contains the following information about that image:
• Information about the blood elements in the image, including their coordinates and labels. Most of the elements are labeled by two experts.
• Information about the blood smear, including the staining method and the type of disease. Note that all blood smears have been prepared from normal samples; only one Chronic Myeloid Leukemia (CML) sample has been used to extract basophils.
• Information about the microscope, including the type of microscope and its magnification.
• The type of camera used.
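To make the structure above concrete, here is a minimal sketch of reading one such per-image record and keeping only the elements on which both experts agree. The key names (`cells`, `label1`, `label2`, `x1`, and so on) and the example values are assumptions made for illustration; the actual schema of the Raabin-WBC .json files may differ and should be checked against the released data.

```python
import json

# Hypothetical record; every key name below is an assumption, not the real schema.
example_json = """
{
  "cells": [
    {"x1": 120, "y1": 80, "x2": 260, "y2": 230,
     "label1": "Neutrophil", "label2": "Neutrophil"},
    {"x1": 400, "y1": 310, "x2": 520, "y2": 440,
     "label1": "Large Lymph", "label2": "Monocyte"}
  ],
  "smear": {"staining": "Giemsa", "disease": "normal"},
  "microscope": {"type": "Olympus CX18", "magnification": "100x"},
  "camera": "smartphone"
}
"""


def double_labeled_boxes(record):
    """Return (label, bounding box) for elements with matching expert labels."""
    result = []
    for cell in record["cells"]:
        if cell["label1"] == cell["label2"]:
            box = (cell["x1"], cell["y1"], cell["x2"], cell["y2"])
            result.append((cell["label1"], box))
    return result


record = json.loads(example_json)
agreed = double_labeled_boxes(record)
print(agreed)
```

Filtering on matching labels in this way is one plausible route to a double-labeled subset of cropped cells.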
There is also a subset of the database called Double-labeled Raabin-WBC, which includes cropped images of the five main types of WBCs that were given the same label by both experts. We explain more about this sub-dataset in the experiments section.

Data collection
The data were collected from patients, and ethical approval was obtained from the ethics committee of the Hematology, Oncology, and SCT Research Center of Shariati Hospital. We confirm that all methods were performed in accordance with the relevant guidelines and regulations. The steps of data collection (Fig. 3) include preparing blood smears and photographing them, extracting the bounding boxes of white blood cells, data cleaning, and finally labeling the data and extracting ground truths. More details are explained in the rest of this section.
Preparation of blood smears and imaging. 72 normal peripheral blood films (male and female samples from ages 12 to 70) were used to collect neutrophil, eosinophil, monocyte, and lymphocyte images. Because basophils are very rare in normal specimens (< 1-2%) [9], basophils were imaged from one CML-positive sample. Owing to the widespread use of Giemsa in medical labs [9], all samples were stained with Giemsa. It should be noted that all samples were obtained from collaborating medical laboratories (Razi Hospital in Rasht, Gholhak Laboratory, Shahr-e-Qods Laboratory, and Takht-e Tavous Laboratory in Tehran, Iran) and we did not deal directly with patients. In Iran, patient approval is required only for clinical trials; it is not required for retrospective studies. The slides were imaged with two types of microscopes, namely Olympus CX18 and Zeiss, at a magnification of 100×. Since determining the Diff area to evaluate and count different types of white blood cells is of utmost importance, an expert laboratory staff member supervised the cell imaging process. With smartphones being widely used in society, a rapidly growing trend has emerged to adapt them to medical diagnostics [24,25]. The availability, ease of use, and low cost of the high-pixel-density cameras in smartphones have led to their wide use in various fields of science [26][27][28][29][30][31][32]. Therefore, in compiling this database, smartphone cameras were used, the details of which are given in Table 5. Smartphones can be adapted for microscopic imaging using some accessory equipment [33][34][35]. To facilitate the use of smartphones in microscopic imaging for this dataset, an adapter was designed and 3D printed to mount the smartphone on the microscope ocular lens (Fig. 4).
The designed adapter has somewhat managed to minimize the drawbacks of the commercial models available in the market such as restrictions on the size of the phone and ocular lenses, as well as the difficulty of the adjustment.

Extraction of white blood cells from images.
In total, about 23,000 images were taken from blood films. Each blood smear image contains many red blood cells, and it may also contain one or more other blood elements such as white blood cells and, sometimes, color spots. The bounding boxes of these blood elements need to be identified. For this purpose, two approaches have been considered. Due to the distinct color of the nucleus in white blood cells, in the first approach, several white blood cells were

Table 4. The number of labels associated with two experts. The rows and columns belong to the first and second experts, respectively.

| Expert 1 \ Expert 2 | Artifact | Band | Basophil | Burst | Eosinophil | Large lymph | Meta | Monocyte | Neutrophil | Small lymph | Not recognized | Not labeled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Artifact | 3489 | 0 | 0 | 14 | 2 | 1 | 0 | 0 | 6 | 4 | 96 | 225 |
| Band | 0 | 311 | 0 | 2 | 0 | 0 | 2 | 0 | 32 | 0 | 16 | 71 |
| Basophil | 0 | 0 | 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| Burst | 29 | 0 | 0 | 2673 | 1 | 11 | 0 | 4 | 1 | 32 | 96 | 525 |
| Eosinophil | 0 | 0 | 0 | 9 | 1466 | 0 | 0 | 0 | 1 | 1 | 13 | 607 |
| Large lymph | 0 | 0 | 0 | 1 | 0 | 2153 | 0 | 4 | 1 | 172 | 23 | |

Data cleaning. In the process of imaging the blood smears, a white blood cell may appear in more than one image (Fig. 5). Therefore, duplicate cell images exist among the cropped images. The major problem is that two images of one cell are not necessarily very similar. Consequently, a simple mean square error on the pixel values is not enough to detect duplicate cell images. Moreover, a cell can be repeated more than twice. Figure 6 shows an example of three images of one cell; as can be seen, the qualities of the three images are different. Manual pairwise comparison of these images is practically impossible. Hence, a fast and accurate artificial intelligence algorithm has been developed to remove duplicate cell images. We used the Python ImageHash library for this purpose. First, the Average Hash (AHash) and Perceptual Hash (PHash) values are calculated very quickly for all pairs of cropped images. A pair of images whose AHash and PHash distances are both below the corresponding thresholds is considered to show the same cell, and one of the two should be removed. The thresholds of the Average Hash and Perceptual Hash are set manually through trial and error (see appendix 1 for more details).
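The hash-based sameness test can be illustrated with a simplified, dependency-free sketch of the average-hash idea: each pixel of a small grayscale grid becomes one bit depending on whether it is brighter than the mean, and two images are compared by the Hamming distance between their bit strings. This is a stand-in for illustration only. The paper uses the ImageHash library (which also resizes the image first and provides a DCT-based perceptual hash), and the threshold value below is a hypothetical placeholder, not the one from appendix 1.

```python
def average_hash(gray):
    """Bit list: 1 where a pixel is brighter than the mean of the grid.

    `gray` is a small 2D list of grayscale intensities, assumed to be
    already downscaled (the real library resizes the image itself).
    """
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def is_duplicate(gray_a, gray_b, threshold):
    """Sameness condition: hash distance strictly below the threshold."""
    return hamming(average_hash(gray_a), average_hash(gray_b)) < threshold


# Toy 4x4 "images": cell_b is cell_a with a uniform brightness shift,
# other has its bright and dark halves swapped.
cell_a = [[10, 20, 200, 210], [15, 25, 205, 215]] * 2
cell_b = [[p + 10 for p in row] for row in cell_a]
other = [[200, 210, 10, 20], [205, 215, 15, 25]] * 2

AHASH_THRESHOLD = 3  # hypothetical value, tuned by trial and error in practice
print(is_duplicate(cell_a, cell_b, AHASH_THRESHOLD))  # brightness shift only
print(is_duplicate(cell_a, other, AHASH_THRESHOLD))   # different layout
```

Because the hash depends only on each pixel's relation to the mean, a uniform brightness change leaves it untouched, which is why such hashes tolerate small differences that a pixel-wise mean square error would not.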
Since a cell image may appear more than twice, pairwise comparison alone is not sufficient. To address this, the problem is formulated in terms of a graph with N nodes, where N is the number of cropped cell images from a blood film. An edge connects two nodes if the corresponding images satisfy the sameness condition. The connected components of this graph then correspond to groups of duplicate images. Connected components can be computed very quickly with the breadth-first search algorithm (see appendix 2 for more details). If a connected component has n > 1 images, n − 1 of them must be removed. To enhance the quality of the database, the image with the highest resolution among the n images is kept, and the rest are deleted. The OpenCV [37] library is used to compare the resolution of images: Sobel horizontal and vertical filters [38] are applied to each image, and the gradient magnitude is calculated for each pixel. Finally, the image with the highest average gradient magnitude is selected, because it is the sharpest one.
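The graph-based step can be sketched as follows: breadth-first search groups images into connected components, and within each component the image with the highest mean Sobel gradient magnitude is kept. This is a minimal illustration under stated assumptions, not the authors' implementation: images are plain 2D lists of grayscale values (at least 3 × 3), the Sobel filter is written out by hand instead of calling OpenCV, and the function names and toy data are invented for the example.

```python
from collections import deque


def connected_components(n, edges):
    """Group node indices 0..n-1 into connected components using BFS."""
    adjacency = [[] for _ in range(n)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if not seen[neighbor]:
                    seen[neighbor] = True
                    queue.append(neighbor)
        components.append(component)
    return components


def mean_gradient_magnitude(gray):
    """Average Sobel gradient magnitude over interior pixels (sharpness)."""
    height, width = len(gray), len(gray[0])
    total = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            total += (gx * gx + gy * gy) ** 0.5
    return total / ((height - 2) * (width - 2))


def images_to_keep(images, duplicate_edges):
    """Keep the sharpest image of every group of duplicates."""
    keep = []
    for component in connected_components(len(images), duplicate_edges):
        keep.append(max(component, key=lambda i: mean_gradient_magnitude(images[i])))
    return sorted(keep)


# Toy data: images 0, 1, 2 show the same cell at different sharpness.
blurry = [[0, 85, 170, 255]] * 4
sharp = [[0, 0, 255, 255]] * 4
medium = [[0, 40, 215, 255]] * 4
unrelated = [[100, 100, 100, 100]] * 4
images = [blurry, sharp, medium, unrelated]
edges = [(0, 1), (1, 2)]  # pairs that satisfied the sameness condition
kept = images_to_keep(images, edges)
print(kept)
```

Note that images 0 and 2 are never compared directly; they end up in the same group only through image 1, which is exactly the case that pairwise comparison alone would miss.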
As described in Sect. 2, to offer full information, we provide our data in the format of large, uncropped images (like Fig. 5). For each large image, the coordinates and the labels of the cells it contains are provided. We tried to remove as many duplicates as possible from the large images. Specifically, we remove a large image if all the cells it contains also appear inside another image. For example, in Fig. 5, image b is removed.
Figure 4. Designed adapter to mount smartphones on the ocular lens of a microscope to make capturing photos of the samples quicker and easier. Experts work with the microscope manually, see the images on the mounted smartphone, and take photos.

Labeling process
This section describes the labeling process, which involves determining the cell types and the ground truth of the nucleus and cytoplasm. As you can see in Table 1, the characteristics of the nucleus and cytoplasm can significantly affect determining the type of the cell. Some papers 19,39 extract different features from the nucleus and the cytoplasm to classify white blood cells. These features usually describe the shape and the color of the nucleus and the cytoplasm.
Cell type labeling. For labeling cells, two applications were developed for Android (Fig. 7). One application is for labeling cropped cells (Fig. 7-part b), and the other is for selecting the location and type of each cell (Fig. 7-part a). Furthermore, a desktop application with the help of the Python Tkinter library 40 was developed for manually selecting the location and type of the cells (Fig. 8). It is worth mentioning that most of the images were labeled by two experts.
Ground truths of the nucleus and the cytoplasm. In recent years, many researchers have developed segmentation algorithms for the cytoplasm and nucleus of the white blood cells 3,[18][19][20][21]23 . Hence, we tried to prepare the ground truths of the cytoplasm and the nucleus for a proper number of cropped white blood cells. For this purpose, 1145 cropped images including 242 lymphocytes, 242 monocytes, 242 neutrophils, 201 eosinophils, and 218 basophils were randomly selected, and their ground truths were extracted by an expert. It is worth mentioning that we only prepared the ground truth of the whole cell for basophils, and we were not able to produce the ground truths of the nucleus and cytoplasm for basophils. This is because the basophils are usually covered by very purple granules, and the border between cytoplasm and nucleus is not easily visible. Figure 9 shows some samples of the cells along with their ground truths.
To produce the ground truths of nuclei, a newly published software called Easy-GT [41] was employed. This software has been developed to extract the ground truths of nuclei. In Easy-GT, a nucleus is determined by a relatively accurate segmentation method, and if necessary, the user can adjust the ground truth of the nucleus by modifying the final threshold [41] (Fig. 10). In the segmentation process, the RGB image is first color-balanced [41] and converted to the CMYK color space. Secondly, the two-class Otsu's thresholding algorithm [42] applied to the M channel gives a threshold ($th_{2class}$). Next, the three-class Otsu thresholding algorithm is applied to the M channel, and the lower and upper thresholds ($th_{3class}^{low}$, $th_{3class}^{up}$) are extracted. Finally, the ultimate threshold value is obtained by computing a convex combination of $th_{2class}$ and $th_{3class}^{up}$. To make the ground truth of the cytoplasm, a light pen was used, and the ground truth of the whole cell was specified by an expert. Finally, by removing the nucleus part obtained from Easy-GT, only the cytoplasm remains.
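The thresholding pipeline described above can be sketched in plain Python. This is an illustrative reconstruction under several assumptions, not the Easy-GT code: the color-balancing step is omitted, the magenta channel uses the standard textbook RGB-to-CMYK conversion, the Otsu thresholds are found by exhaustive search over a 256-bin histogram, and the mixing weight `alpha` is a hypothetical parameter (in Easy-GT the user adjusts the final threshold interactively).

```python
def magenta_channel(rgb_pixels):
    """Magenta (M) of CMYK, scaled to 0..255, for a list of (r, g, b) tuples."""
    values = []
    for r, g, b in rgb_pixels:
        k = 1 - max(r, g, b) / 255
        m = 0.0 if k == 1 else (1 - g / 255 - k) / (1 - k)
        values.append(round(255 * m))
    return values


def histogram(values, bins=256):
    hist = [0] * bins
    for v in values:
        hist[v] += 1
    return hist


def _prefix_sums(hist):
    weights, sums = [0], [0]
    for level, count in enumerate(hist):
        weights.append(weights[-1] + count)
        sums.append(sums[-1] + level * count)
    return weights, sums


def _class_term(weights, sums, lo, hi):
    """w * mean^2 for bins lo..hi; summing it over classes ranks Otsu splits."""
    w = weights[hi + 1] - weights[lo]
    s = sums[hi + 1] - sums[lo]
    return s * s / w if w else 0.0


def otsu_two_class(hist):
    """Threshold t (class 0 is values <= t) maximizing between-class variance."""
    weights, sums = _prefix_sums(hist)
    last = len(hist) - 1
    return max(range(last),
               key=lambda t: _class_term(weights, sums, 0, t)
               + _class_term(weights, sums, t + 1, last))


def otsu_three_class(hist):
    """Lower and upper thresholds of the three-class Otsu split."""
    weights, sums = _prefix_sums(hist)
    last = len(hist) - 1
    best_score, best_pair = -1.0, (0, 1)
    for t1 in range(last - 1):
        for t2 in range(t1 + 1, last):
            score = (_class_term(weights, sums, 0, t1)
                     + _class_term(weights, sums, t1 + 1, t2)
                     + _class_term(weights, sums, t2 + 1, last))
            if score > best_score:
                best_score, best_pair = score, (t1, t2)
    return best_pair


def final_threshold(th_2class, th_3class_up, alpha=0.5):
    """Convex combination of the two thresholds; alpha is an assumed weight."""
    return alpha * th_2class + (1 - alpha) * th_3class_up


# Toy M-channel values with three clear groups (illustrative only).
m_values = [20] * 50 + [120] * 30 + [220] * 20
hist = histogram(m_values)
th2 = otsu_two_class(hist)
th3_low, th3_up = otsu_three_class(hist)
th_final = final_threshold(th2, th3_up)
print(th2, th3_low, th3_up, th_final)
```

A convex combination keeps the final threshold between the two Otsu estimates, so changing the weight moves the nucleus boundary smoothly between a looser and a stricter segmentation.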

Experiments
In this section, we carry out some machine learning experiments on the Raabin-WBC data. Due to the diversity of information in the database, many lines of research can be developed. Here, we consider the most common experiment: we classify five classes of white blood cells and leave the rest to those who are interested in this field. For this purpose, we used the double-labeled cropped cells and considered only the five main classes: mature neutrophils, lymphocytes (small and large), eosinophils, monocytes, and basophils. We call this sub-dataset Double-labeled Raabin-WBC. In the following, we compare this database with some existing 5-class databases and train some popular deep neural networks. We also discuss the generalization power of the models.
A comparison with similar datasets. Various datasets of normal peripheral blood with different properties exist, but in general, most of them have a small number of samples. This is due to the fact that in the medical field, data collection and labeling are complicated. On the other hand, in the field of Hematology, artificial intelligence models are usually sensitive to some specifications of the dataset such as the number of data, the staining technique, the microscope and camera used, and the magnification. So, by altering the aforementioned characteristics, the accuracy of the models may be reduced. In Table 6, the characteristics of some datasets are presented and compared with Double-labeled Raabin-WBC. As you can see, our database is far better in several ways including data number, label assurance, ground truth, camera, and microscope variety. Most importantly, this database is available to everyone for free.
Utilized models. Some popular pre-trained deep neural networks were trained on Double-labeled Raabin-WBC to classify five types of white blood cells. VGG16 [43], the oldest of the CNN models used, consists of alternating convolutional and pooling layers. From the deep residual network families, Resnet18 [44], Resnet34 [44], Resnet50 [44], and Resnext50 [45] were tested. The Resnet architecture uses identity shortcut connections that skip one or more layers [44]. Resnext is an extension of Resnet in which the residual block is replaced by a new aggregation component [45]. In this aggregation component, the input feature map is projected to several lower-dimensional representations, and their outputs are aggregated [45]. Another CNN used in the experiments is DenseNet121 [46], which consists of dense blocks. In each dense block, each layer receives input from all previous layers, and its outputs are passed to all subsequent layers. Another tested deep architecture is MobileNet-V2 [47], which is suitable for mobile devices. The building block of MobileNet-V2 is an inverted residual block, and non-linearities are removed from narrow layers. MnasNet1 [48] and ShuffleNet-V2 [49] are other lightweight CNNs for mobile devices. In MnasNet, reinforcement learning is employed to find an efficient architecture [48]. In ShuffleNet-V2 [49], a split unit at the beginning of each basic block divides the input channels into two branches, and concatenation and channel shuffling occur at the end of the block. Besides the aforementioned neural networks, we also utilized a feature-based method [50] in which the nucleus is segmented first and its convex hull is then obtained. After that, shape and color features are extracted using the segmented nucleus and its convex hull. Finally, WBCs are categorized by an SVM model.
Classification results. The generalization power of the models described in the previous section is examined at two levels. For this purpose, we split the data into three groups: training data, test-A, and test-B, the properties of which are given in Table 7. The quality of the images in the test-A dataset is similar to that of the training dataset, but the images in the test-B dataset differ in terms of camera type and microscope type. Unfortunately, the test-B data only contains double-labeled neutrophils and lymphocytes.
The training data are imbalanced; that is, the number of cells differs considerably between classes. Hence, the training set was augmented to reduce this imbalance, using augmentation methods such as horizontal flip, vertical flip, rescaling, and combinations of them. In order to evaluate the models, four metrics are considered for each class:
9, the results of the feature-based classification presented in [50] are shown. In Fig. 11, the plots of the accuracy and the loss on the training and validation data for the nine pre-trained models are shown. The results are surprising: all methods achieve an acceptable outcome on the test-A data, yet the performance of most of the models decreases dramatically on the test-B data. The feature-based method [50] had the smallest performance reduction, despite having the lowest accuracy on the test-A data. Among the deep neural networks, the VGG16 [43] network generalizes relatively better. It can be said that the feature-based method extracted more meaningful features from cell images than the deep neural networks did. If we had not tested the models on the test-B data, we would have thought that we had trained a strong classification model; yet, this was not the case. We do not want to conclude from this experiment that deep neural networks have less generalization power than feature-based methods. If we applied appropriate pre-processing to the images before training or used smarter image augmentation methods, the performance of the deep neural networks would likely be better. This experiment clearly illustrates the role of the dataset in the training of machine learning models.
Table 9. The results of different pre-trained models as well as Tavakoli et al. [50] on the test-B dataset.
All training processes were carried out on a single NVIDIA GeForce RTX 2080 Ti graphics card using Python 3.6.9 and the PyTorch library, version 1.5.1. We trained for 15 epochs with a starting learning rate of 0.001 and a batch size of 10. The learning rate was decayed by a factor of 0.1 with a step size of 7. Stochastic gradient descent was used as the optimization method. We used the Torchvision library to load networks pre-trained on the ImageNet dataset [51]. The output size of the last linear layer was changed from 1000 to 5.
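To make the learning-rate schedule explicit, the short sketch below computes the rate for each of the 15 epochs from the stated settings (starting rate 0.001, decay factor 0.1, step size 7). It is a plain-Python illustration of step decay, which is the behavior of a step scheduler such as PyTorch's `StepLR`; the function name and the 0-based epoch indexing are assumptions made for the example.

```python
def step_decay_lr(epoch, base_lr=0.001, step_size=7, gamma=0.1):
    """Learning rate for a 0-based epoch under step decay.

    The rate is multiplied by `gamma` once for every `step_size`
    completed epochs.
    """
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_decay_lr(epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Under these assumptions, epochs 0 to 6 use 0.001, epochs 7 to 13 use 0.0001, and the final epoch uses 0.00001.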

Conclusion
By evaluating the peripheral white blood cells, a wide range of benign diseases such as anemia and malignant ones such as leukemia can be detected. Moreover, early detection of some of these abnormalities, such as acute lymphoid leukemia, can help the treatment process despite the lethality of the disease. Therefore, it is important to adopt methods that can be effective in the early detection of different diseases. The role of machine learning methods in intelligent medical diagnostics is becoming more and more prominent these days. Indeed, deep neural networks are revolutionizing the medical diagnosis process and are considered state of the art. Since deep neural networks usually have a huge number of trainable parameters, overfitting is a real risk. Therefore, diversity in the training data is necessary and cannot be ignored. In medical diagnostics in particular, this diversity matters even more, because medical devices can be very diverse. For example, in the field of hematology, the type of microscope and camera is very influential. To this end, we collected a large, freely available dataset of white blood cells from normal peripheral blood that provides a reasonable degree of this diversity. This multipurpose dataset can serve as a reference dataset for the evaluation of different machine learning tasks such as classification, detection, segmentation, and localization.