## Introduction

Sickle cell disease (SCD) is the most common inherited hematologic disorder worldwide and a public health priority1. The majority of the world’s SCD burden is in sub-Saharan Africa, where it affects millions of people of all ages. It is estimated that 200,000–300,000 children are born with SCD every year in Africa alone2,3. The prevalence of the disease varies across countries, being approximately 20% in Cameroon, Ghana, and Nigeria and rising to as much as 45% in some parts of Uganda3.

SCD is an inherited disorder caused by a point mutation in hemoglobin formation, which causes the polymerization of hemoglobin and distortion of red blood cells in the deoxygenated state. As a result, the normally biconcave disc-shaped red blood cells become crescent or sickle-shaped in people living with SCD. These red blood cells are markedly less deformable, have one-tenth the life span of a healthy cell, and can form occlusions in blood vessels. Children with SCD also suffer from spleen auto-infarction, which adds significantly to the burden of disease. Losing splenic function, these children are at high risk for infections at an extremely young age, which significantly increases mortality rates4. Due to the lack of diagnosis and treatment, over 50% of children with SCD in low- and middle-income countries will die5.

Various methods have been developed for screening and diagnosis of SCD, including laboratory-based methods such as high performance liquid chromatography (HPLC)6, isoelectric focusing7, and hemoglobin extraction8. In addition to these relatively costly laboratory-based methods, SCD diagnostic tests have been developed for point-of-care (POC) use9,10,11,12,13,14. These POC tests are mainly based on human reading; human errors, along with the storage requirements of these tests (e.g., controlled temperature and moisture to preserve chemical activity/function), partially limit their effectiveness for SCD screening, especially in resource-limited settings14.

An alternative method used for screening of SCD involves microscopic inspection of blood smear samples by trained personnel. In fact, each year hundreds of thousands of blood smear slides are prepared in sub-Saharan Africa to diagnose blood cell infections and disorders15. Peripheral blood smears, exhibiting variations in e.g., the size, color, and shape of the red blood cells, can provide diagnostic information on blood disorders including SCD16. In addition to diagnosis, inspection of blood smears is also frequently used for evaluation of treatment and routine monitoring of patients17. Preparation of these blood smear slides is rather straightforward (i.e., can be performed by minimally-trained personnel), rapid and inexpensive. However, this method requires a trained expert to operate a laboratory microscope and perform manual analysis once the blood smear is prepared; the availability of such trained medical personnel for microscopic inspection of blood smears is limited in resource-scarce settings, where the majority of people with SCD live18. In an effort to provide a solution to this bottleneck, deep learning-based methods have previously been used to classify19 and segment20 different types of red blood cells from digital images that were acquired using laboratory-grade benchtop microscopes equipped with oil-immersion objective lenses. However, these earlier works focused on cell-level detection rather than slide-level classification, and therefore did not demonstrate patient-level diagnosis or screening of SCD.

As an alternative to benchtop microscopes, smartphone-based microscopy provides a cost-effective and POC-friendly platform for microscopic inspection of samples, making it especially suitable for use in resource limited settings21,22,23. Smartphone microscopy has been demonstrated for a wide range of applications, including e.g., the imaging of blood cells24,25, detection of viruses and DNA26,27, quantification of immunoassays28,29,30,31 and microplates32 among many others33,34,35,36. Recently, machine learning approaches have also been applied to smartphone microscopy images for automated classification of parasites in soil and water37,38.

### Imaging of thin blood smears

We used thin blood smear slides for image analysis. Our ground truth microscope images were obtained using a scanning benchtop microscope (model: Aperio Scanscope AT) at the Digital Imaging Laboratory of the UCLA Pathology Department. The standard smartphone camera application was used to capture the corresponding input images using the smartphone-based microscope, using auto focus, ISO 100, and auto exposure.

Areas of the samples captured using the smartphone microscope were co-registered to the corresponding fields-of-view captured using the benchtop microscope (please refer to “Image co-registration” in “Methods” section for details). Three board-certified medical doctors labeled the sickle cells within the images captured using the benchtop microscope using a custom-designed graphical user interface (GUI). As the images are co-registered, these labels were used to mark the locations of the sickle cells within the smartphone images, forming our training image dataset. We captured the images on the feathered edges of the blood smear slides, where the cells are dispersed as a monolayer.

Images from blood smears containing cells that had been scraped and damaged were excluded from the dataset, as the cut cells can appear similar to sickle cells (see e.g., Supplementary Fig. 2). One normal blood smear was accordingly excluded, as we were unable to capture a sufficient number of usable fields-of-view due to the poor quality of the blood smear, with many scratches on its surface. Blood smears from four patients who tested positive for SCD and were taking medicine for treatment were also excluded from the study, since their smears did not contain sickle cells when viewed by a board-certified medical expert.

### Image co-registration

The co-registration between the smartphone microscope images and those taken by the clinical benchtop microscope (NA = 0.75) was done using a series of steps. For the first step, these images are scaled to match one another by bicubically down-sampling the benchtop microscope image to match the size of those taken by the smartphone. Following this, they are roughly matched using an algorithm which creates a correlation matrix between each smartphone image and the stitched whole slide image captured using the benchtop lab-grade microscope. The area with the highest correlation is the field of view which matches the smartphone microscope image and is cropped from the whole slide image. An affine transformation was then calculated using MATLAB’s (Release R2018a, The MathWorks, Inc.) multimodal registration framework which extracts feature vectors and matches them to further correct the size, shift, shear, and account for any rotational differences45. Finally, the images were matched to each other using an elastic pyramidal registration algorithm to match the local features39. This step accounts for the spherical aberrations, which are extensive due to the nature of the inexpensive optics coupled to the smartphone camera. This algorithm co-registers the images at a subpixel level by progressively breaking the image up into smaller and smaller blocks and uses cross-correlation to align them.
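The coarse matching step can be illustrated with a brute-force normalized cross-correlation search. This is only a minimal sketch under stated assumptions: both images are taken to be grayscale arrays already scaled to the same pixel size, and the function name `find_best_match` and the toy data are hypothetical. The exact correlation measure used in this work is not specified in the text.

```python
import numpy as np

def find_best_match(whole_slide, patch):
    """Return the top-left (row, col) where `patch` correlates best, and the score."""
    H, W = whole_slide.shape
    h, w = patch.shape
    p = patch - patch.mean()
    p_norm = np.sqrt((p ** 2).sum())
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            win = whole_slide[r:r + h, c:c + w]
            wz = win - win.mean()
            denom = np.sqrt((wz ** 2).sum()) * p_norm
            # Normalized correlation coefficient between the window and the patch.
            score = (wz * p).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Toy example: a "smartphone" patch cut out of a random "whole slide" image.
rng = np.random.default_rng(0)
slide = rng.random((40, 40))
patch = slide[12:20, 25:33].copy()
pos, score = find_best_match(slide, patch)
```

The cropped region at `pos` would then be passed on to the affine and elastic registration steps, which are not sketched here.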

### Image enhancement neural network

Due to the variations among the images taken by the smartphone microscope, a neural network is used to standardize images and improve their quality in terms of spatial and spectral features. These variations stem from e.g., changing exposure time, aberrations (including defocus), chromatic aberrations due to source intensity instability, mechanical shifts, etc. Some examples of the image variations that these aberrations create can be seen in Supplementary Fig. 3. The quality of the images taken by a smartphone microscope can be improved and transformed so that they closely resemble those taken with a state-of-the-art benchtop microscope by using a convolutional neural network39. Our image normalization and enhancement network uses the U-net architecture as shown in Fig. 5a46. The U-net is made up of three “down-blocks” followed by three “up-blocks”. Each one of these blocks is made up of three convolutional layers, which use a zero padded 3 × 3 convolution kernel and a stride of one to maintain the size of the matrices. After each of the convolutional layers, the leaky ReLU activation function is applied, which can be described as:

$${\mathrm{Leaky}}\,{\mathrm{ReLU}}\,({\mathrm{x}}) = \left\{ {\begin{array}{*{20}{c}} x & \quad{{\mathrm{where}}\,x > 0} \\ {0.1x} & {{\mathrm{otherwise}}} \end{array}} \right.,$$
(1)

where x is the tensor that the activation function is being applied to.
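As a small numerical illustration of Eq. 1, the activation can be written in NumPy as follows. The function name `leaky_relu` is ours and is used for illustration only.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Leaky ReLU of Eq. 1: x where x > 0, slope * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, slope * x)

out = leaky_relu([-2.0, 0.0, 3.0])
```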

In the case of the down-block, the second of these layers increases the number of channels by a factor of two, while the second convolutional layer in the up-block reduces the number of channels by a factor of four. The down-blocks are used to reduce the size of the images using an average pooling layer with a kernel size and a stride of two, so that the network can extract and use features at different scales. The up-blocks return the images to the same size by bilinearly up-sampling the images by a factor of two. Between each of the blocks of the same size, skip connections are added to allow information to pass by the lower blocks of the U-net. These skip connections concatenate the up-sampled images with the data from the down-blocks, doubling the number of channels. As the up-blocks reduce the number of channels by a factor of four and the skip connections double the number of channels, the total number of channels in each subsequent up-block is halved. Between the bottom blocks, a convolution layer is also added to allow processing of those large-scale features. The first convolutional layer of the network initially increases the number of channels to 32, while the last one reduces the number back to the three channels of the RGB color space to match the benchtop microscope images (ground truth). The network was trained using the adaptive moment estimation (Adam) optimizer with a learning rate of 1 × 10⁻⁴.
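The channel bookkeeping described above can be checked with a short worked example. This sketch only traces channel counts and builds no network. It assumes that each skip connection carries the output of the down-block at the matching scale, which is our reading of the description, and the function name `trace_channels` is hypothetical.

```python
def trace_channels(first_conv=32, depth=3, out_channels=3):
    """Trace the number of channels through the described U-net."""
    trace = [("first conv", first_conv)]
    c = first_conv
    skips = []
    for d in range(depth):
        c *= 2                     # second conv of each down-block doubles the channels
        skips.append(c)            # kept for the skip connection at this scale
        trace.append((f"down-block {d + 1}", c))
    trace.append(("bottom conv", c))
    for d in range(depth):
        c += skips.pop()           # concatenation doubles the channel count
        c //= 4                    # up-block reduces the channels by a factor of four
        trace.append((f"up-block {d + 1}", c))
    trace.append(("last conv", out_channels))
    return trace

channels = [c for _, c in trace_channels()]
```

Under these assumptions the count halves from one up-block to the next, returning to 32 channels before the last convolution maps them to the three RGB channels.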

The image enhancement network is trained using a combination of two loss functions, described by the equation:

$$L_{{\mathrm{Network}}} = L_1\left\{ {z,G\left( x \right)} \right\} + \lambda \ast TV\left( {G\left( x \right)} \right),$$
(2)

where an L1 (mean absolute error) loss function is used to train the network to perform an accurate transformation, while the total variation (TV) loss is used as a regularization term. λ is a constant set to 0.03; this constant makes the total variation ~5% of the overall loss. G(x) represents the image generated by the network from the input image x, and z is the corresponding benchtop microscope image (ground truth). The L1 loss can be described by the following equation47:

$$L_1\left\{ {z,G\left( x \right)} \right\} = \frac{1}{{N_{{\mathrm{channels}}} \times M \times N}}\mathop {\sum}\limits_{{\mathrm{n}} = 1}^{N_{{\mathrm{channels}}}} {\mathop {\sum}\limits_{{\mathrm{i,j}} = 1}^{{\mathrm{M,N}}} {\left| {G(x)_{{\mathrm{i,j,n}}} - z_{{\mathrm{i,j,n}}}} \right|} },$$
(3)

where $N_{\mathrm{channels}}$ is the number of channels, n is the channel number, M and N are the width and height of the image in pixels, and i and j are the pixel indices. The total variation loss is described by the following equation48:

$$\begin{array}{ll}TV(G\left( x \right)) = \frac{1}{{N_{{\mathrm{channels}}} \times M \times N}}\mathop {\sum}\limits_{n = 1}^{N_{{\mathrm{channels}}}} \mathop {\sum}\limits_{{\mathrm{i,j}} = 1}^{M,N} \left( \left| {G(x)_{{\mathrm{i}} + 1,{\mathrm{j,n}}} - G(x)_{{\mathrm{i,j,n}}}} \right| \right. \\ \left. \qquad\qquad\quad + \left| {G(x)_{{\mathrm{i,j}} + 1,{\mathrm{n}}} - G(x)_{{\mathrm{i,j,n}}}} \right| \right),\end{array}$$
(4)
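Equations 2 to 4 can be evaluated numerically as in the sketch below. It is an illustration rather than the training code: images are assumed to be arrays of shape (M, N, channels), the TV differences at the last row and column are simply omitted (the boundary handling is not stated in the text), and the function names are hypothetical.

```python
import numpy as np

def l1_loss(z, g):
    """Mean absolute error between ground truth z and network output g (Eq. 3)."""
    return np.abs(g - z).mean()

def tv_loss(g):
    """Total variation of g (Eq. 4), normalized by N_channels * M * N."""
    m, n, ch = g.shape
    d_i = np.abs(g[1:, :, :] - g[:-1, :, :]).sum()   # differences along i
    d_j = np.abs(g[:, 1:, :] - g[:, :-1, :]).sum()   # differences along j
    return (d_i + d_j) / (ch * m * n)

def network_loss(z, g, lam=0.03):
    """Combined loss of Eq. 2 with lambda = 0.03."""
    return l1_loss(z, g) + lam * tv_loss(g)

# Toy 2 x 2 single-channel example with one bright pixel in the output.
z = np.zeros((2, 2, 1))
g = np.zeros((2, 2, 1))
g[0, 0, 0] = 1.0
loss = network_loss(z, g)   # L1 = 0.25, TV = 0.5
```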

The network was trained for 604,000 iterations (118.5 epochs) using a batch size of 16. The data were augmented through random flips and rotations of the training images by multiples of 90 degrees.

For this image enhancement network training, there is no need for manual labeling of cells by a trained medical professional, and therefore this dataset can be made diverse very easily. Because of this, it can also be expanded upon quickly, as all that is required is for additional images of the slides to be captured by both microscopy modalities and co-registered with respect to each other. Therefore, the network was able to more easily cover the entire sample space to ensure accurate image normalization and enhancement. The training image dataset consisted of 520 image pairs coming from ten unique blood smears. Each of these images has 1603 × 1603 pixels and is randomly cropped into 128 × 128 pixel patches to train the network. Several examples of direct comparisons between the network’s output and the corresponding field of view captured by the benchtop microscope can be seen in Supplementary Fig. 1a.
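The random cropping and the flip/rotation augmentation can be sketched as follows. The key point is that the same crop, rotation, and flip must be applied to both images of a co-registered pair. This is an assumed implementation for illustration; the function name `random_training_patch` and the 50% flip probability are our own choices.

```python
import numpy as np

def random_training_patch(inp, target, size=128, rng=None):
    """Cut the same random patch from a co-registered pair and augment both alike."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = inp.shape[:2]
    r = int(rng.integers(0, h - size + 1))
    c = int(rng.integers(0, w - size + 1))
    a = inp[r:r + size, c:c + size]
    b = target[r:r + size, c:c + size]
    k = int(rng.integers(0, 4))            # rotation by a multiple of 90 degrees
    a, b = np.rot90(a, k), np.rot90(b, k)
    if rng.random() < 0.5:                 # random horizontal flip (assumed probability)
        a, b = np.flip(a, axis=1), np.flip(b, axis=1)
    return a, b

# Toy pair: the "target" is a scaled copy of the input, so pairing is easy to verify.
inp = np.arange(200 * 200 * 3, dtype=float).reshape(200, 200, 3)
a, b = random_training_patch(inp, 2 * inp, rng=np.random.default_rng(0))
```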

### Mask creation for training the cell segmentation network

Once the cells were labeled by board-certified medical experts and the images were co-registered, the cell labels were used to create a mask which constitutes the ground truth of the segmentation network; this mask creation process is a one-time training effort and used to train the cell segmentation neural network used in our work. These training masks were generated by thresholding the benchtop microscope images according to color and intensity to determine the locations of all the healthy and the sickle cells. The exact thresholds were chosen manually for each slide due to minor color variations between the blood smears; once again, this is only for the training phase. As the centers of some red blood cells were the same color as the background, holes in the mask were filled using MATLAB’s imfill command, a morphological operator. Following this, the mask was eroded by four pixels in order to eliminate sharp edges and eliminate pixels misclassified due to noise. Any cell labeled by the medical expert as a sickle cell was set as a sickle cell, while any unlabeled red blood cell was set as a normal cell for training purposes. White blood cells, platelets and the background were all labeled as a third background class. As the medical experts might have randomly missed some sickle cells within each field of view, a 128 × 128 region around each labeled sickle cell was cut out of the slide for training, reducing the unlabeled area contained within the training dataset. The remaining sections of the labeled slides were removed from the training dataset. At the end of this whole process, which is a one-time training effort, three classes are defined for the subsequent semantic segmentation training of the neural network: (1) sickle, (2) normal red blood cell, and (3) background.
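Two of the steps above, the erosion of the thresholded mask and the assignment of the three classes, can be illustrated in NumPy. This is a simplified sketch: the hole filling (MATLAB's imfill) is not reproduced, the cross-shaped structuring element is an assumption, the expert labels are assumed to be available as a binary region mask, and classes are numbered from 0 (background) to 2 (sickle) for convenience. All function names are hypothetical.

```python
import numpy as np

def erode(mask, iterations=4):
    """Binary erosion with a cross-shaped (4-neighbour) structuring element."""
    mask = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        p = np.pad(mask, 1, constant_values=False)
        mask = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                & p[1:-1, :-2] & p[1:-1, 2:])
    return mask

def build_label_map(cell_mask, sickle_mask):
    """0 = background, 1 = normal red blood cell, 2 = sickle cell."""
    labels = np.zeros(cell_mask.shape, dtype=np.uint8)
    labels[cell_mask] = 1
    labels[cell_mask & sickle_mask] = 2
    return labels

# Toy example: one 12 x 12 "cell" that an expert marked as a sickle cell.
cells = np.zeros((20, 20), dtype=bool)
cells[4:16, 4:16] = True
eroded = erode(cells, iterations=4)        # shrinks to the central 4 x 4 region
labels = build_label_map(eroded, sickle_mask=cells)
```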

### Semantic segmentation

A second deep neural network is used to perform semantic segmentation of the blood cells imaged by our smartphone microscope. This network has the same architecture as the first image enhancement network (U-net). However, as this network performs segmentation, it uses the SoftMax cross entropy loss function to differentiate between the three classes (sickle cell, normal red blood cell, and background). In order to reduce the number of false positives as much as possible, the normal cell class is given twice the weight of the background and the sickle cells in the loss function. The overall loss function for the segmentation network, $L_{\mathrm{Segmentation}}$, is described in Eq. 5:

$$L_{{\mathrm{Segmentation}}} = - \frac{1}{{M \times N}}\mathop {\sum}\limits_{{\mathrm{i,j}} = 1}^{M,N} \left[ {a_{{\mathrm{i,j}},1}\log \left( {p_{{\mathrm{i,j}},c = 1}} \right) + 2a_{{\mathrm{i,j}},2}\log \left( {p_{{\mathrm{i,j}},c = 2}} \right) + a_{{\mathrm{i,j}},3}\log \left( {p_{{\mathrm{i,j}},c = 3}} \right)} \right],$$
(5)

where M and N are the width and height of the image in pixels, and i and j are the pixel indices, as above. $a_{\mathrm{i,j,c}}$ is the ground-truth binary label for each pixel (i.e., 1 if the pixel belongs to that class, 0 otherwise), and c denotes the class number (c = {1,2,3}), where the first class represents the background, the second class is for healthy cells, and the third class is for sickle cells. The probability $p_{\mathrm{i,j,c}}$ that a class c is assigned to pixel i, j is calculated using the softmax function:

$$p_{{\mathrm{i,j,c}}} = \frac{{\exp \left( {y_{{\mathrm{i,j,c}}}} \right)}}{{\mathop {\sum}\nolimits_{k = 1}^3 {\exp \left( {y_{{\mathrm{i,j,k}}}} \right)} }},$$
(6)

where y is the output of the neural network.
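The weighted loss of Eqs. 5 and 6 can be written out in NumPy as follows. This is an illustrative sketch, not the TensorFlow training code. Classes are indexed from 0 here (0 = background, 1 = healthy, 2 = sickle) instead of 1 to 3 as in the equations, and the function names are hypothetical.

```python
import numpy as np

def softmax(y):
    """Softmax over the last (class) axis, Eq. 6, with the usual max-shift for stability."""
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def segmentation_loss(y, labels, weights=(1.0, 2.0, 1.0)):
    """Weighted softmax cross entropy, Eq. 5; the healthy class has twice the weight."""
    p = softmax(y)
    a = np.eye(3)[labels]                  # one-hot ground truth, shape (M, N, 3)
    m, n = labels.shape
    return -(np.asarray(weights) * a * np.log(p)).sum() / (m * n)

# Toy example: uninformative network output (all class probabilities equal to 1/3).
y = np.zeros((2, 2, 3))
labels = np.array([[0, 1],
                   [2, 1]])
loss = segmentation_loss(y, labels)        # (1 + 2 + 1 + 2) * log(3) / 4
```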

A visual representation of the network architecture can be seen in Fig. 5b. Several examples of direct comparisons between the network’s output at the single cell level and the corresponding field of view imaged by the clinical benchtop microscope can be seen in Supplementary Fig. 1a.

This network was trained for 80,000 iterations (274 epochs) using a batch size of 20. The training dataset for this network was made up of 2660 sickle cell image patches (each 128 × 128 pixels) from a single blood smear slide, each one containing a unique labeled sickle cell. An additional 3177 image patches (each 128 × 128 pixels) coming from 15 unique slides containing solely normal cells were also used. Separate from our blind testing image dataset, which involved 96 unique patients, 250 labeled 128 × 128-pixel sickle cell image patches and two 1500 × 1500-pixel images from healthy image slides were used as the validation dataset for the network training phase. The classification algorithm was validated using these images alongside five unique fields-of-view from ten additional blood smear slides of healthy patients. The network was trained using the Adam optimizer with a learning rate of 1 × 10⁻⁵.

### Classification of blood smear slides

Once the images have been segmented by the second neural network, the number of total cells and sickle cells must be extracted. The algorithm first uses a threshold to determine which pixels are marked as cells. Areas where the sum of the sickle cell and normal cell probabilities is above 0.8 are considered to be part of a red blood cell, while areas below this threshold are considered as background regions. Connected areas which contain more than 100 pixels above the 0.8 threshold are then counted to determine the total number of cells. Sickle cells are counted using a similar methodology: connected areas where there are over 100 pixels above a sickle cell probability threshold of 0.15 are counted as sickle cells. This threshold is set to be low since significantly more healthy red blood cells than sickle cells are used to train the network. A slide is classified as being positive for sickle cell disease when the percentage of sickle cells among all the inspected cells (sickle and normal red blood cells) over a total field-of-view of ~1.25 mm² is above 0.5%. The 0.5% threshold was chosen using the validation image dataset, i.e., it was based on the network’s performance in classifying the ten healthy validation slides to account for false positives and the occurrence of sickle-shaped cells in normal blood smears. Several examples of direct comparisons between the network’s output and the ground truth labels for blindly tested regions of the labeled slides are shown in Supplementary Fig. 1b.
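The counting and thresholding rules above can be summarized in a short sketch. It assumes 4-connectivity for the connected areas (the connectivity is not stated in the text) and operates on a single pair of probability maps; in practice the counts from all fields-of-view of a slide would be summed before applying the 0.5% rule. The function names `count_regions` and `classify_slide` are hypothetical.

```python
import numpy as np
from collections import deque

def count_regions(mask, min_pixels=100):
    """Count 4-connected regions of True pixels containing more than min_pixels pixels."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    rows, cols = mask.shape
    count = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0, c0] or seen[r0, c0]:
                continue
            size, queue = 0, deque([(r0, c0)])
            seen[r0, c0] = True
            while queue:                      # breadth-first flood fill
                r, c = queue.popleft()
                size += 1
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= rr < rows and 0 <= cc < cols
                            and mask[rr, cc] and not seen[rr, cc]):
                        seen[rr, cc] = True
                        queue.append((rr, cc))
            if size > min_pixels:
                count += 1
    return count

def classify_slide(p_normal, p_sickle, cell_thr=0.8, sickle_thr=0.15,
                   min_pixels=100, positive_fraction=0.005):
    """Return (is_positive, total_cells, sickle_cells) from the class probability maps."""
    total = count_regions((p_normal + p_sickle) > cell_thr, min_pixels)
    sickle = count_regions(p_sickle > sickle_thr, min_pixels)
    fraction = sickle / total if total else 0.0
    return fraction > positive_fraction, total, sickle

# Toy maps: one normal cell, one sickle cell, and a 25-pixel fragment that is ignored.
p_normal = np.zeros((40, 40))
p_sickle = np.zeros((40, 40))
p_normal[2:13, 2:13] = 0.95
p_sickle[20:31, 20:31] = 0.90
p_sickle[34:39, 2:7] = 0.90
result = classify_slide(p_normal, p_sickle)
```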

### Structural similarity calculations

The SSIM calculations were performed using only the brightness (Y) component of the YCbCr color space, as we expect the intensity contrast component to remain similar, while the chroma components (Cb, Cr) depend on other factors, including variability in the slide’s staining. The color difference between the smartphone microscope images and the benchtop microscope images is also significant. The smartphone microscope images appear with a blue background and should not be compared directly against the benchtop microscope images in the RGB color space. Therefore, using a color space where the brightness component can be extracted separately is necessary.

The calculations were performed upon eight unique fields-of-view from the same slides which were used to train the enhancement network. SSIM is calculated using the equation:

$${\mathrm{SSIM}}\left( {x,z} \right) = \frac{{\left( {2\mu _{\mathrm{x}}\mu _{\mathrm{z}} + c_1} \right)\left( {2\sigma _{{\mathrm{x,z}}} + c_2} \right)}}{{(\mu _{\mathrm{x}}^2 + \mu _{\mathrm{z}}^2 + c_1)(\sigma _{\mathrm{x}}^2 + \sigma _{\mathrm{z}}^2 + c_2)}},$$
(7)

where x and z represent the two images being compared, as above. $\mu_{\mathrm{x}}$ and $\mu_{\mathrm{z}}$ represent the average values of x and z, respectively, $\sigma_{\mathrm{x}}^2$ and $\sigma_{\mathrm{z}}^2$ are the variances of x and z, and $\sigma_{\mathrm{x,z}}$ is the covariance of x and z. $c_1$ and $c_2$ are small constants that stabilize the division when the denominator is close to zero.
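Equation 7 can be evaluated on the brightness channel as in the sketch below. Several choices here are assumptions made for illustration: the luma weights are the ITU-R BT.601 values, the statistics are computed over the whole image rather than over local windows, and the constants follow the common convention c1 = (0.01 L)² and c2 = (0.03 L)² for a dynamic range L. None of these are confirmed as the settings used in this work, and the function names are hypothetical.

```python
import numpy as np

def rgb_to_y(rgb):
    """Brightness (Y) channel using BT.601 luma weights (assumed)."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def global_ssim(x, z, dynamic_range=255.0):
    """SSIM of Eq. 7 computed from whole-image statistics."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_z = x.mean(), z.mean()
    var_x, var_z = x.var(), z.var()
    cov_xz = ((x - mu_x) * (z - mu_z)).mean()
    numerator = (2 * mu_x * mu_z + c1) * (2 * cov_xz + c2)
    denominator = (mu_x ** 2 + mu_z ** 2 + c1) * (var_x + var_z + c2)
    return numerator / denominator

# Toy example: an image compared with itself and with its inverted copy.
rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(32, 32, 3))
y = rgb_to_y(img)
ssim_same = global_ssim(y, y)
ssim_inverted = global_ssim(y, 255.0 - y)
```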

### Monte Carlo simulation details

The Monte Carlo simulations reported in Fig. 4 demonstrate how the accuracy of the presented technique changes as a function of the number of cells analyzed by our neural networks; these simulations were implemented by beginning with the full cell count from the five fields-of-view tested for each patient slide. This total cell count was reduced by randomly eliminating some of the cells to evaluate the impact of the number of cells analyzed on our accuracy. As the cells are relatively monodisperse, this random removal of red blood cells was used as an approximation of a reduction of the inspected blood smear area per patient. The results of 1000 simulations were averaged since the accuracy can fluctuate significantly, particularly at low numbers of cells. The total number of cells within the five fields-of-view that we used for SCD diagnosis varies from 4105 to 13,989.
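The subsampling procedure can be sketched as follows. Randomly keeping a subset of cells without replacement means that the number of retained sickle cells follows a hypergeometric distribution, which is what the code draws from. The per-slide counts in the example are invented purely for illustration and are not data from this study; the function name `simulated_accuracy` is hypothetical.

```python
import numpy as np

def simulated_accuracy(slides, n_cells, n_runs=1000, threshold=0.005, seed=0):
    """Average slide-level accuracy when only n_cells randomly kept cells are analyzed.

    slides: list of (total_cells, sickle_cells, is_scd_positive) tuples.
    """
    rng = np.random.default_rng(seed)
    correct = 0
    for total, sickle, truth in slides:
        keep = min(n_cells, total)
        # Sickle cells remaining after randomly keeping `keep` of the `total` cells.
        kept_sickle = rng.hypergeometric(sickle, total - sickle, keep, size=n_runs)
        predicted = (kept_sickle / keep) > threshold
        correct += int(np.sum(predicted == truth))
    return correct / (n_runs * len(slides))

# Hypothetical counts for one SCD-positive and one healthy slide.
toy_slides = [(8000, 400, True), (6000, 6, False)]
acc_many = simulated_accuracy(toy_slides, n_cells=5000)
acc_few = simulated_accuracy(toy_slides, n_cells=50)
```

With only a few tens of cells, a single sickle-shaped cell on a healthy slide is enough to exceed the 0.5% threshold, which is why the accuracy fluctuates strongly at low cell counts.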

### Implementation details

The neural networks were trained using Python 3.6.2 and the TensorFlow package version 1.8.0. The networks were trained and test images were processed on a desktop computer running Windows 10 using an Intel Core i9-7900X CPU, 64 GB of RAM and one of the computer’s two GPUs (NVIDIA GTX 1080 Ti). The enhancement network infers each field of view in 0.73 s, while the classification network inference takes 0.64 s per field of view, for a total of 6.85 s to process the five fields-of-view that make up the entire 1.25 mm² area of the blood smear. For both of the neural networks, the training image data were augmented by using random rotations and flipping. The hyperparameters and network architecture were chosen specifically for the datasets used in this paper and adjusted through experimental tuning.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.