Parasitic egg recognition using convolution and attention network

Intestinal parasitic infections (IPIs) caused by protozoan and helminth parasites are among the most common infections in humans in low-and-middle-income countries. IPIs affect not only the health status of a country, but also its economy. Over the last decade, pattern recognition and image processing techniques have been developed to automatically identify parasitic eggs in microscopic images. Existing identification techniques still suffer from diagnosis errors and low sensitivity. Therefore, a more accurate and faster solution is still required to recognize parasitic eggs and classify them into several categories. The novel Chula-ParasiteEgg dataset of 11,000 microscopic images, proposed in ICIP2022, was used to train several methods, including convolutional neural network (CNN) based models and convolution and attention (CoAtNet) based models. The experiments show high recognition performance for the proposed CoAtNet tuned with microscopic images of parasitic eggs: it produced an average accuracy of 93% and an average F1 score of 93%. This finding opens the door to integrating the proposed solution into automated parasitological diagnosis.

Parasitic infections are among the main public health problems worldwide, especially in tropical and subtropical countries 1,2. According to the World Health Organization's Global Health Estimates in 2020, infectious and parasitic diseases are among Africa's leading causes of death 3. There are three main classes of parasites that can cause disease in humans: protozoa, helminths, and ectoparasites 2. Helminth infection has threatened over 800 million individuals throughout the world (CDC). Humans in developing regions, such as Sub-Saharan Africa, South America, and East Asia, are infected with medically significant nematodes, trematodes, and cestodes 3. While these diseases present a broad range of clinical manifestations, the early classification and elimination of pathogenic helminths offers significant socio-economic benefits.
Microscopy is one of the most commonly used conventional methods for classifying and diagnosing parasitic diseases, and it depends exclusively on medical technicians for examinations 4,5. However, microscopy-based parasite identification and quantification pose many issues 5. The procedure is challenging, time-consuming, labor-intensive, and error-prone, and it requires well-trained researchers for identification [5][6][7]. Because of the variances and ambiguities in the parasites' shape, density, and staining color, human experts find parasitic examination through microscopic photographs difficult 2,8. Furthermore, these conventional methods lack a data-sharing framework and historical records of diagnosis. Consequently, the development of an automated diagnostic system would be a major leap in assisting traditional diagnosis.
With technological advancements and the development of image processing techniques and computer vision, artificial intelligence has been successfully used in various applications, such as face recognition, natural language processing, and biomedical image analysis 2,[9][10][11], making automated diagnosis systems more attainable. Many studies have implemented systems to analyze the micrographs of the samples based on machine learning, e.g. support vector machine (SVM) 12 and artificial neural networks (ANN) 12,13. These traditional machine learning methods do not need complex structures, but they rely largely on hand-selected features. For this reason, fine-tuning the features in the feature extraction step takes a lot of time and effort. On the other hand, deep learning-based algorithms have become increasingly popular over the past decade as computer performance and the number of available image datasets have both improved 14,15. Deep learning has demonstrated remarkable efficiency in a variety of disciplines, including text recognition, computer-assisted diagnosis, facial identification, and drug development 16. Deep learning, particularly the Convolutional Neural Network (CNN), has stimulated new research on parasite egg detection and classification because of its promising performance and speed in object recognition 4,5,8,13,17,18. By learning important features automatically, it demonstrates exceptional performance when abundant data are available.

Table 1. Various classification methods with their drawbacks and advantages.
Method | Pros | Cons
Traditional human examination [5][6][7] | Accurate with human experts | Challenging; time-consuming; labor-intensive; error-prone; requires well-trained researchers
Machine learning: support vector machine 12 and artificial neural networks 12,13 | These methods do not need complex structures | Fine-tuning the features will take a lot of time and effort
Deep learning-based algorithms 14,15 17 | Good detection and classification accuracy | Recognizing only three categories
A deep convolutional selective autoencoder architecture 26 | Identification and classification of microscopic nematode eggs | A high computational cost
CNN-based transfer learning model: two networks, AlexNet and ResNet50, were examined, and a patch-based technique with a sliding window was utilized to search for the egg location 5 | Improves the effectiveness of automated parasite classification in low resolution | More time required to slide the window across high-resolution images
EfficientDet with EfficientNet-v2 backbone 33 | High accuracy with localization capability | Complex architecture of two backbones used to extract features and fuse the decision produced in the output layers
EfficientNet-B7 with layers frozen and support vector machine (SVM) tuned 33 | Less training time because no layers are fine-tuned | Trained with a dataset that is not related to the task studied

In a comparative study 6, an image processing-SVM model and VGG-16, a pre-trained deep neural network, were used for parasite classification, and VGG-16 outperformed the SVM. Currently, convolutional neural networks (CNN) are the models most frequently used in vision tasks. Butploy et al. 13 proposed a 3-layer CNN for A. lumbricoides egg classification that reached 93% accuracy on three classes. Since deep learning requires extensive data for training, the challenge of limited datasets can be addressed through transfer learning, which takes a neural network pre-trained on a large dataset and applies its architecture and learned weights to a new but similar task. Examples of transfer learning for parasite classification include VGG-16 6, GoogleNet, AlexNet, and ResNet50 5,28. Akintayo et al. 26 designed a novel end-to-end Convolutional Selective Autoencoder (CSAE) to identify the eggs of a parasitic worm, the soybean cyst nematode (SCN). CSAE has two parts: the encoder and the decoder. The architecture was developed for this rare-object detection class of problems. The 'selectivity' of CSAE gives it remarkable detection speed and accuracy in identifying SCN eggs across a wide range of samples (e.g., soil, debris), as it is trained to reconstruct only an 'egg' pattern while masking or rejecting all other patterns in the input image. Simon et al. 29 introduced a new architecture consisting of a shallow CNN appended with a single recurrent layer. Results showed better performance in the proposed models than in the state-of-the-art models. Wiem et al. 18 proposed a fusion decision method to identify parasitic eggs, in which the outputs of the discrete transform were fed to a CNN as input.
Suwannaphong et al. 5 presented a CNN-based transfer learning model to improve the effectiveness of automated parasite classification in low-resolution images. Two networks, AlexNet and ResNet50, were examined, and a patch-based technique with a sliding window was utilized to search for the egg location. Huo et al. 30 proposed an automatic recognition algorithm based on YOLO for parasitic eggs and achieved an average accuracy of 99.4%. Górriz et al. 31 trained a U-Net model for the classification of Leishmania parasites into promastigotes, amastigotes, and adhered parasites. Najgebauer et al. 32 proposed a technique that uses a fully convolutional network (FCN) to analyze the complete sample space and assign a class to each pixel in the image. The network was trained to identify parasite eggs and differentiate them from adjacent or overlapping impurities. Roder et al. 7 implemented Deep Belief Networks (DBN) for automatic classification of intestinal parasites, namely eggs, larvae, and protozoa. Despite the impurities in the fecal samples and the unbalanced classes, the approach achieved promising results. Lastly, Nkamgang et al. 1 trained a neuro-fuzzy classifier with a speeded-up scaled conjugate gradient algorithm for recognition and classification of twenty human intestinal parasites, achieving a recognition rate of 100%.
The objective of this paper is to propose a solution to classify parasitic eggs. This solution should meet specific requirements, including the ability to recognize various categories, high classification accuracy, low inference time, and memory efficiency. Table 1 summarizes the methods used in the literature for parasitic egg classification. The existing methods show several drawbacks, such as complex structure, a limited number of categories, high training time, low accuracy, high computational cost and time, and uninformative features. In this work, we found that CoAtNet0 was able to address these drawbacks for the parasitic egg classification task with a simpler structure, higher accuracy, lower computational cost and time, more informative features, and the capability to recognize various egg categories.
This paper highlights an interesting challenge for the research community. It contributes to the body of knowledge as follows:
1. Various convolutional neural networks, such as DenseNet121 34 and EfficientNet-B4 35, were explored. These pre-trained CNNs were fine-tuned with the parasitic eggs dataset for classification purposes.
2. Vision Transformer 36, which uses a self-attention mechanism, was demonstrated. It was fine-tuned with the parasitic eggs dataset for classification purposes.
3. A CoAtNet (Convolution and Attention Network) 37 was proposed for recognizing and classifying parasitic eggs in microscopic images. The pre-trained CoAtNet was fine-tuned with the parasitic eggs dataset for classification purposes.
4. This work presents a novel attempt to utilize the concept of combining convolution and attention in medical applications such as parasitic egg recognition.
5. This work targets a novel dataset that was proposed in the ICIP2022 challenge to recognize eleven types of parasitic eggs acquired under various complex conditions such as different illuminations and resolutions.
This paper is organized as follows: "Materials and methods" describes the parasitic eggs dataset, discusses the convolutional neural networks and the self-attention vision transformer used as baselines, and presents the convolution and attention network. "Experiments and results" describes the experiments conducted, analyzes the results in detail, and evaluates and compares the deep learning models. Finally, "Conclusion and future work" summarizes the work presented in this paper and gives readers a glimpse into potential future improvements.

Materials and methods
This section describes the dataset used in this work for parasitic egg classification. Additionally, it describes the CNN-based models utilized as baseline methods. Furthermore, it presents the proposed convolution and attention network that is compared with the baselines.

Dataset overview.
A new dataset called Chula-ParasiteEgg-11 was proposed for an ICIP 2022 competition [38][39][40]. The dataset includes 11 categories of parasite eggs from faecal smears, with an average diameter between 15 and 100 μm, all labeled with bounding boxes. The categories are: Ascaris lumbricoides, Capillaria philippinensis, Enterobius vermicularis, Fasciolopsis buski, Hookworm egg, Hymenolepis diminuta, Hymenolepis nana, Opisthorchis viverrini, Paragonimus spp., Taenia spp. egg, and Trichuris trichiura [38][39][40]. With 1000 micrographs for the training set and 250 micrographs for the testing set for each category, the dataset is the largest collection of its kind. The micrographs were collected with multiple devices: a Canon EOS 70D camera body with Olympus BX53 microscopes, a DS-Fi2 Nikon camera body with Nikon Eclipse Ni microscopes, a Samsung Galaxy J7 Prime phone, and iPhone 12 and 13 phones with the 10× eyepiece lenses of either the Nikon Eclipse Ni or the Olympus BX53. As a result, the resolution, illumination, and setting conditions vary from image to image, and this variation could make detection more reliable [38][39][40]. Figure 1 shows several samples. Table 2 shows the number of samples for each category of parasitic eggs.
Parasite eggs vary between 20 and 80 μm in size and are usually visible only under a microscope. Detecting eggs in microscopic images requires long visual analysis by a human expert. This task is very

Figure 1. Several samples of microscopic images under various conditions such as illumination, resolution, egg size, and blurring [38][39][40].

Table 2. Number of samples for each category of parasitic eggs.
Trichuris trichiura | 200
Total | 2200

EfficientNet CNN.
EfficientNet is a family of convolutional neural networks (CNN) designed for image recognition and built on a newer approach to scaling up models. Earlier approaches typically scaled up only one dimension at a time: depth (more layers), width, or input image resolution. Tan and Le 35 introduced a method that scales all three uniformly in a fixed proportion. This resolved several limitations found in older models and improved accuracy on ImageNet 41. EfficientNet consists of models B0-B7, starting with the baseline (B0) followed by its successively scaled-up versions. Scaling any single dimension can improve model accuracy, but each has its limits. It is presumed that more depth, that is, more layers, would increase performance. However, this requires more training data 42 and computational power. Additionally, depth is not the only means of improving CNNs; Ba and Caruana 43 found that shallow neural networks can be trained to perform similarly to elaborate deeper CNNs. In a test assessing the scalability of each dimension, the authors 35 observed that wider networks had difficulty capturing higher-level features, while at very high resolutions the accuracy gain began to stagnate. Overall, the experimental results indicate that the accuracy gain diminishes as models continue to grow when only one of the three dimensions is scaled.
In creating an improved model, it was important to keep all dimensions balanced when scaling a CNN. This was made possible by a compound scaling method in which a compound coefficient φ, chosen by the user, uniformly scales the dimensions. In classifying ImageNet, EfficientNet achieved accuracy scores equivalent to other state-of-the-art models such as ResNet, Inception-v3, and DenseNet while using far fewer parameters and fewer FLOPs (floating point operations). Compared with other CNNs, EfficientNets also run much faster. When used for transfer learning on other datasets, including CIFAR-100 and Food-101, EfficientNets also exhibited remarkable results. Due to its success on ImageNet and other extensive datasets, EfficientNet has been used in applications across various domains. In medical imaging, EfficientNets were found to be superior in several magnetic resonance imaging and X-ray classification tasks. In chest abnormality identification covering pneumonia, COVID-19, and normal lungs, EfficientNetB0 performed better than VGG16 and InceptionV3 44. For MRI brain tumor recognition, a fine-tuned EfficientNetB0 also achieved the highest performance 45. In another study, EfficientNetB2 was found to be the most suitable of the EfficientNets for breast cancer histopathology classification 46. EfficientNets also benefit from ensemble methods; in a study on malaria diagnosis, infected cell micrographs were successfully classified 47. These findings, along with transfer learning results on large benchmark datasets, support the potential of using EfficientNets for more computer vision tasks.
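The compound scaling idea can be illustrated with a short sketch. The constants below (α = 1.2, β = 1.1, γ = 1.15) are the values reported by Tan and Le for the B0 baseline and are not stated in this text; the function name and the chosen values of φ are illustrative assumptions, and the released B1-B7 models do not necessarily follow these exact multipliers.

```python
# Compound scaling sketch: a single coefficient phi scales depth, width,
# and input resolution together instead of scaling one dimension alone.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # constants reported by Tan and Le (assumed here)


def compound_scale(phi: float) -> dict:
    """Return the multipliers applied to the baseline network for a given phi."""
    depth = ALPHA ** phi        # more layers
    width = BETA ** phi         # more channels per layer
    resolution = GAMMA ** phi   # larger input images
    # FLOPs grow roughly with depth * width^2 * resolution^2,
    # so each increment of phi approximately doubles the cost.
    flops = depth * width ** 2 * resolution ** 2
    return {"depth": depth, "width": width,
            "resolution": resolution, "flops": flops}


if __name__ == "__main__":
    for phi in (0, 1, 2, 3):
        scaled = compound_scale(phi)
        print(phi, {name: round(value, 2) for name, value in scaled.items()})
```

Because α·β²·γ² is close to 2, the estimated cost roughly doubles with every unit increase of φ, which is what makes the trade-off between accuracy and computation easy to reason about.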

DenseNet CNN.
In the last few decades, deep CNN architectures have achieved many breakthroughs in image classification tasks 48,49. However, as a CNN gets deeper, the input information (or the gradient) passes through many layers and may 'wash out' or vanish by the time it reaches the end (or the beginning) of the network. While the designs that address this problem vary in network topology and training method, they have one thing in common: they all create short paths from earlier layers to later layers. Accordingly, Huang et al. 34 developed a new CNN architecture referred to as the Dense Convolutional Network (DenseNet), which aims to increase the depth of deep learning networks while also improving training efficiency by using shorter connections between the layers.
In the DenseNet architecture, each layer is connected to every other layer to ensure maximum information flow between the layers of the network 34. To maintain the feed-forward structure of the architecture, each layer receives extra inputs from all earlier layers and passes its own feature maps to all later layers. Unlike ResNets, DenseNet concatenates the features rather than combining them through summation 50. Thus, an L-layer DenseNet has L(L+1)/2 direct connections, instead of just the L connections of traditional architectures. To allow both down-sampling and feature concatenation, the size of the feature maps must be uniform wherever they are concatenated. This was made possible by dividing the network into multiple densely connected dense blocks within which the feature map size remains the same. The transition layers between blocks perform the convolution and pooling operations, while feature concatenation takes place inside the dense blocks. In contrast to other network topologies, DenseNet can have very narrow layers, for example k = 12, where the hyperparameter k is referred to as the growth rate of the network. Although each layer produces only k feature maps, the concatenated input to later layers can be quite large and computationally demanding. To increase efficiency, DenseNet utilizes bottleneck layers. Each architecture is made up of four dense blocks with varying numbers of layers. For instance, DenseNet-121 has (6, 12, 24, 16) layers in its four dense blocks, while DenseNet-169 has (6, 12, 32, 32). A classification layer follows the fourth dense block, accepting the feature maps from all network layers to carry out the classification. The ultimate goal of DenseNet is to utilize features from every layer to improve model performance and robustness on a standard dataset while requiring minimal computation and a much smaller model size. However, it should also be noted that the increasingly massive number of layers can result in explosive growth during training 51.
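The counting arguments above can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, `k0` (the number of channels entering a dense block) is an illustrative parameter, and the depth formula assumes the standard layout in which every dense layer is a bottleneck pair of convolutions (1 × 1 followed by 3 × 3), plus one initial convolution, three transition convolutions, and one final classification layer.

```python
def direct_connections(num_layers: int) -> int:
    """Number of direct connections in a densely connected block of L layers."""
    return num_layers * (num_layers + 1) // 2


def input_channels(layer_index: int, k0: int, growth_rate: int) -> int:
    """Channels seen by the l-th layer (1-based) of a dense block.

    Each earlier layer contributes `growth_rate` feature maps, which are
    concatenated with the k0 channels entering the block.
    """
    return k0 + growth_rate * (layer_index - 1)


def densenet_depth(block_config: tuple) -> int:
    """Total weighted layers, assuming two convolutions per dense layer
    plus 5 extra layers (stem conv, 3 transitions, classifier)."""
    return 2 * sum(block_config) + 5


if __name__ == "__main__":
    print(direct_connections(5))               # connections in a 5-layer block
    print(input_channels(6, k0=64, growth_rate=32))
    print(densenet_depth((6, 12, 24, 16)))     # DenseNet-121 configuration
    print(densenet_depth((6, 12, 32, 32)))     # DenseNet-169 configuration
```

The depth formula recovers the names of the two configurations mentioned in the text, which is a quick sanity check that the block sizes were transcribed correctly.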
Due to its dense connections between layers and its improved model performance, DenseNet has been adopted in applications across a variety of fields. In medical image analysis, DenseNet was the first to successfully perform anatomical segmentation of the whole brain using MR images 52. In metastatic cancer image classification, DenseNet achieved superior performance over state-of-the-art approaches 53.
DenseNet has also been used to identify COVID-19 patients from CT images 54 and to classify COVID-19 cases in medical imaging 55. Furthermore, a sparsely connected DenseNet was used for malaria parasite detection 56.

Vision transformer.
A vision transformer is a deep learning model for image classification introduced by Dosovitskiy et al. 36. In this paper, each parasitic egg image was split into a sequence of patches, treated like a sequence of words, and fed to the transformer encoder. From the original image, $N = (H \times W)/P^2$ patches were extracted with a fixed patch size $(P, P)$, where $P = 16$, $W$ is the image width, $H$ is the image height, and $N$ is the number of patches. The extracted patches were flattened so that each patch $x_p \in \mathbb{R}^{P^2 \cdot C}$, where $C$ is the number of channels. As a result, the 2D image was converted into a sequence of patches $x \in \mathbb{R}^{N \times (P^2 \cdot C)}$. Each patch in the sequence $x$ was mapped to a latent vector with hidden size $D = 768$. The transformer encoder consists of $L$ blocks, each containing alternating layers of multi-head self-attention (MSA) and multilayer perceptron (MLP). Layer normalization (LN) was applied before every block, while residual connections were applied after every block.
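The patch extraction step can be sketched in plain Python. The function names are hypothetical, the image is represented as nested lists (height × width × channels) purely for illustration, and the 224 × 224 input size in the usage example is an assumption rather than the size reported for the vision transformer here.

```python
def num_patches(height: int, width: int, patch: int = 16) -> int:
    """N = (H * W) / P^2, valid when both sides are multiples of P."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return (height * width) // (patch * patch)


def patchify(image, patch: int):
    """Split an H x W x C nested-list image into N flattened patches.

    Each returned patch is a flat list of length P * P * C, so the result
    has shape N x (P^2 * C), matching the sequence fed to the encoder.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [value
                    for row in image[top:top + patch]
                    for pixel in row[left:left + patch]
                    for value in pixel]
            patches.append(flat)
    return patches


if __name__ == "__main__":
    print(num_patches(224, 224))   # patches for an assumed 224 x 224 input
    tiny = [[[r, c, 0] for c in range(4)] for r in range(4)]  # 4 x 4 x 3 toy image
    seq = patchify(tiny, patch=2)
    print(len(seq), len(seq[0]))   # N patches, each of length P*P*C
```

With $P = 16$ and $C = 3$, each flattened patch has $16 \cdot 16 \cdot 3 = 768$ values, which happens to equal the hidden size $D$; the linear projection to $D$ is still a learned layer and is not shown in this sketch.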
Convolution and attention.
State-of-the-art (SOTA) Convolutional Neural Networks (ConvNets) (e.g., ResNet50 57) have been the dominant deep architecture for computer vision tasks (e.g., object detection, semantic segmentation, image classification) due to their proper inductive bias, efficiency, and generalization. Vision Transformer (ViT) and its variants 58 were explored on the benchmark dataset ImageNet-1k 41 and yielded promising results, but the performance of ViT still falls behind ConvNets: its higher model capacity pays off only at scale, so it needs a very large-scale dataset to achieve results comparable to SOTA ConvNets. Recently, CoAtNet 37 was explored on ImageNet and outperformed ViT and its variants. Additionally, its performance nearly matched the SOTA ConvNets (e.g., EfficientNet-V2 59). Several techniques are embedded in CoAtNet, including self-attention and transformers, which have been widely adopted for natural language processing and recently for different vision tasks. Additionally, the relative attention (rel-attention) in CoAtNet is of the input-independent kind, which is computationally cheaper than the input-dependent version. Combining convolution and self-attention for vision recognition usually enhances accuracy at extra computational cost. CoAtNet with relative attention, on the other hand, mixes depthwise convolution and content-based attention in a single computation unit at lower cost 37.
The MBConv block employs depthwise convolution 37 to capture the spatial interaction because it can be effectively merged into attention layers with simple relative attention. Convolution depends on a fixed kernel to collect information from a local receptive field:

$$y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j \quad \text{(depthwise convolution)} \qquad (1)$$

where $x_i, y_i \in \mathbb{R}^D$ are the input and output at position $i$ respectively, and $\mathcal{L}(i)$ denotes a local neighborhood of $i$, e.g., a 3 × 3 grid centered at $i$ in image processing.
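As a minimal sketch of this local, fixed-kernel aggregation, the following plain-Python function applies a depthwise convolution to a one-dimensional sequence. The one-dimensional setting, the zero padding at the borders, and the function name are simplifying assumptions made for illustration; the networks discussed here operate on two-dimensional feature maps.

```python
def depthwise_conv1d(x, kernel):
    """Depthwise convolution over a 1-D sequence.

    x:      list of positions, each a list of D channel values.
    kernel: dict mapping a relative offset (i - j) to a list of D
            per-channel weights; the offsets define the neighborhood L(i).
    Each channel is filtered independently (element-wise product), and
    positions outside the sequence are treated as zero.
    """
    length, channels = len(x), len(x[0])
    y = []
    for i in range(length):
        out = [0.0] * channels
        for offset, weights in kernel.items():
            j = i - offset                      # neighbor position
            if 0 <= j < length:                 # zero padding at the borders
                for c in range(channels):
                    out[c] += weights[c] * x[j][c]
        y.append(out)
    return y


if __name__ == "__main__":
    x = [[1, 10], [2, 20], [3, 30]]
    # Channel 0 sums its three neighbors; channel 1 is an identity filter.
    kernel = {-1: [1, 0], 0: [1, 1], 1: [1, 0]}
    print(depthwise_conv1d(x, kernel))
```

The weights depend only on the relative offset i - j and never on the input values, which is the property that distinguishes convolution from the content-based weighting of self-attention.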
Self-attention allows the receptive field to cover all spatial locations and computes the weights based on the re-normalized pairwise similarity between the pair $(x_i, x_j)$ 37:

$$y_i = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^\top x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k\right)}\, x_j \quad \text{(self-attention)} \qquad (2)$$

where $\mathcal{G}$ denotes the global spatial space.

Recently, CoAtNet has been used in numerous medical applications (e.g., medical image segmentation, image classification). Classification of skin cancer types using CoAtNet was demonstrated 60. Additionally, brain tumor image segmentation utilizing CoAtNet was proposed 61. Furthermore, CoAtNet was compared with EfficientNet-V2 and ResNeXt50, which are SOTA ConvNets, and was found to outperform them for bone marrow cell classification 62.
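The attention weighting, and the way a relative bias can be folded into it, can be sketched as follows. This is an illustrative one-dimensional version in plain Python; the function names are hypothetical, and adding an input-independent bias $w_{i-j}$ to the similarity before the softmax follows the formulation in the CoAtNet paper rather than anything stated explicitly in this text.

```python
import math


def attention_weights(x, i, rel_bias=None):
    """Softmax weights of position i over all positions j.

    x:        list of positions, each a list of D values.
    rel_bias: optional dict mapping offset (i - j) to a scalar bias that
              does not depend on the input (relative attention).
    """
    scores = []
    for j, xj in enumerate(x):
        score = sum(a * b for a, b in zip(x[i], xj))   # x_i^T x_j
        if rel_bias is not None:
            score += rel_bias.get(i - j, 0.0)          # + w_{i-j}
        scores.append(score)
    peak = max(scores)                                  # numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(x, rel_bias=None):
    """Apply (relative) self-attention to every position of the sequence."""
    channels = len(x[0])
    output = []
    for i in range(len(x)):
        weights = attention_weights(x, i, rel_bias)
        output.append([sum(weights[j] * x[j][c] for j in range(len(x)))
                       for c in range(channels)])
    return output


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(attention_weights(x, 0))               # content-based weights only
    print(attention_weights(x, 0, {0: 50.0}))    # a strong bias toward offset 0
    print(attend(x))
```

With the bias set to zero this reduces to plain self-attention, while a bias that dominates the content term makes the unit behave like a fixed local filter, which gives some intuition for how the two mechanisms are blended.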

Experiments and results
This section describes the experiments conducted for parasitic egg recognition in microscopic images. The setup of each experiment and model implementation is described. Additionally, performance results such as accuracy, recall, precision, and F1 score are discussed. Furthermore, the CNN-based models, the vision transformer, and CoAtNet are compared.

Experimental setup.
The experiments were conducted by training the CNN-based models, the self-attention vision transformer, and the CoAtNet-based model using the TensorFlow framework on one or two NVIDIA Tesla V100 GPUs. Hyperparameters such as batch size, number of epochs, optimizer, and input size were tuned several times to find the values that produce the best accuracy on the validation data. Different hyperparameters were found for each model, including EfficientNet, DenseNet, vision transformer, and CoAtNet. The microscopic images were resized to 380 × 380 pixels for EfficientNet, as in the original paper 35. For DenseNet and CoAtNet, an input size of 224 × 224 pixels was tried first, followed by 384 × 384 pixels. Comparing the resulting accuracies showed that increasing the input size can improve classification accuracy remarkably.
www.nature.com/scientificreports/
Hyperparameters such as the number of dense layers and the number of nodes in each dense layer were tuned several times to find the values that produce the best accuracy on the validation data. The optimal number of dense layers and the number of nodes in each are shown in Tables 3, 5, 7 and 9 for each model. Increasing the number of dense layers and their nodes increases the computational cost and may result in overfitting, and thus dropout layers were added to avoid this problem. Additionally, the images were normalized using the mean and standard deviation of ImageNet. The details of the architectures and hyperparameters for each model, including EfficientNet, DenseNet, vision transformer, and CoAtNet, are given in Tables 3, 4, 5, 6, 7, 8, 9 and 10. The labeled microscopic dataset includes 11,000 images. The dataset was divided into training with 6600 images (60%), validation with 2200 images (20%), and testing with 2200 images (20%). Each of these subsets should contain an equal number of images from each of the eleven categories to avoid an imbalanced distribution.
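The class-balanced 60/20/20 split and the ImageNet normalization described above could be implemented along the following lines. This is a hedged sketch rather than the authors' pipeline: the function names and the fixed random seed are assumptions, and the mean and standard deviation are the commonly used ImageNet channel statistics for images scaled to [0, 1], which are not listed in this text.

```python
import random
from collections import defaultdict

# Commonly used ImageNet channel statistics (RGB, pixel values in [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/val/test with equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - mean) / std
                 for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


if __name__ == "__main__":
    labels = [category for category in range(11) for _ in range(1000)]
    train, val, test = stratified_split(labels)
    print(len(train), len(val), len(test))
    print(normalize_pixel((128, 128, 128)))
```

Splitting per class, instead of shuffling the whole dataset once, guarantees that every subset holds the same number of images for each of the eleven categories.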
Results discussion.
This section discusses the results of the four experiments carried out to classify parasitic eggs in microscopic images into 11 classes. The performance metrics, including accuracy, recall, precision, and F1 scores, were calculated in each experiment. DenseNet121, EfficientNet-B4, vision transformer (vit-16 base), and CoAtNet0 were compared to evaluate and highlight the capability of each method in the parasitic egg recognition task.
The first experiment evaluated the performance of the EfficientNet-B4 CNN. Version B4 was selected because it balances high accuracy against high speed of training and inference. Different numbers of tuned layers were evaluated to find the configuration that produced the highest performance in terms of accuracy, recall, precision, and F1 scores. Tables 11, 12, and 13 show the classification metrics for each category in each of three scenarios: tuning the last 5 layers, tuning the last 20 layers, and tuning the last 40 layers, respectively. The recall, precision, and F1 scores were calculated for each of the eleven classes as shown in Fig. 4. Then, the averages of recall, precision, and F1 scores were found, together with the average accuracy. Tuning the last 40 layers of the EfficientNet-B4 CNN gave the highest results in terms of average accuracy (90%), average recall (90%), average precision (91%), and average F1 score (90%). On the other hand, tuning only the last 5 layers was not effective, and the metrics produced were low, with average accuracy (75%), average recall (75%), average precision (76%), and average F1 score (75%). EfficientNet was pre-trained on the ImageNet dataset, which includes 1000 classes that are different from the parasitic egg classes. The low metrics obtained when tuning only the last 5 layers indicate that the parameters of the EfficientNet-B4 CNN were not able to extract informative features or representations suitable for parasitic egg recognition. In other words, more layers needed to be tuned (the last-40-layers scenario) to learn better parameters and more informative features that can differentiate between the categories of parasitic eggs. Additionally, the results of tuning the last 40 layers show high F1 scores for recognizing Hookworm egg and Hymenolepis nana. On the other hand, the results show low F1 scores for the three classes Paragonimus spp., Hymenolepis diminuta, and Ascaris lumbricoides, which indicates the inability of EfficientNet-B4 to distinguish between these types of eggs.
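The "tune the last N layers" scenarios amount to marking only the final layers of a pre-trained network as trainable. The sketch below shows the selection logic with a hypothetical stand-in `Layer` class so that it runs without any deep learning framework; in Keras, layer objects expose a `trainable` attribute that can be set in the same way. The layer count of 100 is an arbitrary illustration, not the depth of EfficientNet-B4.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for a framework layer object."""
    name: str
    trainable: bool = True


def tune_last_layers(layers, n_trainable):
    """Freeze every layer except the last `n_trainable` ones.

    Using an explicit cutoff index avoids the slicing pitfall where
    layers[-0:] would select the whole list when n_trainable is 0.
    """
    cutoff = len(layers) - n_trainable
    for index, layer in enumerate(layers):
        layer.trainable = index >= cutoff
    return layers


if __name__ == "__main__":
    backbone = [Layer(f"layer_{i}") for i in range(100)]
    for n in (5, 20, 40):                      # the three scenarios in the text
        tune_last_layers(backbone, n)
        print(n, sum(layer.trainable for layer in backbone))
```

Frozen layers keep their ImageNet weights, so the fewer layers are tuned, the more the model relies on features that were learned for unrelated classes.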
The confusion matrix of the EfficientNet-B4 CNN after tuning the last 40 layers is shown in Fig. 5. The high values on the main diagonal indicate that the model recognizes the parasitic eggs with high accuracy. Class 4 (Hookworm egg) was recognized almost perfectly, with 199/200 correct predictions. On the other hand, class 8 (Paragonimus spp.) was misclassified far more often than the other classes, with 139/200 correct predictions. The majority of misclassified samples in class 8 were wrongly classified as class 5 (Hymenolepis diminuta), which indicates that EfficientNet-B4 extracted similar features from the microscopic images belonging to classes 5 and 8.

Figure 6 shows activation maps of various parasitic egg classes produced by EfficientNet-B4 before tuning (frozen layers) and after tuning 40 layers. The maps illustrate that the tuned EfficientNet is able to focus its attention on the objects (eggs) inside the microscopic images and ignore irrelevant content. In contrast, EfficientNet-B4 with frozen layers was unable to highlight the regions of the image that were relevant to the class of egg.

The second experiment evaluated the performance of the DenseNet CNN. Different versions of DenseNet (121, 169, and 201) were evaluated and compared. The DenseNet with 121 layers was selected because it balances high accuracy against high speed of training and inference. Different numbers of tuned layers were evaluated to find the configuration that produced the highest performance in terms of accuracy, recall, precision, and F1 scores. Table 14 shows the classification report for the scenario of tuning the last 20 layers. The recall, precision, and F1 scores were calculated for each of the eleven classes. Then, the averages of recall, precision, and F1 scores were found, together with the average accuracy. Tuning the last 20 layers of the DenseNet121 CNN gave high results in terms of average accuracy (86%), average recall (86%), average precision (88%), and average F1 score (86%). DenseNet121 was pre-trained on the ImageNet dataset, which includes 1000 classes that are different from the parasitic egg classes. Tuning the last 20 layers was found to extract informative features suitable for parasitic egg recognition and to learn better parameters that can differentiate between the categories of parasitic eggs. The results of tuning the last 20 layers showed high F1 scores for Hookworm egg and Fasciolopsis buski. On the other hand, the results showed a low F1 score for Capillaria philippinensis, which indicates the inability of DenseNet121 to distinguish this type of parasitic egg.
The confusion matrix of the DenseNet121 CNN after tuning the last 20 layers is shown in Fig. 7. The high values on the main diagonal indicate that the model recognizes the parasitic eggs with high accuracy. Classes 2 and 4 (Enterobius vermicularis and Hookworm egg) were recognized best, with 196/200 and 195/200 correct predictions, respectively. On the other hand, class 8 (Paragonimus spp.) was misclassified far more often than the other classes, with 145/200 correct predictions. The majority of misclassified samples in class 8 were wrongly classified as class 5 (Hymenolepis diminuta), which indicates that DenseNet121 extracted similar features from the microscopic images belonging to classes 5 and 8.
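The per-class metrics and the "most confused" class pairs discussed above can all be read off a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the function names are hypothetical, and the 3 × 3 matrix is a made-up toy example, not data from Figs. 5 or 7.

```python
def per_class_metrics(cm):
    """Return (precision, recall, F1) for each class of a square confusion matrix."""
    size = len(cm)
    metrics = []
    for c in range(size):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                              # row total minus diagonal
        fp = sum(cm[r][c] for r in range(size)) - tp      # column total minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of each metric across classes."""
    count = len(metrics)
    return tuple(sum(m[k] for m in metrics) / count for k in range(3))


def accuracy(cm):
    """Fraction of all samples that lie on the main diagonal."""
    correct = sum(cm[c][c] for c in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


def most_confused(cm, true_class):
    """Predicted class that absorbs most errors of `true_class`."""
    errors = [(count, predicted) for predicted, count in enumerate(cm[true_class])
              if predicted != true_class]
    return max(errors)[1]


if __name__ == "__main__":
    toy = [[199, 1, 0],      # toy numbers for illustration only
           [10, 180, 10],
           [0, 55, 145]]
    print(per_class_metrics(toy))
    print(macro_average(per_class_metrics(toy)), accuracy(toy))
    print(most_confused(toy, 2))
```

Because every class has the same number of test images in this study, the macro average weights all eleven classes equally, so a single poorly recognized class lowers the reported averages noticeably.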
Figure 8 shows activation maps of various parasitic egg classes produced by DenseNet121 before tuning (layers frozen) and after tuning 20 layers. The maps illustrate that the tuned DenseNet121 focuses its attention on the objects (eggs) inside the microscopic images and ignores irrelevant content. In contrast, DenseNet121 with frozen layers was unable to highlight the image regions relevant to the egg class.
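The text does not state which visualization method produced these activation maps. Purely as an illustration, the sketch below assumes a CAM-style map: a weighted sum of the final convolutional feature maps, with negative values clipped and the result rescaled to [0, 1]. The function name, the feature maps, and the class weights are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """CAM-style heat map (assumed method, not confirmed by the paper).

    feature_maps: list of C maps, each an H x W nested list.
    class_weights: list of C weights for the class being explained.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    heat = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                heat[y][x] += weight * fmap[y][x]
    # Keep only positive evidence, then rescale to [0, 1] for display.
    heat = [[max(0.0, v) for v in row] for row in heat]
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return heat


# Two hypothetical 2 x 2 feature maps and their class weights.
maps = [
    [[0.0, 1.0], [0.0, 2.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
weights = [1.0, -0.5]
cam = class_activation_map(maps, weights)
```

In this toy case the right-hand column receives all of the positive evidence, which is the kind of localized response the tuned networks show around the egg.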
The third experiment evaluated the performance of the vision transformer. Version 16 was selected because it has fewer parameters. The vision transformer was pre-trained on ImageNet-21k, a dataset of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset of 1 million images and 1k classes. These classes differ from the parasitic egg classes. Table 15 reports the recall, precision, and F1 score calculated for each of the eleven classes.
The fourth experiment evaluated the performance of CoAtNet. Version 0 was selected because it has fewer parameters, blocks, and channels, and can produce higher accuracy than CNN-only based models. CoAtNet was pre-trained on the ImageNet dataset, which includes 1000 classes that differ from the parasitic egg classes. Tables 16 and 17 report the recall, precision, and F1 score calculated for each of the eleven classes, together with their averages and the average accuracy. Two scenarios for using CoAtNet were considered, as shown in Fig. 9. The first scenario froze all layers in the backbone and tuned only the classification head layers. This scenario produced the worst performance in terms of accuracy, recall, precision, and F1 score, with 71% average accuracy and 71% average F1 score, as shown in Fig. 9. This result indicated that the pre-trained CoAtNet parameters could not, without tuning, extract informative features or representations suitable for parasitic egg recognition. In other words, more layers needed to be tuned to learn better parameters and more informative features that can differentiate between the various categories of parasitic eggs. Therefore, the second scenario tuned all layers in the CoAtNet backbone. This scenario gave the highest performance in terms of average accuracy (93%), average recall (93%), average precision (94%), and average F1 score (93%), as shown in Fig. 9. Tuning all layers gave high F1 scores for most parasitic egg types. The F1 scores were lower for two classes, Hymenolepis diminuta (82%) and Paragonimus spp. (86%). However, the F1 score of Paragonimus spp. with CoAtNet0 is still higher than that of EfficientNet-B4 (80%) and that of DenseNet121 (83%).
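The two scenarios, and the "tune the last N layers" settings used for the CNN baselines, all amount to choosing which layers receive gradient updates. The framework-free sketch below illustrates that choice; the paper does not name its framework, so the `Layer` class, the `tune_last_n` helper, and the parameter counts are hypothetical. Deep learning frameworks expose an equivalent switch, for example the `trainable` attribute on Keras layers or `requires_grad` on PyTorch parameters.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    num_params: int
    trainable: bool = True


def tune_last_n(backbone, head, n):
    """Freeze the backbone except its last n layers; the head always trains.

    n = 0 corresponds to freezing the whole backbone (head-only tuning),
    n = len(backbone) to tuning all layers.
    Returns the number of trainable parameters.
    """
    cutoff = len(backbone) - n
    for index, layer in enumerate(backbone):
        layer.trainable = index >= cutoff
    for layer in head:
        layer.trainable = True
    return sum(l.num_params for l in backbone + head if l.trainable)


# Hypothetical toy network (sizes are illustrative only).
backbone = [Layer("stem", 100), Layer("block1", 200), Layer("block2", 300)]
head = [Layer("classifier", 50)]

head_only = tune_last_n(backbone, head, 0)            # frozen backbone
last_layer = tune_last_n(backbone, head, 1)           # partial tuning
all_layers = tune_last_n(backbone, head, len(backbone))  # full fine-tuning
```

The trainable-parameter count grows with n, which is why full fine-tuning is the most expensive setting in both memory and time.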
The confusion matrix of CoAtNet after tuning all layers is shown in Fig. 10. The high values on the main diagonal indicate that CoAtNet recognized the parasitic eggs with high accuracy. Classes 0, 1, 2, 4, 6, 9, and 10 were recognized perfectly or nearly perfectly, with 199, 194, 197, 200, 196, 195, and 198 correct predictions out of 200, respectively.
Figure 11 shows activation maps and attention scores of various parasitic egg classes produced by CoAtNet0. The maps and scores illustrate that CoAtNet0 focuses its attention on the objects (eggs) inside the microscopic images and ignores irrelevant content.
Figure 12 compares the proposed CoAtNet solution with several baselines, including CNN-only based models such as EfficientNet-B4 with 40 layers tuned (90% accuracy and 90% F1 score), DenseNet121 with 20 layers tuned (86% accuracy and 86% F1 score), and EfficientNet-B7 with frozen layers and a tuned support vector machine (SVM) 33 (82% accuracy and 82% F1 score). Figure 12 also compares CoAtNet with other methods used for parasitic egg recognition 33, such as EfficientDet with an EfficientNet-v2 backbone 33 (88% accuracy and 85% F1 score). The results showed that the convolution and attention network (93% accuracy and 93% F1 score) outperformed the baseline methods. Additionally, CoAtNet0, which has a lower computational cost, outperformed the decision fusion method (92% accuracy and 93% F1 score) 33 in accuracy while matching its F1 score; that method has a complex architecture of two backbones that extract features and fuse the decisions produced in the output layers.
The combination of attention and convolution in CoAtNet helps to improve parasitic egg recognition because it combines the properties of both model families (convolutional networks and vision transformers). The convolutional component gives the task better generalization and efficiency, while the vision transformer component provides larger model capacity that can benefit from the large parasitic egg dataset. Together, these advantages lead to a higher testing accuracy than that achieved by standalone convolutional networks such as EfficientNet and DenseNet.
Finally, it is necessary to study the cost of each model in terms of time complexity and memory efficiency. As shown in Table 18, CoAtNet0 balances inference time and number of parameters relative to the CNN-based models (DenseNet121 and EfficientNet-B4) and the self-attention-based vision transformer. CoAtNet0 has a lower inference time and fewer parameters than the vision transformer. However, it has a higher inference time and more parameters than EfficientNet-B4 and DenseNet121.
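The measurement protocol behind the cost comparison is not described in the text. As one plausible sketch, the parameter count can be obtained by summing the sizes of all weight tensors, and the inference time by averaging wall-clock time over repeated forward passes after a short warm-up. The helper names, tensor shapes, and stand-in `predict` function below are hypothetical.

```python
import time


def count_parameters(shapes):
    """Total number of scalar weights, given the shape of each weight tensor."""
    total = 0
    for shape in shapes:
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total


def mean_inference_time(predict, sample, warmup=2, repeats=10):
    """Average wall-clock seconds per call of predict(sample)."""
    for _ in range(warmup):      # warm-up calls are excluded from timing
        predict(sample)
    start = time.perf_counter()
    for _ in range(repeats):
        predict(sample)
    return (time.perf_counter() - start) / repeats


# Hypothetical tiny model: one 3x3 convolution and one 11-way classifier.
shapes = [(3, 3, 3, 16), (16,), (16, 11), (11,)]
n_params = count_parameters(shapes)

# Stand-in for a model's forward pass.
seconds = mean_inference_time(lambda x: sum(x), [1, 2, 3])
```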
This paper shed light on an interesting challenge and contributed to the body of knowledge as follows:
1. Various pre-trained CNNs, such as DenseNet121 and EfficientNet-B4, and a pre-trained vision transformer were fine-tuned on the parasitic egg dataset for classification purposes. EfficientNet-B4 with five layers tuned gave the worst performance (75% accuracy and 75% F1 score), whereas EfficientNet-B4 with 40 layers tuned gave the best performance (90% accuracy and 90% F1 score). Additionally, DenseNet121 with 20 layers tuned gave 86% accuracy and 86% F1 score, and the vision transformer gave 86% accuracy and 86% F1 score.
2. A novel solution that combines convolution and attention was introduced to recognize eleven types of parasitic eggs. CoAtNet was proposed for parasitic egg recognition in microscopic images, classifying eggs into eleven categories. The pre-trained CoAtNet with all layers frozen except the classification head gave the worst performance of 71% accuracy and 71% F1 score. In contrast, the pre-trained CoAtNet with all layers tuned gave the best performance of 93% accuracy and 93% F1 score.
3. This work targeted a novel dataset that was proposed in the ICIP2022 challenge to recognize eleven types of parasitic eggs acquired under various complex conditions, such as different illuminations and resolutions. The proposed convolution and attention solution outperformed not only CNN-only based models and the self-attention vision transformer, but also the object detection method of EfficientDet with an EfficientNet-v2 backbone. This finding suggests adding CoAtNet as a backbone in an object detector to enhance detection performance.

Conclusion and future work
This paper presented a fast and highly accurate technique to recognize parasitic eggs in microscopic images and classify them into eleven categories. Various methods, including convolutional neural network (CNN) based models and convolution and attention (CoAtNet) based models, were evaluated and compared using the Chula-ParasiteEgg microscopic image dataset that was proposed in ICIP2022. The proposed CoAtNet identification technique was trained and fine-tuned on eleven types of parasitic eggs covering various egg sizes and numerous conditions such as illumination, resolution, and blurring. CoAtNet was found to reduce diagnosis errors and give high sensitivity. It outperformed CNN-based models such as EfficientNet-B4 and DenseNet121 and self-attention-based models such as the vision transformer. The proposed solution achieved an average accuracy of 93% and an average F1 score of 93%, which helps to diagnose the most common intestinal parasitic infections in humans in low-and-middle-income countries. This also contributes to preserving the status of the economic and health sectors in these countries. The finding opens the door to integrating the proposed solution into automated parasitological diagnosis. In this study, tuning all layers of the model required a powerful machine with multiple GPUs and a large amount of RAM to fine-tune the large number of parameters, which is a costly process. Therefore, we used the CoAtNet0 version because it has fewer parameters than the other versions, which reduces the training time, the hardware requirements, and thus the cost. Additionally, the proposed CoAtNet0 has a limitation: it cannot localize and classify parasitic eggs when more than one egg of different categories appears in the same microscopic image.
Hence, we intend to enhance the recognition performance by training bigger versions of convolution and attention networks, but this improvement requires more resources, such as a cloud platform with a cluster of several machines, several GPUs, and more memory. Additionally, the results showed superior performance of the convolution and attention network compared to convolution-only models, which highlights the potential of adding CoAtNet as a backbone in an object detector to detect and classify multiple eggs of different categories in one microscopic image. Furthermore, augmentation of microscopic images by blurring and adding noise can enhance the recognition performance.
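The blurring and noise augmentation mentioned as future work could take many forms. The sketch below is one simple possibility, assuming a grayscale image stored as nested lists with intensities in [0, 255], a 3 x 3 mean filter for blurring, and additive Gaussian noise; the function names and the noise level are hypothetical choices, not settings from the paper.

```python
import random


def box_blur(image):
    """3 x 3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [image[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to [0, 255]."""
    return [[min(255.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in image]


# Hypothetical 3 x 3 patch with a bright centre pixel.
patch = [
    [10.0, 10.0, 10.0],
    [10.0, 100.0, 10.0],
    [10.0, 10.0, 10.0],
]
rng = random.Random(0)  # seeded so the augmentation is reproducible
blurred = box_blur(patch)
noisy = add_gaussian_noise(blurred, sigma=5.0, rng=rng)
```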

Figure 2. The pipeline of egg classification task.
where g indicates the global spatial space. When the multi-stage layout is used, we mimic ConvNets to construct a network of 5 stages (S0, S1, S2, S3, and S4). The first stage, S0, is a simple 2-layer convolutional stem, and S1 employs MBConv blocks with squeeze-excitation because of the large spatial size. From S2 through S4, either the MBConv or the Transformer block can be used, but the convolution stages must appear before the Transformer stages so that local patterns are processed first. Considering generalization, model capacity, transferability, and efficiency, the convolutional and attention layers are simply stacked in a proper order. As a result, the C-C-T-T multi-stage layout was adopted for CoAtNet, where C and T denote Convolution and Transformer, respectively. The CoAtNet architecture is shown in Fig. 3, where L_i refers to the number of blocks. The relative attention (rel-attention) is the result of unifying depthwise convolution and self-attention. CoAtNet has several variants, from CoAtNet-0 through CoAtNet-4. Conv layers and MBConv blocks use kernels of size 3, while Transformer blocks use attention heads of size 32. As a result, 25 million parameters were trained in CoAtNet0.
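To make the idea of relative attention concrete, the following sketch works on a 1-D sequence and assumes the pre-normalization form from the CoAtNet formulation: a static, translation-dependent bias w[i - j] (the depthwise-convolution part) is added to the dot-product logit between positions i and j (the self-attention part) before the softmax over all positions. The function name, the single-head setting, and the toy inputs are assumptions made for illustration; the real network operates on 2-D feature maps with multiple heads.

```python
import math


def relative_attention(x, w):
    """Single-head relative attention over a 1-D sequence.

    x: list of L feature vectors.
    w: list of 2L - 1 scalars; w[(i - j) + L - 1] is the bias for offset i - j.
    """
    length = len(x)
    out = []
    for i in range(length):
        # Input-dependent similarity plus input-independent positional bias.
        logits = [
            sum(a * b for a, b in zip(x[i], x[j])) + w[(i - j) + length - 1]
            for j in range(length)
        ]
        top = max(logits)                      # subtract max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        norm = sum(exps)
        attn = [e / norm for e in exps]        # softmax over all positions j
        out.append([sum(attn[j] * x[j][d] for j in range(length))
                    for d in range(len(x[i]))])
    return out


# Toy input: two orthogonal unit vectors, with all biases set to zero.
x = [[1.0, 0.0], [0.0, 1.0]]
y = relative_attention(x, [0.0, 0.0, 0.0])
```

With the biases set to zero this reduces to plain dot-product self-attention; with the dot-product term removed it behaves like a normalized convolution kernel, which is the sense in which the two operations are unified.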

Figure 6. Activation maps of various classes utilizing EfficientNet-B4, where the left images are the activation maps after tuning the network, and the right images are the activation maps after freezing the network.

Figure 7. Confusion matrix of the DenseNet121 after tuning 20 layers.

Figure 8. Activation maps of various classes utilizing DenseNet121, where the left images are the activation maps after tuning the network, and the right images are the activation maps after freezing the network.
In the CoAtNet confusion matrix, with counts out of 200 samples per class, class 5 (Hymenolepis diminuta) was misclassified considerably more often than other classes, with 141/200 correct predictions. Most of the misclassified samples in class 5 were assigned to class 8 (Paragonimus spp.), which indicates that the features extracted by CoAtNet from microscopic images of classes 5 and 8 are similar. Additionally, classes 0 and 1 (Ascaris lumbricoides and Capillaria philippinensis) were recognized better by CoAtNet (199/200 and 194/200, respectively) than by EfficientNet-B4 (155/200 and 178/200, respectively) and DenseNet121 (155/200 and 168/200, respectively).

Figure 10. Confusion matrix of the CoAtNet0 after tuning all layers.

Figure 11. Activation maps of various classes of CoAtNet0 (first row) and attention score visualization of CoAtNet0 (second row).

Figure 12. Comparison between the proposed CoAtNet and the baseline methods.
with the properties of Vision Transformers (e.g., capacity). CoAtNet was found to achieve high performance across different data sizes, and specifically in the low-data regime. CoAtNet inherits the generalization capability of convolutional networks and the superior scalability of Transformer models. It was able to match the huge version of the vision transformer pre-trained on a 23× larger dataset. Given enough data, CoAtNet achieved faster convergence and improved efficiency, outperforming the vision transformer with less computation.

Table 7. Architecture of vision transformer.

Table 11. Classification report of the EfficientNet-B4 after tuning last 5 layers.

Table 12. Classification report of the EfficientNet-B4 after tuning last 20 layers.

Table 13. Classification report of the EfficientNet-B4 after tuning last 40 layers.

Table 14. Classification report of the DenseNet121 after tuning last 20 layers. Significant values are in [bold].

Table 15. Classification report of the vision transformer after tuning all layers. Significant values are in [bold].

Table 16. Classification report of the CoAtNet0 after tuning only the classification head.

Table 17. Classification report of the CoAtNet0 after tuning all layers.

Table 18. Comparison between methods in terms of inference time and memory efficiency.
Columns: Architecture; Number of parameters; Inference time (s); FLOPs (B).