Rapid and low-cost insect detection for analysing species trapped on yellow sticky traps

While insect monitoring is a prerequisite for precise decision-making in integrated pest management (IPM), it is time- and cost-intensive. Low-cost, time-saving and easy-to-operate tools for automated monitoring will therefore play a key role in increasing the acceptance and application of IPM in practice. In this study, we tested the differentiation of two whitefly species and their natural enemies trapped on yellow sticky traps (YSTs) via image processing approaches under practical conditions. Using the bag of visual words (BoVW) algorithm, accurate differentiation was possible between the natural enemies and the whitefly species Trialeurodes vaporariorum and Bemisia tabaci, whereas the procedure could not reliably differentiate B. tabaci from T. vaporariorum. The decay of trapped specimens was accounted for by using fresh and aged catches of all species on the YSTs, and different pooling scenarios were applied to enhance model performance. The best performance was reached when fresh and aged individuals were used together and the whitefly species were pooled into one category for model training. With an independent dataset consisting of photos of YSTs that were placed in greenhouses, and consequently with a naturally occurring species mixture as the background, a differentiation rate of more than 85% was reached for natural enemies and whiteflies.


Results
Classification of the sub-images (evaluation). The first comparative inspection of insect images from the YSTs immediately after trapping (Lab0d) and after a retention time of 7 days (Lab7d) revealed in part considerable morphological differences (Fig. 4). In the BEMITA category, for example, the white pairs of wings of freshly trapped animals were clearly visible, but this appearance changed towards almost complete transparency; a dark-coloured centre with very blurred boundaries between the individuals and the yellow background was retained. In contrast, the TRIAVA category, also a whitefly species, was only slightly affected by changes over time compared with the BEMITA category. Here, the white wings were still visually distinguishable from the yellow background even after 7 days of retention. In the MACRPY category, colour changes were also observed: the body colour of MACRPY individuals changed from bright green to pale or dark green. However, even after 7 days of exposure, the paired symmetrical antennae of the animals were still well recognizable. In the ENCAFO category, no notable differences in the subjective appearance of the insects could be detected.
The effects of varying the model parameters vocsize, colour and quantizer on the prediction accuracy and precision are shown separately for the validation of the models on unseen sub-images from the Lab0d and Lab7d datasets together (Tables 1, 2) and for the independent full scene images from the GH0-7d dataset (Tables 3, 4). Regarding the quantizer used, no considerable differences were found in terms of the class mean accuracy. Moreover, it should be noted that the use of VQ requires considerably higher computing power. Since it produced only a minor improvement in accuracy, no further comparative results regarding the quantizer are given.
Classification without temporal and categorical pooling. The lowest average recall in the entire experiment was 77.57%. This validation value was obtained by applying k-d tree model 165, trained with temporal but no categorical pooling on HSV-converted images with a vocsize of 200 words, to the Lab7d dataset (Table 1). The maximum recall achieved was 93.97% for model 164 applied to the Lab0d dataset and was thus approximately 3% higher than that of model 163, which was trained with greyscale images and the same vocsize. The maximum recall achieved when testing on the Lab7d dataset was 79.76%, using greyscale images, and thus slightly better than that using RGB images. In contrast, after 7 days of the insects remaining on the YST, the precision values decreased by approximately 20% compared with the test results of the initial measurement on day 0.
With regard to class mean accuracy, the dictionary size had no obvious influence, but it did affect the recall in individual categories. Within the individual categories, the recall of the BKGRND class was, as expected, the highest: a maximum value of 99.13% was achieved regardless of colour space conversion or dictionary size. The results within the ENCAFO class were similar. Here, the best classification results were achieved with greyscale images and dictionary sizes of 200 and 500 words. For the best-performing model in terms of overall accuracy (ID 164), the values for accuracy and precision with and without categorical pooling are plotted individually in Fig. 1A,B, respectively.
A direct comparison of the two categories BEMITA and TRIAVA (Lab0d dataset) showed that the models trained with 500 words achieved a 1% better recognition rate in the TRIAVA category, while the same value for the BEMITA category decreased by almost 2%. In terms of precision, a considerable decrease was found in the BEMITA and BKGRND categories. For the BEMITA category, the precision values were only between 50% and almost 60%; for the BKGRND class, the values were even lower, with a maximum of 44.23%. This finding indicates a less robust prediction model, especially for these categories. However, BEMITA and TRIAVA were less well recognized than the objects in the other categories. Remarkably, the recall in the BEMITA category showed slightly increased values even after one week of the insects remaining on the trap; the highest recall here was 86.64%, for RGB images and a dictionary size of 200 words. The different performances of the models with regard to feasible categorical pooling of BEMITA and TRIAVA are shown again in a recall-precision plot (Fig. 1). Essentially, this plot shows that images of the Lab0d dataset from the categories ENCAFO, MACRPY and BKGRND were detected correctly with almost 100% confidence; for BEMITA and TRIAVA, the values are above 80%. It also becomes apparent that the morphological changes in individual insects caused by 7 days of remaining on the YST have a considerable influence on the detection rate.
For the Lab7d dataset, the lowest detection rates were found in all categories except MACRPY and the BKGRND class (Fig. 1A). With the exception of MACRPY, the values for ENCAFO and TRIAVA are below 60% for recall but above 75% for precision. For BEMITA, the recall is above 86% (100% for BKGRND), but the precision values are below 60% and 50%, respectively. After pooling the categories BEMITA and TRIAVA, the negative effects of decay could be reduced; only the categories BKGRND and ENCAFO still showed a poor detection rate with the Lab7d dataset. The results in Table 2 suggest that a good detection rate can also be achieved for morphologically similar individuals such as BEMITA and TRIAVA (better than 70% for RGB images and dictionary sizes of 200 and 500 words, respectively). Considering the individual categories, it is apparent that the precision for BEMITA in particular was slightly lower than in the tests on the Lab0d dataset but improved compared with the results on the Lab7d dataset (compare Table 2 with Table 1). In the BKGRND category, on the other hand, the precision was clearly improved. Overall, the mean accuracy for temporal pooling, for both recall and precision, was close to 90%, indicating high model performance.
Classification with categorical pooling. Due to the morphological similarity of the BEMITA and TRIAVA individuals, it was reasonable to perform another test to illustrate the recognition performance of the BoVW models after combining both categories (categorical pooling). The results for the new category BEM-TRI (BEMITA and TRIAVA) are shown in Table 3. The maximum overall accuracy achieved was 98.46% (RGB images and a dictionary size of 200 words). Overall, the recall and precision indicated a slight gain in model performance compared with the case without pooling into BEM-TRI. In the new category BEM-TRI, the recall values were always above 90% regardless of the insects' retention time on the YST, the colour space used and the dictionary size, with one exception (model ID 190 with the Lab7d dataset, RGB images and a dictionary size of 500 words). The results in Table 3 also show that the precision values were all approximately 90% or above, except for the BKGRND class, which was slightly worse than without pooling.

Classification with categorical and temporal pooling. The last abstraction stage of pooling included a combination of the datasets Lab0d and Lab7d (temporal pooling), as well as the combination of the insect categories BEMITA and TRIAVA (categorical pooling). This stage is relevant under practical conditions because it cannot be determined exactly how long an arthropod has already remained on the YST at the time of sampling, and one sampling per week is realistic. After this pooling, the maximum overall accuracy was still 95.81% for RGB images and a dictionary size of 200 words (Table 4). The precision and recall for all species categories were above 90% for all the models, indicating that this pooling resulted in the most robust model.
Test on the GH datasets-classification without categorical pooling. When the non-temporally pooled models were applied to the image dataset GH0d-7d, the recall and precision results shown in Table 5 were obtained. The averaged overall recall reached its maximum value of 85.19% with model 164 (RGB images, vocabulary size of 200 words). The precision of the same model was markedly lower; in terms of precision, model 163 performed best, with a value of 67.11%. Particularly low precision values were found in the BEMITA category: although an acceptable maximum recall of 71.64% was still achieved, the average recognition rate was only 25.44%. The reason for the low detection rate in the BEMITA category is illustrated by a confusion matrix (Table 6), which shows that images of BEMITA were mainly classified as TRIAVA (n = 135). Categorical pooling is therefore quite reasonable and useful.
Test on the GH datasets-classification with temporal and categorical pooling. Based on the results of Tables 3 and 4, we show how categorical pooling of the classes BEMITA and TRIAVA affects the recall and precision. The average overall recall increased strongly to a maximum value of 96.15% (model 184) when BEM-TRI was pooled (Table 7). The overall precision of the same model increased by 15.48% to a value of 82.59% compared with the non-pooled data. In the pooled BEM-TRI category, the maximum recall was 84.60%, achieved with the model trained with RGB images and a dictionary size of 200 words (model 184).
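The effect of categorical pooling on a confusion matrix can be illustrated with a short sketch. The counts below are hypothetical, chosen only to mirror the pattern reported above (the BEMITA-to-TRIAVA confusion of n = 135 is the one figure taken from Table 6); pooling simply sums the corresponding rows and columns before recall is recomputed:

```python
import numpy as np

# Hypothetical 4-class confusion matrix (rows = true, cols = predicted).
# Class order: BEMITA, TRIAVA, ENCAFO, MACRPY. Counts are illustrative only.
cm = np.array([
    [40, 135, 2, 1],   # many BEMITA images predicted as TRIAVA
    [5, 150, 1, 0],
    [0, 2, 60, 1],
    [1, 0, 2, 70],
])

def recall(cm):
    # Per-class recall: diagonal (correct) over the row sum (all true members).
    return np.diag(cm) / cm.sum(axis=1)

# Pool BEMITA and TRIAVA into one BEM-TRI class by summing the
# corresponding rows and columns of the confusion matrix.
idx = [[0, 1], [2], [3]]  # BEM-TRI, ENCAFO, MACRPY
pooled = np.zeros((3, 3), dtype=int)
for i, rows in enumerate(idx):
    for j, cols in enumerate(idx):
        pooled[i, j] = cm[np.ix_(rows, cols)].sum()

print("BEMITA recall before pooling:", recall(cm)[0])
print("BEM-TRI recall after pooling:", recall(pooled)[0])
```

Because the dominant error mode is BEMITA being predicted as TRIAVA, merging the two classes turns those off-diagonal counts into correct predictions, which is why the pooled recall rises so sharply.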
The effects of categorical pooling are again clearly shown in the confusion matrix (Table 8). All the images in the BEM-TRI category are correctly recognized.
Finally, an attempt was made to discriminate individuals by applying the BoVW approach under practical conditions. For this purpose, 21 full scene images of YSTs taken from the greenhouse chambers (GH0-7d dataset) were analysed. Species were identified, located and counted for the categories described above. Figure 2 shows the result of the localization: the white crosses denote the positions of the manual samplings (output from ImgObjectLocator), and the red circles denote the positions of the automatic detections. In Fig. 2A, the manual and automatic annotation points for the category TRIAVA match nearly perfectly. Figure 2B shows a category map for the trained categories, including BEM-TRI, ENCAFO and MACRPY, with the corresponding colour coding.

Discussion
Although the experiments were performed with relatively simple materials and devices, the results regarding species identification, as well as the experiences with handling, were encouraging. A system in which growers still need to handle each yellow trap regularly, although with reduced time and knowledge requirements, clearly does not fit all growing systems. However, especially in a greenhouse environment, where a relatively large number of traps is needed to adequately reflect arthropod population developments in the crops, it has advantages compared with systems currently on the market. The cost is low because all the required equipment apart from the lighting setup is already privately owned by the user, i.e., part of a standard smartphone. Therefore, the costs are much lower than those of technically highly developed stand-alone trap systems, which require components such as cameras, lighting, transmitters and power supplies. Additionally, images can be taken of all kinds of traps without dependence on size or material. Furthermore, the additional workload of taking the pictures is limited, since growers and workers in greenhouse vegetable production regularly work throughout the whole crop (at least once a week) and can take pictures of sticky traps on the go. Images of the YSTs in this study were taken with a smartphone; in other words, they were not taken with a sophisticated camera system. Considering this aspect, the performance of the studied machine learning approach was surprisingly good. Clear differentiation of E. formosa, M. pygmaeus and whiteflies, including B. tabaci and T. vaporariorum, was achieved. Differentiation between the two whitefly species was not possible, mainly because the true categorization of B. tabaci was not possible. This finding is independent of the dictionary size, the colour space and the SVM quantizer used.
Therefore, it can be assumed that the SIFT feature detector is not able to find sufficiently significant features in the images of the species, which is especially the case when identifying individuals that have been trapped on the YST for several days. A workaround could be to monitor the YSTs more frequently. A differentiation between BEMITA and TRIAVA may be feasible with the presented procedures if the program corrected the BEMITA counts by subtracting individuals that were already counted as TRIAVA. Because the TRIAVA procedure had very few false positives (precision > 90%, Table 4), the remaining positives of the BEMITA procedure would have very few false positives as well, because misdetections can be accounted for almost entirely by TRIAVA. The possibility of distinguishing both species has concrete implications for pest management, because the parasitoid E. formosa is known to be less effective against B. tabaci than the parasitoid Eretmocerus mundus, while the reverse applies to T. vaporariorum. Additionally, the situation for virus transmission differs between the two whitefly species. The success in differentiating the two beneficials from each other and from the pest species using the GH0-7d dataset from the greenhouse environment is very promising. For instance, the recognition of M. pygmaeus remained very good even when high numbers of cicadas, which are relatively similar in size and colour, were present on the traps. This result indicates a certain robustness against ageing of the trapped insects, which, especially in whiteflies and M. pygmaeus, can result in a change in colour.

There is no doubt that deep learning approaches can achieve much better results than traditional methods, especially regarding the differentiation of morphologically very similar species. By taking images of a YST in a defined environment using a low-cost or smartphone camera and a Scoutbox, it was shown that three individuals were precisely classified on YSTs by applying region-based convolutional neural networks18. However, such an approach was not the aim of the present study. Here, a classical machine learning approach was tested for its suitability for use with YSTs, because the training datasets it requires are many times smaller than the comparable datasets needed to train deep neural networks.
However, our results suggest that precise differentiation between species on YSTs based on smartphone photos should be possible using deep neural networks once sufficiently large datasets can be provided. The handling of the prototype is rather easy, but for professional use, the plastic box that was used to maintain a constant distance and even lighting needs to be redesigned with a lighting source and a clipping system that allows easy attachment to the YST. Furthermore, the robustness of the analyses with regard to different smartphones and camera types must be tested.
From the current data, experiments and analyses, we assume that this technique is relevant for practice if (1) usability with common smartphone types is given and (2) robustness of the analysis can be confirmed with images from a variety of growing situations and species mixtures on traps. Robustness against trap types from different producers would be a benefit and is already indicated by the use of two trap types in the current study.

Concept of the bag of visual words. The principle behind BoVW is to reduce the information content of an image in relation to a generalized set of features from many images of a specific theme (image universe). This generalized set of features is found by retrieving the most relevant information in the image universe with key point extractors and clustering the key point descriptors. The generalized features, or more specifically, the estimated cluster centres, are referred to as the visual dictionary of the image universe. For a new image from the same theme, it is then possible to approximate the nearest relationships between the features of the visual dictionary and the features extracted from that image. The frequency vector counting the number of such relations can be seen as a footprint of the image. This frequency vector is called the bag of visual words. If we have known labels referenced to the images, we can train an image classifier simply by classifying the BoVW vectors, typically with SVMs (support vector machines) or nearest neighbour approaches.
Image classification with BoVW. In our case, the image universe consists of images of YSTs, and the labels are given by the arthropod species and the background of the YSTs. The image classifier was created following ref. 22, as shown in Fig. 3. According to this concept, the first step was to extract local image features from a training dataset of numerous insect images with a SIFT key point descriptor. SIFT is a blob detector that is invariant to scale and rotation and a standard algorithm for key point detection and description23. Based on unsupervised k-means clustering, cluster centres within the feature space were then calculated. These cluster centres form the visual dictionary; that is, the code words constituting the dictionary are the Euclidean centres of the key point descriptors assigned to each of the n clusters. The mapping of new input data onto the nearest code word was tested with two quantizers: vector quantization (VQ) and space partitioning in a k-dimensional tree structure (k-d tree). In contrast to the linear search of VQ, the k-d tree uses a hierarchical data structure to find nearest neighbours by recursively partitioning the source data along the dimension of maximum variance24. To build the visual dictionary, we used our own classification framework based on the SIFT key point detector provided by the VLFeat toolbox (version 0.9.21) for MATLAB24. The dictionary size (vocsize), i.e., the number of code words (or cluster centres), was systematically varied between 200 and 500.
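This dictionary-building and quantization step can be sketched as follows. The sketch uses Python with SciPy rather than the MATLAB/VLFeat framework actually used in the study, and random vectors stand in for SIFT descriptors; only the pipeline shape (k-means dictionary, then VQ versus k-d tree assignment) reflects the text above:

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Random stand-ins for SIFT key point descriptors pooled from all
# training images (real SIFT descriptors are 128-dimensional).
descriptors = rng.normal(size=(2000, 8))

# Visual dictionary: cluster centres from unsupervised k-means.
vocsize = 50  # number of code words; the study varied this from 200 to 500
codebook, _ = kmeans2(descriptors, vocsize, minit="++")

# Quantizer 1: exhaustive vector quantization (VQ).
words_vq, _ = vq(descriptors, codebook)

# Quantizer 2: nearest-neighbour search in a k-d tree built on the codebook.
tree = cKDTree(codebook)
_, words_kd = tree.query(descriptors)

# Both quantizers map each descriptor to its nearest code word, so the
# assignments agree; the k-d tree is simply cheaper to query at scale.
print(np.array_equal(words_vq, words_kd))
```

This mirrors the observation above that the choice of quantizer barely changed accuracy: both methods return the same nearest code word, and they differ mainly in computing cost.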
In the second step, local image features were again extracted from the insect images, this time with known labels, using the same key point extractor, and compared with the references from the visual dictionary. The key point descriptors of each image were related to the dictionary with an approximate nearest neighbour search to construct frequency vectors, which were stored as bags of visual words. Combined with the known image labels, these frequency vectors were finally used to calibrate the SVM image classifier.
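A minimal sketch of this footprint construction is given below (Python stand-in for the MATLAB implementation; the toy two-dimensional dictionary and descriptors are invented purely for illustration):

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Map an image's key point descriptors to their nearest visual words
    and return the normalized frequency vector (the bag of visual words)."""
    # Squared Euclidean distance from every descriptor to every code word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # nearest code word per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()   # normalize: images differ in key point count

# Toy dictionary of 4 visual words in a 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Descriptors from one hypothetical insect sub-image.
desc = np.array([[0.1, 0.0], [0.9, 0.1], [0.1, 0.1], [0.95, 0.9]])
print(bovw_histogram(desc, codebook))  # counts [2, 1, 0, 1] -> [0.5, 0.25, 0.0, 0.25]
```

The resulting vector is the "footprint" described above: one entry per visual word, counting how often that word was the nearest match for a descriptor in the image.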
Finally, support vector machines with a linear kernel were used to calibrate the image classifier, using BoVW vectors with known labels as input. SVM modelling was performed with a stochastic gradient descent (SGD) solver, which minimizes the primal optimization problem of the SVM.

YSTs were put into the rearing cages for two minutes to trap a certain number of whiteflies. For trapping M. pygmaeus and E. formosa, YSTs were put into a plastic box together with the respective carriers of E. formosa and M. pygmaeus as delivered by Katz Biotech. A YST with both TRIAVA and BEMITA trapped on it was produced to test whether discrimination of these very similar species within the Aleyrodidae family is possible. In addition to the abovementioned YSTs, Horiver YSTs (Koppert Biological Systems) were used in the same manner in the greenhouse. The YSTs were positioned at the height of the tips of the plants between double rows in a vertical alignment. In each of four soil greenhouse chambers (40 m2), one YST was installed, and in a larger greenhouse (170 m2), two YSTs were installed. These YSTs were removed weekly, images were taken, and fresh YSTs were installed. This resulted in a trapped species mix of TRIAVA, BEMITA, MACRPY and ENCAFO (GH0-7d), of which TRIAVA occurred naturally and the beneficials MACRPY and ENCAFO were introduced. Only BEMITA did not occur naturally and was introduced into the greenhouse: to produce BEMITA catches for the GH0-7d dataset, four YSTs were put into a BEMITA rearing cage for 2 min just before they were installed in greenhouse chambers for a duration of one week. This means that all BEMITA catches within the species mix of the GH0-7d dataset were exactly 7 days old when the images were taken. In these traps, a realistic number of other insects was present, including mainly other hemipterans, such as aphids and cicadellids, and Diptera (Fig. 4A).
These other species belong to the background and were not further differentiated.

Standardized image acquisition.
To develop a functioning object recognition algorithm, images of YSTs need to be taken not only with a standardized procedure but also under conditions comparable to practical tomato cultivation. The images were taken with a "Galaxy A3" smartphone (Samsung, 2017). To achieve uniform illumination without reflections on the YST surface, the dim plastic freezer box "domino" (Rotho, Würenlingen) was used as shielding (Fig. 4B). The chosen plastic box had internal dimensions of 13.8 cm × 9.8 cm and a height of 7.3 cm. In the middle of the bottom part, a hole 1.3 cm in diameter was drilled, and the lid was removed.

In total, n = 5,866 images of insects were generated in this way from n = 166 YST images recorded immediately after trapping (Lab0d). Another n = 1,435 images of single insects were generated after 7 days (Lab7d) to account for decay processes of the insects trapped on n = 87 YSTs. A completely independent dataset was generated from images of another n = 21 YSTs; from n = 1,011 markers set in the ImgObjectLocator application, images were cropped to finally validate the BoVW models.
In summary, a dataset of four insect categories was generated for training and testing the BoVW models (Fig. 6). Training with categories containing only the target objects usually yields high detection rates under the assumption that the prediction classes cover a limited, known universe of objects. This assumption often holds under defined laboratory conditions. In practice, such as insect detection under natural conditions, the detection performance would decrease and a higher number of false positives would be generated. To reduce this effect, we introduced an additional class, BKGRND, as a background or "garbage" class representing the background of the yellow sticky trap (YST). The background training images were taken in the same way as the insect training images, but at locations where no insects were present. In the case of a low prediction probability for an insect class, the match with BKGRND would be high. If no background class were defined, the background of the YST would be distributed among the other classes, and sliding-window prediction over the YST would not be possible. There was no need for an additional "garbage" class covering other insect species, because all insect species present in the greenhouse were within the prediction class space.
Hyperparameter optimization, training and testing. To estimate their influence on the prediction accuracy, different input parameters were systematically varied. These include the dictionary size for codebook generation (vocsize), the colour space of the input images (greyscale, HSV and RGB) and the quantization method used with the SVMs (quantizer). The range of these input parameters was limited beforehand by pre-tests using the Lab0d dataset. To test the impact of different colour spaces, converted 8-bit greyscale and HSV (hue, saturation and value) images were used to train the BoVW models in addition to the original 8-bit RGB images. The colour space transformations were conducted with the MATLAB built-in functions rgb2gray and rgb2hsv25. The systematic pre-tests of parameter optimization finally resulted in 12 models for validation on a dataset of sub-images and for final testing. All of these models were based on the k-d tree quantizer, including three models each per colour space and dictionary size. Models with IDs 163 to 165 and 169 to 171 were trained and tested with the full dataset of Lab0d and Lab7d (temporal pooling during model training) plus separate categories for BEMITA and TRIAVA (no categorical pooling during model training). During the training phase of the models with IDs 183-185 and 189-191, categorical pooling of the classes BEMITA and TRIAVA was carried out in addition to temporal pooling. Another six models were tested based on VQ (model IDs 160-162 and 166-168 with temporal pooling, and 180-182 and 186-188 with temporal and categorical pooling). Each step was performed with a random split of the training data into 75% for modelling and 25% for optimization. Only the final output models were then evaluated on the independent test set.
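The training scheme of a 75/25 random split combined with a linear SVM fitted by stochastic gradient descent can be sketched as follows. This is a Python/NumPy stand-in for the MATLAB pipeline; the BoVW vectors are synthetic (two invented classes concentrating their key points on different visual words), and the simple fixed-step SGD on the primal hinge loss is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for BoVW frequency vectors of two insect classes:
# each class concentrates its key points on a different subset of words.
vocsize = 200
def fake_bovw(n, hot):
    counts = rng.poisson(0.2, size=(n, vocsize)).astype(float)
    counts[:, hot] += rng.poisson(5.0, size=(n, len(hot)))
    return counts / counts.sum(axis=1, keepdims=True)

X = np.vstack([fake_bovw(200, np.arange(0, 20)), fake_bovw(200, np.arange(20, 40))])
y = np.array([-1.0] * 200 + [1.0] * 200)

# Random 75/25 split: 75% for modelling, 25% for optimization/validation.
perm = rng.permutation(len(y))
cut = int(0.75 * len(y))
tr, va = perm[:cut], perm[cut:]

# Linear SVM trained by SGD on the primal hinge-loss objective:
# lambda/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
w, b, lam, eta = np.zeros(vocsize), 0.0, 1e-4, 0.5
for epoch in range(20):
    for i in rng.permutation(tr):
        margin = y[i] * (X[i] @ w + b)
        w *= 1 - eta * lam            # shrinkage from the regularizer
        if margin < 1:                # inside the margin: hinge subgradient step
            w += eta * y[i] * X[i]
            b += eta * y[i]

val_acc = np.mean(np.sign(X[va] @ w + b) == y[va])
print("validation accuracy:", val_acc)
```

Only after such a split-based optimization would the final model be evaluated on the independent test set, exactly as described above for the Lab0d/Lab7d and GH datasets.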
Here, TP is the number of true positives, i.e., the number of items correctly labelled as belonging to the respective class, and FP is the number of false positives, i.e., items incorrectly labelled as belonging to the respective class. In contrast, TN is the number of true negatives, i.e., items correctly labelled as not belonging to the respective class, and FN is the number of false negatives, i.e., items incorrectly labelled as not belonging to the respective class. The precision is the fraction of retrieved items that are actually relevant, i.e., TP/(TP + FP). The recall is the fraction of relevant items that were retrieved, i.e., TP/(TP + FN). The class mean accuracy refers to the proportion of predictions that the model estimated correctly, calculated as the mean over classes.
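These definitions can be written out compactly from a confusion matrix. The matrix below is invented for illustration, and reading the class mean accuracy as the mean of the per-class recalls is an assumption consistent with the description above:

```python
import numpy as np

def per_class_metrics(cm):
    """Recall, precision and class mean accuracy from a confusion
    matrix with rows = true class and columns = predicted class."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp   # class members predicted as something else
    fp = cm.sum(axis=0) - tp   # other classes predicted as this class
    recall = tp / (tp + fn)      # fraction of relevant items retrieved
    precision = tp / (tp + fp)   # fraction of retrieved items that are relevant
    return recall, precision, recall.mean()

# Illustrative 3-class confusion matrix (invented counts).
cm = np.array([
    [80, 15, 5],
    [10, 85, 5],
    [0, 5, 95],
])
recall, precision, class_mean_acc = per_class_metrics(cm)
print(recall)          # per-class recall: [0.8, 0.85, 0.95]
print(class_mean_acc)  # mean of the per-class recalls
```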
Ethical approval. This article does not contain any studies with human participants or animals (vertebrates) performed by any of the authors.

Code availability
The code used to identify the examined insect species is not freely available.