Semi-automated identification of biological control agent using artificial intelligence

The accurate identification of biological control agents is necessary for monitoring and preventing contamination in integrated pest management (IPM); however, this is difficult for non-taxonomists to achieve in the field. Many machine learning techniques have been developed for multiple applications (e.g., identification of biological organisms). Some phytoseiids are biological control agents for small pests, such as Neoseiulus barkeri Hughes. To identify a precise biological control agent, a boosting machine learning classification, namely eXtreme Gradient Boosting (XGBoost), was introduced in this study for the semi-automated identification of phytoseiid mites. XGBoost analyses were based on 22 quantitative morphological features among 512 specimens of N. barkeri and related phytoseiid species. These features were extracted manually from photomicrograph of mites and included dorsal and ventrianal shield lengths, setal lengths, and length and width of spermatheca. The results revealed 100% accuracy rating, and seta j4 achieved significant discrimination among specimens. The present study provides a path through which skills and experiences can be transferred between experts and non-experts. This can serve as a foundation for future studies on the automated identification of biological control agents for IPM.

Scientific RepoRtS | (2020) 10:14632 | https://doi.org/10.1038/s41598-020-71798-x www.nature.com/scientificreports/ can be applied to supervised learning tasks, such as regression and classification. Several advantages of XGBoost have been reported, including its rapid computational speed, favourable model performance, ability to manage sparse or missing data, and low specimen requirements [16][17][18] . Identification using XGBoost has recently become popular in studying medicine 18 and plant diseases 19 . Therefore, these characteristics of XGBoost make it suitable for use on biological organisms. The mite family Phytoseiidae (Acari: Mesostigmata) inhabits diverse habitats, such as agronomic crops, horticultural crops, xylophyta, herbaceous plants, and weeds 20,21 . Some members of the family are predators of phytophagous mites and other small arthropods, whereas others feed on fungal spores, leaf juices, pollen grains, soil litter, and other plant materials [21][22][23] . They have received much attention because of their potential as biological control agents in IPM 21 . The predatory mite Neoseiulus barkeri Hughes is a species found worldwide and distributed across Asia, the Americas, Australia, Africa, and Europe 24,25 ; it can be used as a biological control agent. McMurtry et al. 21 proposed a lifestyle classification of phytoseiid mites, which included four types and ten subtypes. N. barkeri is a generalist predator with a type III lifestyle, meaning that it feeds on mite pests, some small insects such as thrips and whiteflies, and various pollens 21,[26][27][28][29][30][31] . This species is also known for controlling Thrips tabaci 26,29 , Polyphagotarsonemus latus 27 , and Tetranychus urticae 30,31 .
To help non-taxonomists identify the aforementioned biological control agents, a semi-automated identification method using XGBoost based on morphological features (e.g., dorsal setal length) was developed. The dorsal setal lengths of phytoseiid mites have long been studied to identify them and reconstruct their phylogeny 7,8,33,34 . In the present study, taxonomists first selected morphological characteristics. Subsequently, we manually extracted quantitative measurements for analysis. Therefore, the method in this study was semi-automated. Automatic identification by using photomicrograph to study mites is difficult because of their small body size and the complexity of phytoseiid morphology. This is also time-consuming because microscopic photos must be used. Furthermore, dorsal setae are occasionally missing because of the slide-making procedure. XGBoost was advantageous in the present study for its rapid computational speed and ability to function even with missing data. Therefore, we developed a method for the semi-automatic identification of N. barkeri and related species on the basis of their morphological features.

Results
We collected and captured microscopic images of 512 collected female phytoseiid specimens. Subsequently, from the microscopic images, we manually recorded 22 morphological measurements, including the lengths and widths of dorsal and ventrianal shields, lengths of 14 pairs of dorsal setae, length of ventral seta JV5, length of macroseta St IV, and length and width of the spermatheca calyx ( Fig. 1, Table 1, Table S1).
During cross-validation, the number of errors in distinguishing the target species (i.e., N. barkeri) decreased as the number of decision trees increased and the XGBoost model fitted to training dataset. By contrast, the number of prediction errors of the fitted model regarding the morphological features of the testing dataset was lower with a greater number of trees but was generally higher than of the model during the training process. Through cross-validation and after 23 decision trees has been generated, the training and testing processes attained the same performance level in XGBoost modelling, and they achieved the identification accuracy of 100 and 99%, respectively (Fig. 2). Finally, with all datasets and the determined number of decision trees, the identification accuracy reached 100%.
Because the number of decision trees was determined through cross-validation, the results of the final model using full data exhibited zero errors in distinguishing between the target species and related species. In addition, this final model indicated key morphological features (according to their importance; Fig. 3), and the individual conditional expectation (ICE) plots exhibited their determinative roles (Fig. 4). Twenty-one features (not all) were considered critical for target species identification. Furthermore, these key features were assigned to four clusters with different levels of importance. Their determinative roles influenced the probability of the target species having several patterns (unimodal, monotonically increasing or decreasing, or more complex patterns). For example, seta j4 was the most crucial feature (more than 40%) and exhibited a unimodal pattern. The predicted probability of being the target species was clearly higher when the length of seta j4 varied within the range of and 30 μm. In addition, this probability decreased for other higher or lower values of seta j4 relative to that for the middle range. By contrast, the seta J5 was the second most crucial feature. Rather than exhibiting a unimodal pattern, a longer length of seta J5 monotonically increased with an increase in the probability of being the target species. A monotonically decreasing pattern was identified for the length of seta j3, which was considered the 11th most crucial feature. Notably, the spermatheca was the most crucial identifying feature, but it was ranked 18th in terms of feature importance (Figs. 3 and 4).

Discussion
To the best of our knowledge, the present study is the first to use a machine learning technique to identify biological control agents for agricultural use and is also the first such study on mites. The results achieved 100% accuracy in distinguishing between target species and related species. The high percentage of correctly identified specimens from our data set suggests that the practical use of XGBoost for phytoseiid identification may become more prevalent. Generally, more predictors (i.e., features in our study) for a certain number of samples (i.e., specimens) can improve modelling performance. Although overfitting occurs when noise is mistaken for information, validation helps to prevent this problem from influencing model performance. After validation, high accuracy (100%) was retained. This was unsurprising because a taxonomist selected morphological traits that have long been considered useful for identification.  13 . Furthermore, such an approach cannot be applied in every situation (e.g., the identification in our study). Nevertheless, we resolved the problems of insufficient and missing data by using machine learning to identify biological organisms. XGBoost saves considerable time and achieves more reliable identification than does DL (e.g., Fedor et al. 14 ). This method is convenient for nonexperts to use; however, several limitations must be mentioned. First, DL can only import raw image data or extract features automatically. However, features must be extracted first when XGBoost is applied; therefore, the selection of morphological features in the first step is crucial. Second, only the taxa included in the training set can be identified using the XGBoost algorithm. The identification of novel and excluded species is problematic for XGBoost and all machine learning techniques 10 . Third, the morphological boundaries of closely related spe-    www.nature.com/scientificreports/ subfamily Amblyseiinae were selected. Dorsal setae j2, j5, J2, S2, S4, and S5 were excluded from the comparison. The present results indicated that XGBoost can overcome interspecific variations among phytoseiid mite specimens. However, for the classification of more taxa, setal patterns may present a limitation for comparison, and more morphological features should be chosen. Some features are crucial for mite identification through machine learning. The results presented here reveal that 22 quantitative features can be helpful for identifying phytoseiid mites. Most of these measurements involved setal length, and the results indicated that seta j4 is one of the most critical features. Dorsal setae are the most critical features for phytoseiid identification and phylogeny 32 . However, the morphological features have intraspecific variations within species caused by factors such as gene expression, life stage, and environmental conditions 10 . Variations in setal length and pattern can occur in phytoseiid mites 8,34 . Tixier et al. 7 reported significant differences in setal lengths between winter and summer seasons, indicating that additional features are necessary for species identification. Tixier 8 reported on setal length variations in phytoseiid mites and recommended using at least 10 specimens for comparison. Moreover, apart from aspects related to the mites themselves, two possible reasons for errors should be considered: measurement errors 5 reveal a need for standard measurement among experts; and excessive use of medium while making slides influences setal length. The slide-making technique also affects the observation of morphological features 6 , such as chelicera dentition and spermatheca shape, both of which are key identifying features. However, Toyoshima and Amano 34 reported six conditions of setal pattern variation in phytoseiid mites: (1) absent seta, (2) additional seta, (3) inserted seta, (4) deviated seta, (5) expanded seta, and (6) shortened seta. They considered such phenomena to be caused not by accidents during post-embryonic development but by heritable features. Future efforts can be focused on the association between phylogeny and ML classification; this association may provide a basis for mechanistic explorations.

Conclusions
The automated identification of biological organisms may gradually become a reality through the development of machine learning techniques, which are widely applied in domains such as medicine 18 and agriculture 14,19 . The automated extraction of morphological features is a method that will be used in the future but requires more development in the context of mite research. Many studies have reported on the automatic extraction of morphological features, such as insect wing veins 3,9 ; however, mite studies have many limitations. High-quality slide specimens and high-resolution microscopic photos are required. Porto and Voje 9 proposed a machine learning pipeline for automated detection and landmarking of biological structures in images. This method may enable the collection of morphological data from images. However, the main morphological features of phytoseiid mites are setal patterns (presence or absence) and setal lengths 8,32 . Detecting setae is impossible because their direction of setae may vary and can be affected by the slide-making technique. In addition, Prasad and Tixier 33 indicated that the difficulty in obtaining accurate length measurements is due to setae not being straight and having thin and pointed tips. To develop automated identification techniques for phytoseiid mites, additional features must be used for identification, such as the distances between setae, distance between pores, and shield width and length. However, tabletop scanning electronic microscopes (TSEMs) may be another method for capturing high-resolution photos. TSEMs have recently begun being applied in mite studies; for example, Yamasaki et al. 35 compared light microscope and TSEM photos of feather mites. TSEMs have several advantages, such as the ability to capture the living positions of specimens, simple operability by non-experts, and no need for pre-processing.
Our study is the first to successfully identify biological control agents through artificial intelligence. Accurately identifying pests is the first step of IPM (e.g., Fedor et al. 14 ); however, the accurate identification of biological control agents is also necessary 2 . The present study developed a semi-automated identification method for phytoseiid mites by using XGBoost, which can facilitate the accurate identification of target and non-target phytoseiid species for IPM. After morphological features of phytoseiid mites were selected in accordance with experts' recommendations, the machine learning technique could bridge the gap between taxonomists and non-taxonomists, through the transfer of taxonomy skills and experience. This approach can also provide a quick method for dealing with insufficient or missing data. Therefore, XGBoost can be used for the identification of mites and for that of all similar animals in which morphometrics are used extensively, such as aphids and whiteflies. We consider the present study to be the first step toward realising mite identification through artificial intelligence. Based on the application potential, more species and individuals that more possibly provides the entire ranges and variations of morphometric features before XGBoost identification can become more reliable in all situations. The practical application of artificial intelligence is expected to overcome all identification problems for IPM in the future.

Methods
Sampling of N. barkeri and related species. Phytoseiid mites inhabit a variety of habitats, such as various plants and soil litters. Individuals were collected from plants and those on substrates and soil litters were isolated using Berlese' funnels and kept in 95% alcohol. Samples were mounted in Hoyer's medium and softened and cleaned with lactic acid if the mite body was hard. In addition, specimens were deposited at several institutes: GIABR (Guangdong Institute of Applied Biological Resources, Guangzhou, Guangdong, China), HUM (Hokkaido University Museum, Sapporo, Japan), NMNS (National Museum of Nature and Science, Tsukuba, Japan), NTU (Department of Entomology, National Taiwan University, Taipei, Taiwan), TARL (Taiwan Acari Research Laboratory, Taichung City, Taiwan). Female phytoseiid mites were collected, including 250 specimens of N. barkeri, and 262 specimens of 35 non-target species belonging to subfamily Amblyseiinae, in 6 tribes, and 11 genera. The following numbers of these non-target species were collected: 4 of N. baraki, 10 of N. longispinosus, 10  Quantitative measurements of phytoseiid mites. Specimens were examined under an Olympus BX51 microscope, and measurements were performed using a stage-calibrated ocular micrometer and ImageJ 1.47 36 . Photos were taken using a Motic Moticam 5+ camera attached to the microscope ( Figure S1). All measurements were recorded in micrometres (μm). The general terminology used for morphological descriptions in this study conformed to that of Chant and McMurtry 20 . The notation for idiosomal setae conformed to that of Lindquist and Evans 37 and Lindquist 38 , as adapted by Rowell et al. 39 and Chant and Yoshida-Shaul 32 . Phytoseiid mites exhibit pronounced sexual dimorphism, and female individuals are more crucial for identification because of their distinguishing features and greater prevalence. In the present study, 22 quantitative measurements were collected from the female specimens: dorsal shield length and width; j1, j3, j4, j6, J5, z2, z4, z5, Z1, Z4, Z5, s4, r3, and R1 setae length; ventrianal shield length and width (at ZV2 level); JV5 length; St IV length; spermatheca calyx length, and spermatheca calyx width (Fig. 1, Table 1).
XGBoost training and computing. We used XGBoost to develop a classification system for target mite species and related species based on their morphological features. Among machine learning methods, XGBoost is the most efficient for implementing the gradient boosting decision tree algorithm from multiple decision trees, which are created successively. For each iteration, a tree enhances its predictive power by minimising the unexplained part of the last tree. First, we determined the number of decision trees through cross-validation. The original sample was randomly partitioned into five equally sized subsamples (Table S1). A single subsample and the other subsamples were retained for use as the validation and training data, respectively. Cross-validation was then performed five times, with each subsample used exactly once as the validation data. The number of decision trees allows the same level of performance to be achieved in training and validation. The number of decision trees was then used for the full dataset to create a final model, and key morphological features were selected for their relative importance. Next, we used ICE plots to indicate the determinative roles of these key features in classification. Plots in which one line represents one specimen indicate changes in predictions (of target species) that occur as a morphological feature change. We generated XGBoost and ICE plots by respectively using the R package "xgboost" 40 and "pdp" 41 .
Drawings. Hand-drawn illustrations (Fig. 1) were made under an optic microscope (Olympus BX51). These drawings were first scanned, then processed and digitized with Photoshop CS6 (Adobe Systems Incorporated, USA).