Introduction

Pest management is a strategy used to control any living organism that poses a risk to our food, fiber, and health security. It has played an essential role in achieving the current food supply, and its role will continue to be critical in any agricultural production system1. Since the beginning of agricultural development, growers have had to compete with harmful insects, collectively called ‘pests’2. These organisms can reduce crop yields and fruit quality, damage plants, serve as disease vectors, and contaminate food crops. Different strategies have been developed to control arthropod pests in agriculture, including chemical, cultural, biological (e.g., plant resistance, natural enemies, etc.), and mechanical methods3. There is much concern about the use of chemicals for pest control due to their cumulative non-sustainable adverse effects on the environment4,5,6, particularly non-target effects on beneficial organisms, including natural enemies (e.g., predators, parasitoids, microorganisms) and pollinators7,8, and the potential for the development of pesticide resistance9,10.

A valuable tool for addressing this threat is integrated pest management (IPM). IPM was developed in the early 1970s as a pest control strategy that promotes sustainable agriculture with a strong ecological basis11. IPM is an approach that incorporates various tactics to control all classes of pests (e.g., insects, pathogens, weeds, vertebrates) to create an ecologically and economically efficient production system11. These tactics include biological control, cultural practices, host-plant resistance, genetic manipulation, and pesticides3,12,13. IPM tactics have been applied in many crops, including sorghum. Sorghum [Sorghum bicolor (L.) Moench] is the fifth most valuable cereal crop globally14. In the U.S., this crop had a value of more than $1 billion and was planted on 5.26 million acres in 201915. Globally, sorghum is used mainly for human consumption and animal feed; in the U.S., it is used primarily as livestock feed and for ethanol production. However, the current production of sorghum faces significant pest management challenges. Since the 2013 outbreak of Melanaphis sacchari (Zehntner) (Hemiptera: Aphididae), commonly known as the sugarcane aphid (SCA), different tactics have been developed to prevent yield losses in sorghum, including scouting protocols, pesticide treatment guides, and host plant resistance programs16.

Proper identification and classification of insect pests at an early stage are important tasks in crop production because pest management strategies (e.g., pesticides and cultural control methods) can be costly and overused when pests are misidentified. However, insect pests are not the only organisms relevant to pest management in agriculture. Sorghum fields also host beneficial insects that need to be identified and classified automatically to better understand the interactions between pests and beneficial insects (e.g., predation) during field scouting. One of the major groups feeding on SCA is the lady beetles (Coleoptera: Coccinellidae). Coccinellids commonly found on sorghum plants include Coccinella septempunctata, Coleomegilla maculata, Cycloneda sanguinea, Harmonia axyridis, Hippodamia convergens, Olla v-nigrum, and members of the subfamily Scymninae17. Technological advances in artificial intelligence and machine learning that allow living organisms to be identified and classified efficiently, with minimal labor and time, are a major focal point of modern precision agriculture research and production.

Machine learning is a sub-field of artificial intelligence, in which labeled data can be used to train a model, and the trained model can be subsequently used to make inferences and predictions on new incoming data without additional programmatic effort18. Convolutional neural networks (CNN)19 represent a type of machine learning model, more specifically, a deep learning model, which can be used to analyze visual imagery. CNNs excel at a variety of computer vision tasks, such as image classification, object detection, and localization. Object detection refers to the task of identifying and classifying instances of objects of interest in images or video frames20. CNN-based approaches for object detection extract features from the input image and use the features to perform two main tasks: 1) detect regions of interest (RoIs) as bounding boxes that contain instances of objects in the image (a.k.a. object identification); and 2) classify RoIs into an arbitrary number of classes (a.k.a. object classification). Depending on how these two tasks are performed, object detection approaches can be classified as one-stage detectors and two-stage detectors21. One-stage detectors perform both tasks simultaneously in one stage, and include models in the YOLO family22,23,24,25,26,27,28, among others. Two-stage detectors identify RoIs in a first stage, and subsequently classify the RoIs and refine their bounding boxes in a second stage. Two-stage detectors include models such as Faster R-CNN29, Cascade R-CNN30 and FPN31. Traditionally, two-stage detectors have been more accurate than one-stage detectors, while one-stage detectors have been faster and more suitable for practical applications that require real-time object detection32. Notably, combinations of the Faster R-CNN29 and FPN31 networks have achieved competitive results on popular benchmark datasets21,33. However, some of the recent YOLO models27,28,34 have produced state-of-the-art performance both in terms of accuracy and speed on benchmark datasets, with YOLOv7 reported to produce the best results as of the end of 202228.

Deep learning software for object detection can be designed in a user-friendly manner and allows for the training of models that can be applied to solve agricultural challenges18. Recent studies using deep learning neural networks for object detection have shown that it is possible to develop models for automated disease identification and insect recognition35,36,37. Some studies have focused on the use of object detection approaches to identify and classify pests based on images of yellow sticky traps and other types of insect traps38,39,40,41,42,43,44,45,46,47. For example, Salamut et al.40 focused on detecting cherry fruit flies based on yellow sticky trap images. Several one-stage and two-stage object detection approaches were compared, including Faster R-CNN and YOLOv526, using a dataset that contains 1,600 annotated images. The best results overall were obtained using a Faster R-CNN model with the lightweight MobileNet48 as the backbone network. Specifically, the Faster R-CNN model had an average precision AP@0.50 of 0.88, as compared to the best YOLOv5 model, which had an AP@0.50 of 0.76. Wang et al.42 published a dataset (called Pest24) of approximately 25,000 pest trap images that contain 24 field pests. They trained several object detection models on this dataset, including Faster R-CNN (with VGG-16 as the backbone network), Cascade R-CNN (with ResNet-50-FPN) and YOLOv324 (whose backbone is called Darknet-53). Experimental results showed that YOLOv3 had the best performance on this dataset, with an overall mean average precision (mAP@0.50) of 59.79%, as compared to an mAP@0.50 of 57.23% for Cascade R-CNN and an mAP@0.50 of 51.10% for Faster R-CNN. Li et al.38 used the Faster R-CNN model pre-trained on the COCO dataset49 to detect small pests (whitefly and thrips) using a dataset of approximately 1,500 sticky trap images and showed that the model transferred from COCO is more accurate than the corresponding model trained directly on pest images.

Wang et al.50 adapted the Faster R-CNN model to make it easier to find small pests in light-trap images. The improved model used the attention mechanism51 to focus on more predictive features, together with a sampling strategy for the region proposal network to address class imbalance, and an adaptive RoI selection to select the best features from different levels of a pyramid network. Experimental results on a dataset (called AgriPest21) of approximately 25,000 images with 21 types of pests showed that the adapted model achieved an mAP of 78.7%, which was significantly better than the mAP of the baseline models included in the comparison study (both one-stage, e.g., SSD52, and two-stage models, e.g., Cascade R-CNN30). Jiao et al.53 also used an adaptive feature fusion pyramid network to identify richer features for pest detection together with the Faster R-CNN network (with ResNet50 as backbone) and obtained a competitive mAP value of 77.4% on the AgriPest21 dataset50. Zhang et al.44 used strategies similar to those in50,53 (i.e., an attention mechanism to obtain better features, and fusing features from a pyramid network) to adapt YOLO models to small pest detection tasks. Experimental results on the Pest24 dataset42 showed that the adapted YOLO model (called AgriPest-YOLO) had better performance than Faster R-CNN, Cascade R-CNN and several YOLOv425 and YOLOv526 variants, producing an overall mAP@0.50 of 71.3% and mAP@0.50:0.05:0.95 of 46.9%.

As opposed to the abovementioned studies that focused on images of trapped pests, other studies have focused on pest detection in the wild54,55,56,57,58,59. Sava et al.60 experimented with Faster R-CNN and YOLO models for detecting the brown marmorated stink bug (i.e., Halyomorpha halys) in tree images. Experimental results on a dataset of images assembled from the Maryland Biodiversity Project61 showed that the YOLOv5m variant produced the best results, with an mAP of 99.2%, as compared to the Faster R-CNN model, which had an mAP of 89.1%. In contrast, Takimoto et al.54 showed that Faster R-CNN was better than YOLOv4 for detecting herbivorous beetles, specifically the striped flea beetle (i.e., Phyllotreta striolata) and the turnip flea beetle (i.e., Phyllotreta atra), in a set of images collected from the web and through fieldwork. Similarly, Ozdemir and Kunduraci57 also found the Faster R-CNN network (with an Inception-v362 backbone) to be better than YOLOv4 when used to detect and classify insects at the order level (using a dataset consisting of 25,820 training images and 1,500 test images). Butera et al.63 also showed that Faster R-CNN (with a MobileNet-v348 backbone) represents an effective model for detecting beetle-type pests (specifically, Popillia japonica) and also for distinguishing them from other types of non-harmful but similar looking beetles (Cetonia aurata and Phyllopertha horticola), giving an overall mAP of 92.66%. The dataset used contained 36,000 images collected from the web and photo sharing sites. Ahmad et al.64 also used the web to assemble a dataset of 7,046 images which contain 23 types of pests. They experimented with a set of YOLO models and showed that YOLOv5-X gave the best results overall, with an mAP@0.50 value of 98.3%, and an mAP@0.50:0.05:0.95 value of 79.8%.

In addition to work on deep learning for automated pest identification, recent studies have also focused on identification of beneficial insects such as pollinators and natural predators65,66,67,68,69, including Coccinellidae beetles70,71. Ratnayake et al.66 used a hybrid approach that combines an object detection model (specifically, YOLOv223) with a background subtraction technique to identify and track honeybees in wildflower clusters. The proposed approach (called HyDaT), which can track one insect at a time, was tested on a dataset consisting of 22,260 video frames (with 17,544 bees visible) and it had a detection rate of 86.6%, as compared to a detection rate of 60.7% for YOLOv2. Ratnayake et al.72 extended the HyDaT approach66 to make it suitable for tracking multiple insects simultaneously. Their proposed approach (called Polytrack) uses YOLOv4 together with both foreground and background segmentation to identify and track honeybees. Experimental results on 39,909 video frames, including 5,291 frames with honeybees, showed that Polytrack achieved values of 0.975 and 0.972 for precision and recall, respectively, being superior to both the HyDaT and YOLOv4 models used by themselves. Bjerge et al.69 assembled a dataset consisting of 29,960 beneficial insects in nine taxa (such as bees, hoverflies, butterflies and beetles) and used the dataset to study the usability of YOLO models to accurately detect and classify such insects. Experimental results showed that the YOLOv5 model had the best performance, with an mAP@0.50:0.05:0.95 of 0.592 and a best F1-score of 0.932. Similarly, Spanier68 assembled a dataset of approximately 17,000 images of pollinator insects of eight types (including bees and wasps, butterflies and moths, beetles, etc.) retrieved from the iNaturalist (inaturalist.org) and Observation.org databases. The best performing model, a variant of YOLOv5, achieved an overall accuracy of 0.9294 and an F1-score of 0.9294. Bjerge et al.59 constructed a dataset of 100,000 annotated images containing small insects. The authors experimented with Faster R-CNN models and YOLOv5 models. To enhance the detection, they proposed a motion-informed enhancement of the images. Experimental results showed that YOLOv5 achieved an mAP@0.50 value of 0.924, while the Faster R-CNN model achieved an mAP@0.50 value of 0.900.

In terms of coccinellid beetle detection, Venegas et al.71 used traditional image processing techniques (based on saliency maps, linear iterative clustering and active contour) to identify RoIs (bounding boxes) that can potentially contain coccinellids, and subsequently used a deep CNN to classify the RoIs as coccinellids or not-coccinellids. The approach was evaluated on a dataset of 2,300 coccinellid beetle images assembled from the iNaturalist project in Ecuador and Colombia. The RoI detection approach had an accuracy of 92%, while the CNN model had an area under the curve (AUC) of 0.977. Similarly, Vega et al.70 used a CNN together with the weighted Hausdorff distance as a loss function to detect beetles in a dataset of 2,633 images similar to the ones used by Venegas et al.71, and reported a mean accuracy of 94.30%. While these works represent important first steps towards automated identification of coccinellid beetles (considered to be natural pest controllers), the use of deep learning object detection models to automatically detect and classify coccinellids found in sorghum remains largely unexplored.

The conventional manual identification of coccinellids requires expert skills and identification keys based on coloration and morphological characteristics. In contrast, existing automated tools based on digital technologies and imagery data do not employ state-of-the-art deep learning architectures and may not be very accurate73. Thus, a vision-based automated system for image processing using deep neural networks needs to be researched for precise classification and identification of coccinellids to advance integrated pest management in sorghum. Towards this goal, we first assembled a dataset consisting of approximately 5,000 images retrieved from iNaturalist. The assembled dataset was used to study automated deep learning approaches for the detection and classification of coccinellids. We trained variants of the popular two-stage Faster R-CNN model, enhanced with FPN, a model referred to as Faster R-CNN-FPN. We also trained variants of the YOLOv5 and YOLOv7 models. We chose to focus on the Faster R-CNN-FPN model, given that this model has shown the best performance in some prior related works40,54,57. As the backbone CNN, we explored ResNet-50 and ResNet-101, given that these networks commonly lead to a good trade-off between accuracy and speed74. Similarly, we selected YOLOv5 as another strong model to experiment with, given its top performance in several prior works59,64,69. Finally, we also chose to include YOLOv7 in our study, as it achieves the best performance on several benchmark datasets28 and has not been explored for insect detection (neither pests nor beneficial insects) in the IPM area. To summarize, our research contributes a dataset and effective deep learning models trained to detect and classify coccinellids, including Faster R-CNN-FPN, YOLOv5 and YOLOv7 models. To the best of our knowledge, this is the first study to explore YOLOv7 for insect detection and classification. Our best models can potentially be installed and used on unmanned vehicles to automate the detection and classification of coccinellids in sorghum fields during field scouting. The models can be further customized to other natural enemies encountered in different crops during automated field scouting.

Methods

Deep learning approaches for object detection

The generic architecture of deep neural networks for object detection consists of two main components: a backbone, which is commonly a pre-trained CNN network used to generate feature maps, and a head, which is used to detect objects as bounding boxes defined by their coordinates (bounding box prediction) and to classify objects into one of several categories of interest25, in our case, different types of coccinellids. One-stage detectors, including the YOLO family of detectors, have a dense prediction head that achieves the object detection and classification tasks simultaneously. Two-stage detectors, including the popular Faster R-CNN detector, decouple the object detection and classification tasks and achieve them in two stages. In the first stage, they use a dense prediction head to generate RoIs that may contain objects. In the second stage, a sparse detection head is used to classify the RoIs according to different object categories and to refine their bounding boxes. In recent years, it has become standard practice to insert a neck in between the backbone and the head of the network, to collect and mix features from different layers. The FPN network31 is one example of a neck that is commonly used in object detection networks. FPN uses a top-down path with lateral connections to extract semantic feature maps at different scales25. The resulting feature maps enable the model to find objects at different scales. Path aggregation network (PANet)75 is another example of a neck used in object detectors. It enhances FPN with a bottom-up path which helps propagate the low-level features. Equipped also with an adaptive feature pooling, PANet has been shown to improve object localization25. The generic architecture of the one-stage and two-stage detectors is shown in Fig. 1. We study the popular Faster R-CNN as a representative two-stage approach and two YOLO variants, YOLOv5 and YOLOv7, on the task of detecting and classifying common coccinellids found in sorghum. All models studied were trained and evaluated using images annotated with the Labelbox tool (https://labelbox.com).

Figure 1

Generic architecture for object detection approaches. A modern object detection network consists of three main components: (1) a backbone network that performs feature extraction for a given input image; (2) a neck that collects and combines features from different layers; and (3) a head which is used to detect and classify objects of interest. One-stage detectors use a dense prediction head to simultaneously address the detection (bounding box regression) and classification tasks, while two-stage detectors decouple the two tasks and use a sparse prediction head to classify previously identified RoIs.
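To make this decomposition concrete, the following minimal PyTorch-style sketch (our own illustration with hypothetical module names, not the actual Detectron2 or YOLO implementation) shows how a backbone, a neck and a head are composed in a generic detector.

```python
import torch
import torch.nn as nn

class GenericDetector(nn.Module):
    """Illustrative backbone -> neck -> head composition (not a production detector)."""

    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # e.g., a pre-trained CNN producing multi-scale feature maps
        self.neck = neck          # e.g., FPN/PAN, which mixes features across scales
        self.head = head          # dense head (one-stage) or RPN + RoI head (two-stage)

    def forward(self, images: torch.Tensor):
        features = self.backbone(images)  # feature maps at several scales
        fused = self.neck(features)       # semantically enriched multi-scale features
        return self.head(fused)           # bounding boxes, objectness and class scores
```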

Faster R-CNN-FPN

Modern Faster R-CNN models use a pre-trained CNN as a backbone for feature extraction combined with an FPN network as a neck to obtain semantic feature maps at different scales. Extracted feature maps are provided as input to a region proposal network (RPN), which can be seen as the dense prediction head of the network. The RPN identifies Regions of Interest or RoIs (i.e., regions that may contain objects of interest - in our case, coccinellids) and their corresponding locations (i.e., rectangular bounding boxes parameterized using the box’s center coordinates, and its height and width). More precisely, the RPN uses a sliding window to generate three anchors with different aspect ratios (1:2, 1:1 and 2:1, respectively) at each grid cell in each input feature map. The anchors are labeled (as object/positive or background/negative) based on their overlap with ground truth bounding boxes and used to train the RPN network to identify RoIs and their locations. Highly overlapping regions, potentially corresponding to the same object, can be filtered using a non-maximum suppression (NMS) threshold. Subsequently, the resulting RoIs together with the feature maps are provided as input to a sparse prediction head which is trained to classify RoIs into several categories of interest (e.g., different coccinellid types) and to refine their locations. All parameters of the network are trained together using a multi-task loss, which combines the cross-entropy classification loss with an L2 regression loss76. We experiment with two CNN networks pre-trained on ImageNet77 as the CNN backbone, specifically, the ResNet-50 and ResNet-101 networks, given that they provide a good trade-off between accuracy and speed74. The FPN produces feature maps at five different scales, and consequently 5 × 3 anchors (three aspect ratios at each of the five scales) are used. Rezatofighi et al.78 suggested that the standard L2 loss used to regress the parameters of the bounding box corresponding to an object is not strongly correlated with the IoU (Intersection over Union) metric generally used to evaluate object detection approaches. Instead of the L2 loss, they proposed to use a loss based on the IoU metric. Specifically, they experimented with an IoU loss and a loss based on a generalized IoU (GIoU), and showed that optimizing the GIoU loss helps improve the performance measured either using the GIoU itself or the standard IoU. Given this result, we experiment with the IoU and GIoU as the regression loss for the bounding box regression task in Faster R-CNN.
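To illustrate the difference between the two regression losses, the sketch below computes the IoU and GIoU of two axis-aligned boxes following the definitions of Rezatofighi et al.78; it is a simplified standalone implementation, not the Detectron2 code used in our experiments.

```python
def iou_and_giou(box_a, box_b):
    """IoU and generalized IoU for boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection and union areas.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Smallest enclosing box, used by the GIoU penalty term.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclose - union) / enclose
    return iou, giou

# The corresponding regression losses are 1 - IoU and 1 - GIoU, respectively.
```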

YOLOv5

As described above, two-stage detectors, such as the Faster R-CNN-FPN network, re-purpose image classification to perform object detection by using the RPN to identify anchors that contain objects of interest as RoIs, and subsequently classifying the RoIs into specific categories. In contrast, one-stage detectors operate directly on the input image to identify bounding box coordinates and class probabilities for objects of interest. YOLOv5 was released in 2020 by a company called Ultralytics26 and has evolved over time. We used the latest YOLOv5 (v6.0/v6.1) architecture79. The backbone for YOLOv5 is New CSP-Darknet53, which combines the original Darknet53 network used in YOLOv324 with the CSPNet network80. Darknet53 was inspired by the ResNet architecture and was specifically designed for object detection. CSPNet addresses the issue of duplicate gradient information in large backbone networks by truncating the gradient flow to speed up computation. The current neck used in the YOLOv5 architecture consists of two components, SPPF and New CSP-PAN. SPPF is a variant of the Spatial Pyramid Pooling (SPP)81 module, which helps identify small objects and also objects at different scales. SPPF was designed to improve the computation speed of SPP. Similar to the Darknet53 backbone, the PAN network (PANet)75 is also combined with CSP to improve computation speed. YOLOv5 uses a dense prediction head which is inherited from YOLOv324.
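As a rough illustration of the SPPF block described above, the following simplified PyTorch module applies a single 5 × 5 max-pool three times in sequence and concatenates the intermediate results; the channel counts are hypothetical, and the real YOLOv5 block additionally wraps its convolutions with batch normalization and SiLU activations.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Simplified SPPF: one 5x5 max-pool applied three times in sequence, then concatenation.

    Sequential pooling reuses intermediate results, which approximates the parallel
    5x5/9x9/13x13 pools of the original SPP at a lower computational cost.
    """

    def __init__(self, channels: int, k: int = 5):
        super().__init__()
        hidden = channels // 2
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.expand = nn.Conv2d(hidden * 4, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.expand(torch.cat([x, y1, y2, y3], dim=1))
```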

In addition to components that improve efficiency, YOLOv5 makes use of a variety of augmentation techniques on the input image. Among others, mosaic augmentation25 is used to stitch together four images with the goal of training the model to find objects in places other than the center of the image, where a large majority of objects are generally located. Furthermore, YOLOv5 uses automatically generated anchors (with different scales and aspect ratios) to predict bounding boxes (and confidence scores) for each cell in a grid directly from the input image. The anchors are generated using k-means clustering based on the bounding boxes in the training set23 and a genetic evolution algorithm that optimizes the initial k-means centroids based on the complete IoU (CIoU) loss82. The CIoU loss aggregates the overlap area, distance between center points, and aspect ratio consistency of two bounding boxes. The YOLOv5 head has 3 detection layers corresponding to three different scales and predicts bounding boxes with 3 different aspect ratios for each scale, resulting in a total of 9 anchors. The bounding boxes are predicted as deviations from the anchor dimensions. As in Faster R-CNN-FPN, the NMS technique is used to filter bounding boxes representing the same object. The whole network is trained using a multi-task loss, which combines classification loss (binary cross-entropy), objectness loss (binary cross-entropy) and location loss (CIoU). YOLOv5 uses an exponential moving average (EMA) of the model checkpoints as the final detector. YOLOv5 itself represents a series of object detection models (compound-scaled variants of the same architecture) that have been pre-trained on the MS COCO dataset49. Models in the YOLOv5 series have different sizes as applications have different needs in terms of the trade-off between accuracy and speed. In this study, we experiment with five YOLOv5 variants that vary in size from nano (YOLOv5n) to small (YOLOv5s), medium (YOLOv5m), large (YOLOv5l) and extra-large (YOLOv5x), whose specific architectures are available from the official GitHub repository26 as .yaml files in the models directory.
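The sketch below illustrates the k-means step of this anchor generation process on the (width, height) pairs of the training boxes; it is a simplified stand-in for YOLOv5's autoanchor routine, which additionally uses a shape-matching fitness metric and the genetic evolution step mentioned above.

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster (width, height) pairs of training boxes into k anchor shapes.

    wh: array of shape (N, 2) with the widths and heights of the training bounding boxes.
    Returns k anchors sorted by area (YOLOv5 assigns three per detection scale).
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each box to its nearest anchor; the real autoanchor routine uses a
        # shape-IoU-based metric here instead of the Euclidean distance.
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]
```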

YOLOv7

The YOLOv7 architecture has been designed based on the “bag-of-freebies” idea introduced by Bochkovskiy et al.25, which reflects the observation that while it is important for a detector to be fast at inference time, training can be more expensive if it helps to improve the overall accuracy of the model (this is acceptable because training is done offline). With this idea in mind, YOLOv7 introduced several innovations in network architecture and training strategies. One important innovation, the main component of YOLOv7’s architecture (used both in the backbone and neck/head networks), is a block called extended efficient layer aggregation network (E-ELAN). ELAN83 uses a “stack in computational block” structure combined with CSP to optimize the shortest gradient path and ensure that scaling up the network does not result in deterioration of performance. In addition, E-ELAN uses “expand, shuffle, merge cardinality” operations, which enable the model to learn more diverse features.

The neck of the network is structured based on a PAFPN network (a combination of PAN and FPN) which uses E-ELAN for feature extraction and fusion. In addition to the “lead head”, YOLOv7 introduces an “auxiliary head”, placed in the middle of the network, meant to assist the “lead head” (which may be too far down the network). Soft labels are assigned to the auxiliary and lead heads in a coarse-to-fine manner based on the predictions of the lead head and the ground truth. The coarse soft labels used by the auxiliary head represent a relaxed version of the fine soft labels used by the lead head, as it is expected that the auxiliary head is less precise. In terms of anchors, YOLOv7 leverages the automated anchor selection approach proposed in YOLOv526 and uses 3 aspect ratios for each of the 3 feature maps representing three different scales (for a total of 9 anchors). Furthermore, data augmentation techniques similar to those used in YOLOv5 are also used in YOLOv7 (including mosaic augmentation), and the commonly-used NMS technique is employed to filter out the predicted bounding boxes. To achieve robustness through module-level re-parameterization (i.e., aggregating the weights of a multi-branch module during inference), YOLOv7 uses gradient flow propagation paths to “plan” which modules can benefit from re-parameterization.
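Since all of the detectors in this study rely on NMS to discard near-duplicate predictions, a minimal framework-agnostic sketch of greedy NMS is shown below; the actual implementations in the Detectron2 and YOLO code bases are vectorized and run on the GPU, and the threshold value used here is illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression; returns the indices of the boxes that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box above the threshold.
        order = [i for i in order if box_iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```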

YOLOv7 is trained from scratch on the COCO dataset using a multi-task learning loss consisting of the classification loss (binary cross-entropy), objectness loss (binary cross-entropy) and location loss (CIoU). If both the lead and the auxiliary heads are used, the loss includes similar components corresponding to the two heads, with different weights. The final model used for inference is based on an EMA of the model parameters at different checkpoints during training. Together, the innovations introduced in YOLOv7 and the components reused from prior works have led to state-of-the-art results on standard benchmark datasets for object detection28, and at the same time, smaller inference time as compared with other YOLO models. YOLOv7 also consists of a series of pre-trained models of various sizes. In this study, we experiment with four YOLOv7 variants: 1) the standard YOLOv7 model designed for standard GPUs; 2) its compound scaling variant, YOLOv7-x; 3) the smallest model, YOLOv7-tiny, designed for edge GPUs; and 4) a larger model, YOLOv7-d6, designed for cloud GPUs.

Dataset

Coccinellid imagery was downloaded from the iNaturalist web portal (inaturalist.org). iNaturalist is a citizen science project that allows naturalists to upload and share observations (i.e., images) of biodiversity worldwide through a web platform and mobile app for free. Submissions by observers include the actual images, their locations, observation times, and group identifications. Agreement on the taxa in an observation results in a “research-grade” label being assigned to the observation. iNaturalist makes an archive of research-grade observation data available to the environmental science community via the Global Biodiversity Information Facility (GBIF)84. We used GBIF to assemble a dataset for training and testing deep learning models for the detection, localization and classification of coccinellids. Only research-grade labels at family, genus, and species level were considered in the dataset that we assembled. The dataset includes seven distinct categories of coccinellids corresponding to the most important coccinellids found on sorghum plants, specifically: Coccinella septempunctata, Coleomegilla maculata, Cycloneda sanguinea, Harmonia axyridis, Hippodamia convergens, Olla v-nigrum and the subfamily Scymninae17. Three sample images in each of these seven categories are shown in Fig. 2 (where each row corresponds to one coccinellid type).
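As an illustration of how such imagery can be retrieved programmatically, the sketch below queries the public GBIF occurrence search API for records with attached photos; the species name, paging and filtering choices are illustrative, and this is not the exact retrieval script used to assemble our dataset.

```python
import requests

GBIF_API = "https://api.gbif.org/v1/occurrence/search"

def image_urls_for_species(name: str, limit: int = 100) -> list:
    """Return image URLs of GBIF occurrences of a species that have photos attached."""
    params = {"scientificName": name, "mediaType": "StillImage", "limit": limit}
    response = requests.get(GBIF_API, params=params, timeout=30)
    response.raise_for_status()
    urls = []
    for occurrence in response.json().get("results", []):
        # Each occurrence carries a 'media' list whose entries include an 'identifier' URL.
        for media in occurrence.get("media", []):
            if "identifier" in media:
                urls.append(media["identifier"])
    return urls

# Hypothetical usage: collect candidate image URLs for one of the seven categories.
# urls = image_urls_for_species("Coccinella septempunctata")
```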

Figure 2

Samples of 3 coccinellid images in each of the 7 categories included in the study.

We aimed to select approximately 700 images per category, and assembled a dataset with a total number of 4,865 images. Each image contains one or more instances of coccinellids of the same type (i.e., corresponding to a particular category). The dataset was split into train (3,053 images), development or dev (1,113 images) and test (699 images) subsets. Table 1 shows the distribution of the images/coccinellid instances over the seven categories in each of the train/dev/test subsets and also in the whole dataset. Supplementary Table S1 shows the distribution of the images with respect to the number of instances per image (1, 2, 3, 4, 5, 6, 7, or 8) for each category and for each of the train/dev/test subsets. As can be seen, most images have only one or two coccinellid instances, although there are some images that have up to 8 coccinellid instances.

Table 1 Dataset statistics. Distribution of the images and coccinellid instances over the seven categories in each of the train/dev/test subsets and also in the whole dataset.

We also classified the instances in our dataset according to their size, as this information is frequently used when evaluating object detection approaches85. Specifically, instances are classified based on the area that they occupy in an image as small (\(area \le 32^2\)), medium (\(32^2 < area\le 96^2\)) or large (\(area>96^2\))85. Supplementary Table S2 shows the distribution of the small, medium and large instances in the train/dev/test subsets and in the whole dataset. As can be seen, the number of small instances is just 5 in the total dataset, and they are all included in the training subset. The number of medium instances is 142, with 18 of those instances being in the test subset. The remaining 4,995 instances are large and represent the majority of our dataset. Given this observation, our evaluation metrics will be generic (representing mostly the large category) as opposed to being specifically focused on the small, medium and large categories, respectively.
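A minimal sketch of this size bucketing is shown below, with the bounding-box area serving as a proxy for the instance area.

```python
def size_category(width: float, height: float) -> str:
    """COCO-style size buckets based on the bounding-box area (in pixels)."""
    area = width * height
    if area <= 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"

# Example: a 150 x 120 pixel coccinellid bounding box falls into the "large" bucket.
assert size_category(150, 120) == "large"
```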

Implementation details

We trained and evaluated Faster R-CNN-FPN, YOLOv5 and YOLOv7 models. The data flow diagram for our whole process is depicted in Fig. 3. Training images are used to train the model, while the development images are used to evaluate and select hyperparameters. The performance of the final models is estimated on the test data.

Figure 3

Data flow diagram. Coccinellid images are collected from iNaturalist and labeled using Labelbox. The set of images is split into train, dev and test subsets. YOLO and Faster R-CNN-FPN models are trained and tuned on the train and dev subsets, respectively. The performance of each model is estimated on the test subset in terms of average precision (AP) metrics.

Faster R-CNN-FPN

We used the Detectron2 library for object detection and segmentation85 (developed by Facebook Research) to train and evaluate our Faster R-CNN-FPN models for coccinellid detection. Detectron2 uses PyTorch as a deep learning framework and is the successor of Detectron86, originally “designed to be flexible in order to support rapid implementation and evaluation of novel research.” The default configuration of the Base-R-CNN-FPN is defined in the “Base-RCNN-FPN.yaml” file at https://github.com/facebookresearch/detectron2/blob/main/configs/Base-RCNN-FPN.yaml. Hyper-parameters for Detectron2 are available at: https://github.com/facebookresearch/detectron2/blob/main/detectron2/config/defaults.py. The train/test files that we adapted are available at https://github.com/cwang16/Detecting-Coccinellids. When training the models, we used the default values for most of the hyper-parameters of Detectron2’s Faster R-CNN-FPN object detection models and experimented with ResNet50/ResNet101 as backbone CNNs and IoU/GIoU as the regression losses. In terms of the number of iterations, we started by training the Faster R-CNN-FPN model with the default 40,000 iterations. However, the learning curves shown in Supplementary Fig. S1 suggested that both training and development losses were still decreasing, while the AP was still increasing, after 40,000 iterations. Thus, to enable the network to learn further and give better performance, we ran all experiments for 400,000 iterations, and identified the best iteration for each model based on learning curves as shown in Supplementary Fig. S2. Specifically, the best iteration is identified as the point where the validation loss starts increasing while the training loss is still decreasing. Using this criterion, for Faster R-CNN-FPN with ResNet-50 the best validation/training point was observed around iteration 80,000 for both IoU and GIoU losses, while for Faster R-CNN-FPN with ResNet-101 the best point was observed around iteration 220,000 for both losses.
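For illustration, the sketch below shows how a Detectron2 Faster R-CNN-FPN configuration of this kind can be set up; the dataset names, annotation paths, output directory and the choice of model-zoo config file are placeholders, and the exact configurations we used are available in our GitHub repository.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Hypothetical dataset registration; the COCO-format annotation paths are placeholders.
register_coco_instances("coccinellids_train", {}, "annotations/train.json", "images/train")
register_coco_instances("coccinellids_dev", {}, "annotations/dev.json", "images/dev")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("coccinellids_train",)
cfg.DATASETS.TEST = ("coccinellids_dev",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 7                 # seven coccinellid categories
cfg.MODEL.ROI_BOX_HEAD.BBOX_REG_LOSS_TYPE = "giou"  # GIoU regression variant; RPN loss can be set analogously
cfg.SOLVER.MAX_ITER = 400000                        # as in our experiments
cfg.OUTPUT_DIR = "./output_faster_rcnn_r50_giou"

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```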

YOLOv5

We used the official PyTorch YOLOv5 implementation provided by Ultralytics26, which is available at https://github.com/ultralytics/yolov5, to train and evaluate the YOLOv5 models. We used the pre-defined model configurations for the YOLOv5 variants used in the experiments, where the number of classes was changed to 7. The pre-defined configurations are available at https://github.com/ultralytics/yolov5/tree/master/models, where the files “yolov5n.yaml”, “yolov5s.yaml”, “yolov5m.yaml”, “yolov5l.yaml”, and “yolov5x.yaml” correspond to the five YOLOv5 variants used in our study, respectively. In terms of hyper-parameters, YOLOv5 provides three different hyper-parameter settings, specifically, “hyp.scratch-low.yaml”, “hyp.scratch-med.yaml”, and “hyp.scratch-high.yaml”, to train smaller, medium and larger size models, respectively. These three hyper-parameter settings can be found at https://github.com/ultralytics/yolov5/tree/master/data/hyps. When training the models for YOLOv5, we used the default values for most of the hyperparameters, except for using 500 epochs and a batch size of 4. The best model identified by the YOLOv5 framework using the validation data was used for evaluation. As an example, the specific command line used to train the YOLOv5s model is shown below:

figure a

where the modified files and non-default hyper-parameters are highlighted in red font. Similarly, the command line used to evaluate the YOLOv5s model is:

figure b

YOLOv7

We also used the official YOLOv7 PyTorch implementation28 at https://github.com/WongKinYiu/yolov7 to train and evaluate the YOLOv7 models. As for YOLOv5, we used the pre-defined model configurations for the YOLOv7 variants used in the experiments, where the number of classes was changed to 7. The pre-defined configurations are available at https://github.com/WongKinYiu/yolov7/tree/main/cfg/training, where the files “yolov7.yaml”, “yolov7-tiny.yaml”, “yolov7x.yaml”, and “yolov7-d6.yaml” correspond to the four YOLOv7 variants used in our study, respectively. In terms of hyper-parameters, YOLOv7 provides three different hyper-parameter settings for: 1) edge GPU architectures (YOLOv7-tiny); 2) standard GPU architectures (YOLOv7 and YOLOv7-x); and 3) cloud GPU architectures (including YOLOv7-d6). These three hyper-parameter settings can be found at https://github.com/WongKinYiu/yolov7/tree/main/data, where “hyp.scratch.tiny.yaml” is the setting for YOLOv7-tiny, “hyp.scratch.p5.yaml” is the setting for YOLOv7 and YOLOv7-x, and “hyp.scratch.p6.yaml” is the setting for YOLOv7-d6.

When training the models for YOLOv7, we used the default values for most of the hyperparameters, except for using 500 epochs and a batch size of 2. The best model identified by the YOLOv7 framework using the validation data was used for evaluation. As an example, the specific command line used to train the YOLOv7 model is shown below:

figure c

where the modified files and non-default hyper-parameters are highlighted in red font. Similarly, the command line used to evaluate the YOLOv7 model is:

figure d

To ensure reproducibility, we make available our train/test/dev data (in the form of iNaturalist image IDs), annotations, model configurations, and best trained models at https://github.com/cwang16/Detecting-Coccinellids.

All the models were trained on Amazon Web Services (AWS) p2.xlarge instances. According to AWS, the configuration of the p2.xlarge instance is as follows: 1 GPU, 4 vCPUs, 61 GiB of memory, and high network bandwidth. Training of the models for the specified number of iterations/epochs took between 8 and 14 days.

Evaluation metrics

We used three standard average precision metrics87 to evaluate the results of our models. The three metrics are defined using the Intersection-over-Union (IoU) measure, which captures the overlap between the predicted bounding box of an instance and the ground truth bounding box of that instance. Specifically, the IoU is defined as the area of overlap (i.e., intersection) divided by the area of the union. The metrics used to evaluate the ability of the models to correctly identify the type of coccinellid are: 1) average precision at \(IoU=0.50\), denoted by AP@0.50; 2) average precision at \(IoU=0.75\), denoted by AP@0.75; and 3) average precision at \(IoU=0.50:0.05:0.95\), which represents the average precision across ten IoU thresholds varying from 0.50 to 0.95 with a step size of 0.05, denoted by AP@0.50:0.05:0.95 or simply AP. AP@n considers a prediction to be correct if the IoU between the detected instance and the ground truth instance annotation is greater than or equal to n. For example, AP@0.50 considers a prediction to be correct if the corresponding IoU is greater than or equal to 0.50.
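To make the AP@n criterion concrete, the sketch below labels each predicted box (assumed sorted by decreasing confidence) as a true or false positive at a single IoU threshold by greedily matching it to the best unmatched ground-truth box; the full AP computation (precision-recall integration over thresholds and categories) is handled by the standard COCO-style evaluation code of each framework87.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) (same helper as in the NMS sketch)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def label_predictions(pred_boxes, gt_boxes, iou_threshold=0.50):
    """Mark each prediction (sorted by decreasing confidence) as 'TP' or 'FP' at one IoU threshold."""
    matched = set()
    labels = []
    for pred in pred_boxes:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            if idx in matched:
                continue
            overlap = box_iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            labels.append("TP")
        else:
            labels.append("FP")
    # Ground-truth boxes left unmatched count as false negatives.
    return labels
```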

In addition to comparing the models in terms of average precision metrics, we also compared the models in terms of the number of layers in the specific model architecture, the number of parameters of the network, the size of the model (MB) and the inference time per image (ms). These characteristics can be used to identify small models that are accurate and fast and can be embedded in mobile devices for automated field scouting.

Ethics statement

The results presented are based solely on experiments with image data. The experiments do not involve live vertebrates and/or higher invertebrates. All experiments were carried out in accordance with relevant guidelines and regulations.

Results and discussion

The results of the models used in this study, Faster R-CNN-FPN, YOLOv5 and YOLOv7, on the whole test subset are shown in Table 2. Specifically, for each family of models, we report AP, AP@0.50 and AP@0.75 for the model variants considered. The best result in terms of AP (with values averaged over 10 IoU thresholds from 0.50 to 0.95 with a step of 0.05) is 74.605 and is obtained with the YOLOv7 model. The best AP value for a YOLOv5 model is 73.3 and is obtained with the YOLOv5x variant, while the best value for a Faster R-CNN-FPN model is 65.6 and is obtained with Faster-R101-GIoU. These are all strong results given that the best AP value for the popular object detection benchmark Microsoft COCO dataset49 (which considers AP to be its primary metric) is currently 65.4 (as of February 4th, 2023)88. In terms of AP@0.50 (the primary metric for the PASCAL VOC benchmark dataset89), the best value is 97.3 and is also obtained using the YOLOv7 model, while the best AP@0.50 value obtained with a YOLOv5 variant (specifically, YOLOv5m) is 96.0, and the best value obtained with a Faster R-CNN-FPN variant (Faster-R50-GIoU) is 94.3. Finally, when using the stricter AP@0.75, which requires an IoU overlap of at least 0.75 when comparing the predicted bounding box with the ground truth bounding box of an object instance, the best result is 82.6 and is obtained with the YOLOv7 model as well, with the YOLOv5x variant following closely. These results are very competitive in relation to the numbers reported in the literature for similar problems and show that the models considered in this study, and in particular the YOLOv7 models, have the ability to detect and localize coccinellids in real world images posted on iNaturalist. Overall, on our test data, the standard YOLOv7 model has the best performance among the models in the YOLOv7 family, followed closely by the YOLOv5x model in the YOLOv5 family. Both these models are significantly better than the best Faster R-CNN-FPN model (specifically, Faster-R101-GIoU).

To evaluate the performance of the models in relation to their size and inference time, Table 2 also shows the number of layers, number of parameters, inference time (ms) and size (MB) for each model. While the Faster R-CNN-FPN models have a moderate number of parameters (42 and 60 million with ResNet50 and ResNet101, respectively) as compared to the other models, they have relatively large sizes (165.8 MB and 242.1 MB, for ResNet50 and ResNet101, respectively) and high inference time (approximately 130 to 140 ms per image). Thus, despite their large sizes and high inference times, the two-stage Faster R-CNN-FPN models are not as accurate as the YOLOv5 and YOLOv7 models on the coccinellid images. The YOLOv5 models have the smallest size and inference time overall, but it can be seen that the performance increases with the size of the model, with YOLOv5x (the largest model) having the best performance at an inference time of 28 ms per image. The size of the YOLOv7 variants is overall larger than the size of the YOLOv5 variants. However, the best YOLOv7 model, the standard YOLOv7, has a size of 74.8 MB and an inference time of 19.2 ms per image, which is lower than the inference time of the best YOLOv5x model. It is also worth noting that the standard YOLOv7 model has a large number of layers (specifically, 314 layers) and accordingly a large number of parameters (comparable to the number of parameters in Faster R-CNN-FPN with ResNet50), but it still has a very fast inference time, evidence that the “bag-of-freebies” idea used in YOLOv7 gives the intended results.

Table 2 Faster R-CNN-FPN, YOLOv5 and YOLOv7 results. The networks are evaluated on the test subset using the best model according to the development subset. The results are reported in terms of AP, AP@0.50, AP@0.75. The Faster R-CNN-FPN (Faster) models are using ResNet-50 (R50) and ResNet-101 (R101) as base models, with IoU and GIoU as the loss, respectively. The best results for models within a family are highlighted with bold, while the best result overall is also shown in italic. For each model, the number of layers, number of parameters, inference time and size are also shown.

To gain insights into how the models perform for each type of coccinellid in our dataset, Table 3 shows results in terms of AP (averaged over 10 IoU thresholds) for each of the seven types of coccinellids included in the dataset. As can be seen, the standard YOLOv7 model gives the best results for four coccinellid types, specifically Coccinella septempunctata, Coleomegilla maculata, Cycloneda sanguinea and Hippodamia convergens, and highly competitive results for the other three types. The YOLOv5x model gives the best results for Harmonia axyridis, Olla v-nigrum and the subfamily Scymninae. Thus, based on these results, we can also conclude that the standard YOLOv7 model is a good choice for identifying coccinellids of the types included in our study, closely followed by the YOLOv5x model. The YOLOv7 model performs best on coccinellids of type Coccinella septempunctata with \(AP=80.4\), followed by Coleomegilla maculata, Harmonia axyridis and Olla v-nigrum with \(AP=76.4\), \(AP=76.2\), and \(AP=75.5\), respectively. The AP of the model is lower for Cycloneda sanguinea, the subfamily Scymninae and Hippodamia convergens, with AP values of 72.2, 71.0 and 70.5, respectively.

 

Hyper-parameter settings for the best model in each family (Faster R-CNN-FPN with ResNet-101 and GIoU loss, YOLOv7, and YOLOv5x):

Hyper-parameter              Faster R-CNN-FPN (R101, GIoU)    YOLOv7                       YOLOv5x
Max iterations               400,000                          400,000                      400,000
Learning rate                0.0002                           0.01                         0.01
Momentum                     0.9                              0.937                        0.937
Weight decay                 0.0001                           0.0005                       0.0005
Warmup epochs                1.0                              3.0                          3.0
Warmup momentum              unused                           0.8                          0.8
Warmup bias learning rate    unused                           0.1                          0.1
IoU training threshold       0.3                              0.2                          0.2
Anchors (P3/8)               unused                           12,16, 19,36, 40,28          10,13, 16,30, 33,23
Anchors (P4/16)              unused                           36,75, 76,55, 72,146         30,61, 156,198, 373,326
Anchors (P5/32)              unused                           142,110, 192,243, 459,401    116,90, 156,198, 373,326

A confusion matrix constructed by comparing the type predicted by the YOLOv7 model with the manually annotated type is shown in Fig. 4. Generally, the type predicted is correct, with accuracy higher than 92% in all cases. As expected, some detected instances have the type incorrectly predicted. Furthermore, some instances are missed (background FN) or coccinellid instances are detected where there is no coccinellid in the input image (background FP). Overall, the model makes a small number of false negative mistakes (i.e., it rarely misses a coccinellid), while the largest number of FP mistakes is in the Scymninae subfamily. As can be seen from the confusion matrix, the Scymninae subfamily also has the highest percentage of misclassified instances overall (8%), while the Coccinella septempunctata type has the highest detection accuracy (98%).

Figure 5 shows examples of correct predictions made for different coccinellid types appearing in somewhat challenging settings. For example, in images (a), (c), (g) and (h), the coccinellids are very small, and sometimes hard to see, and they appear in various environments. However, the model correctly identifies them. In images (b), (d), (f) and (i), there are multiple coccinellids in one image, but the model correctly identifies all of them and their right type, even when they are overlapping. In image (e), there is a coccinellid inside the stem. Supplementary Fig. S3 shows similar examples of correct predictions made by the Faster R-CNN-FPN model with ResNet-101 backbone and GIoU loss.

Table 3 Faster R-CNN-FPN, YOLOv5 and YOLOv7 results for each coccinellids type in the test subset. Results are reported in terms of AP obtained with the best model according to the development subset. The Faster R-CNN-FPN (Faster) models are using ResNet-50 (R50) and ResNet-101 (R101) as base models, with IoU and GIoU as the loss, respectively. The best results for a coccinellid type using models within a family are highlighted with bold, while the best result overall is also shown in italics.
Figure 4

Confusion matrix constructed by comparing the type predicted by YOLOv7 with the manually annotated type. The label background FN corresponds to coccinellid instances that were not identified by the model at all, while background FP corresponds to predictions of coccinellids where there is no coccinellid in the original input image.

Figure 5

Examples of accurate YOLOv7 predictions on different coccinellid types.

Figure 6

Examples of images where the YOLOv7 model correctly identifies a coccinellid, but the type identified is different from the manually annotated type. Each row shows a ground truth type, specifically, Scymninae, Harmonia axyridis, and Hippodamia convergens. The predicted label for each image is shown below the image.

Figure 7

Examples of YOLOv7 errors. Images (a), (b) and (c) show cases where YOLOv7 is mistakenly detecting an object as a coccinellid. Images (d), (e) and (f) show cases where YOLOv7 fails to detect a coccinellid. Images (g), (h) and (i) show cases when YOLOv7 detects two coccinellids instead of one, or alternatively one coccinellid instead of two.

Error analysis

While the YOLOv7 models work very well, they do make two types of errors: detection errors and localization errors. The detection errors can be grouped into several classes: (1) Errors where a coccinellid is identified but the detected type is wrong, as shown in Fig. 6 (similar errors can be seen for the best Faster R-CNN-FPN model in Supplementary Fig. S4); (2) Errors where a different object in a picture is identified as a coccinellid, as shown in Fig. 7a–c; (3) Errors where a coccinellid should be identified but the model completely misses it, as shown in Fig. 7d–f; and (4) Errors where a coccinellid has its wings spread out and the model detects two coccinellids instead of one, and conversely, errors where two coccinellids overlap in the image and the model fails to detect two instances, as shown in Fig. 7g–i. It is interesting to see that the YOLOv7 model can identify some coccinellids that overlap, as shown in Fig. 5b, d and i, but it fails to detect others, as shown in Fig. 7h. When comparing YOLOv7 predictions with Faster R-CNN-FPN predictions, we found that the YOLOv7 model can “fix” some of the errors made by the Faster R-CNN-FPN model, as can be seen in Supplementary Fig. S5.

Figure 8

Localization differences. Examples of localization differences between YOLOv7 predicted bounding boxes (first column), Faster R-CNN predicted bounding boxes (second column) and manually annotated bounding boxes (third column).

In addition to detection errors, the models also make localization errors. In particular, a localization error happens when the IoU overlap between the predicted bounding box and the manually annotated bounding box does not satisfy the desired threshold (e.g., 50% or 75%). However, to some extent, differences in bounding boxes could stem from differences in the way the manual annotation is performed, as humans are prone to errors and inconsistencies when performing annotations, as Fig. 8 shows. Our models learn from the human annotations and will also produce bounding boxes that enclose the object of interest more closely or more loosely, thus leading to differences in terms of object localization.

YOLOv7 model as a tool for coccinellid detection

To enable the use of our best model by the research community, we make available the pre-trained YOLOv7 model as a web-based application at https://coccinellids.cs.ksu.edu, shown in Supplementary Fig. S6. The web-based application is user-friendly and allows the user to explore the model predictions using sample images available on the front page. Alternatively, a new image can be uploaded to the site and submitted for analysis. Underneath, the built-in model is used to detect and classify coccinellids in the image, and the results are displayed on the site using annotated bounding boxes. It takes approximately 4-5 seconds for a prediction to be made, given the interaction with the server. In addition to the web-based application, which does not require any programming skills, we will also make the pre-trained models and source code available on GitHub for more experienced users who may need to further train and adapt our models to other species or datasets.
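For users who prefer to run a model locally instead of through the web application, a trained checkpoint can be loaded for inference with a few lines of code. The sketch below uses the torch.hub interface of the Ultralytics YOLOv5 repository with a hypothetical weight-file path; the YOLOv7 repository offers an analogous detect.py script for its checkpoints.

```python
import torch

# Load a locally trained YOLOv5 checkpoint via torch.hub (the weight path is a placeholder).
model = torch.hub.load("ultralytics/yolov5", "custom", path="weights/coccinellids_yolov5x.pt")

# Run inference on a field image and inspect the detected coccinellids.
results = model("sorghum_leaf.jpg")
results.print()                         # per-class summary of the detections
detections = results.pandas().xyxy[0]   # bounding boxes, confidences and class names
print(detections)
```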

Limitations of the study

We have shown that deep learning models can be used to identify seven types of common coccinellids in sorghum. While the methodology proposed in this study can provide tremendous benefits to IPM in sorghum, we would also like to point out its limitations:

  • As emphasized by our error analysis, some coccinellids are not identified, especially when they cover most of the area of the image, and sometimes when two coccinellids overlap in an image. More images of those types need to be included in the dataset to improve the performance of the models on such images.

  • Similarly, the models can make mistakes in terms of the type of the coccinellid identified and also in terms of the precise location (bounding box of the coccinellid).

  • While our dataset covers seven types of coccinellids, and includes images with several coccinellids, all coccinellids in an image are of the same type. It is not clear how the models would perform on images with coccinellids of different types.

  • Our dataset includes only a small set of images with coccinellids that are considered “small” objects based on their size (see Supplementary Table S2). More data is needed to estimate the performance of the models in detecting “small” coccinellids, which have a high chance to appear in images taken with an autonomous device in the wild.

  • While a big variety of images were used to train and test our models, the images may not be fully representative of images that would potentially be taken with an autonomous device such as a drone. Additional data captured with such devices can be added to the training set to improve the robustness of the models in such scenarios.

Conclusions

In this study, we focus on the application of high-throughput deep learning models to detect and classify coccinellids that are commonly found in sorghum. Specifically, we compared two-stage (Faster R-CNN-FPN) and one-stage detectors (YOLOv5 and YOLOv7) on the task of detecting and classifying seven types of coccinellids, including Coccinella septempunctata, Coleomegilla maculata, Cycloneda sanguinea, Harmonia axyridis, Hippodamia convergens, Olla v-nigrum, and the subfamily Scymninae. To do this, we first assembled and annotated a dataset of 4,865 images based on the iNaturalist imagery web server, which publishes citizens’ observations of living organisms. Images in the dataset contain between one and eight instances of coccinellids, manually annotated with bounding boxes using the Labelbox tool. Using the assembled dataset, split into training, development and test subsets, we experimented with several variants of the Faster R-CNN-FPN network where the base CNN was either ResNet-50 or ResNet-101, and the loss optimized was either the standard IoU loss or the generalized IoU (GIoU) loss. We also experimented with several variants of the YOLOv5 and YOLOv7 models. Experimental results showed that the standard YOLOv7 model gives the best results overall for our test data, with AP@0.50 as high as 97.31 and AP as high as 74.5 for this specific variant. These competitive results in relation to results for other similar problems in prior works suggest that our models have the potential to make the task of detecting natural enemies in sorghum easier, if integrated in systems for automated pest management. To enable the community to make use of our models, we have made the models available as part of a web-based application that allows end-users to identify coccinellids in their own images.