Deep learning and wing interferential patterns identify Anopheles species and discriminate amongst Gambiae complex species

We present a new identification method based on deep learning of the wing interferential patterns carried by mosquitoes of the Anopheles genus to classify and assign 20 Anopheles species, including 13 malaria vectors. We provide additional evidence that this approach can identify Anopheles spp. with an accuracy of up to 100% for ten out of 20 species. However, accuracy was moderate (> 65%) for three species and weak (50%) for seven. The accuracy of the process in discriminating cryptic or sibling species is also assessed on three species belonging to the Gambiae complex. Strikingly, An. gambiae, An. arabiensis and An. coluzzii, morphologically indistinguishable species belonging to the Gambiae complex, were distinguished with 100%, 100%, and 88% accuracy, respectively. This tool could therefore support entomological surveys of malaria vectors and the implementation of vector control. In the future, we anticipate that our method can be applied to vectors of other arthropod-borne diseases.


Material and methods
Anopheles collection and storage. The first WIPs reference collection of Culicidae gathers samples belonging to the Anopheles genus from well-established laboratory colonies of An. gambiae, An. coluzzii, An. arabiensis and An. stephensi (MIVEGEC, IRD Montpellier, France and IRRS Bobo Dioulasso, Burkina Faso). Specimens were also selected from the ARIM collection (https://arim.ird.fr/) of IRD (Institut de Recherche pour le Développement). In addition, specimens collected in natura were included in the database; they were identified at the time of trapping with the available regional morphological identification keys, and their identification was confirmed before their entry into the ARIM collection. The samples used in this study are described in Table 1.

Image acquisition and database construction.
The same standard operating procedures (SOP) described for capturing WIPs of Glossina were also used for Anopheles 27. This process is easy to handle and inexpensive. It consists of dissecting the wings and mounting them on a glass slide. A cover slip was then placed on top, and annotated specimens were photographed using the xVH-Z20r camera with the VH K20 adapter (Keyence™) set to 10° of illumination incidence. The High Dynamic Range (HDR) function was used for all pictures. All pictures were scaled to a common size so that wing size could not act as a discriminating criterion for species identification by deep learning approaches. For each specimen, the geographical origin, sampling date, sex, the identity of field-caught species, and the entomologist who identified them at the sampling location were recorded. The numerical parameters of the camera were as follows: white balance 3200 K, shutter speed 1/15 s, gain 0 dB, frame rate 15 frames/s, brightness 15%, texture 15%, contrast 45%, color 100%. The luminosity, contrast, shadow, reflection, and saturation were set to 80, 100, 0, 0, and 100%, respectively, using Windows 7 Home edition. Dust was removed manually from all pictures before they were entered into the database.
Collected dataset, image pre-processing, and dataset splitting for training/learning and validation. The annotated image dataset, including 843 pictures of 42 Anopheles species belonging to 3 subgenera (Anopheles, Cellia, Nyssorhynchus), was prepared for training the Anopheles classifier. For training purposes, the sample sex, geographic origin (population), age, and physiological state (blood-fed or not) were not considered, in order to obtain a general classifier model. In addition, 4688 pictures of WIPs from other Diptera families, including Glossinidae, Psychodidae, Culicidae, and others, were added. Under-sampled Anopheles species (fewer than ten samples/pictures) and An. multicolor were discarded from the training of Anopheles identification at the species level to prevent overfitting. Still, they were included in the training dataset at the subgenus level. All processed images were resized to 256 and 116 pixels for width and height, respectively. Pixel values were normalized within the (0,1) range. The dataset was then prepared for k-fold cross-validation, with k = 5, similar to what was done for the Glossina sp. WIPs analyses 27. K-fold cross-validation is a classic approach to evaluate the robustness of a machine learning method, including deep learning ones. For this study, the dataset was randomly shuffled and partitioned into k equal-size subsets with similar class distributions. A separate classifier was trained for each fold, using one of the k subsets for validation and the remaining k−1 subsets as training data. This strategy allowed measuring the mean accuracy of the five distinct generated classifiers.
Among all existing machine learning methods, Deep Convolutional Neural Networks and their different architectures have proven over the last decade to be the best suited to image classification. Compared to classic shallow methods (mainly Support Vector Machine, Random Forest, and Boosting-based approaches), they do not need handcrafted features as input of the learning process: the selection of the best features is intrinsic to the method itself, which is particularly well suited to the scenario of WIPs. A pipeline overview of the complete training procedure using CNN is shown in Fig. 1.
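The normalization and splitting steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names (`normalize_pixels`, `stratified_kfold`, `train_validation_splits`), the seed, and the toy labels are hypothetical, and the sketch assumes 8-bit pixel values and a simple deal-out of shuffled indices to keep class distributions similar across folds.

```python
import random
from collections import defaultdict

def normalize_pixels(pixels, max_value=255):
    """Scale raw pixel intensities to the [0, 1] range (assumes 8-bit input)."""
    return [p / max_value for p in pixels]

def stratified_kfold(labels, k=5, seed=0):
    """Return k folds of sample indices with similar class distributions.

    Indices are shuffled within each class, then dealt to the folds in turn,
    so per-class counts differ by at most one between folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    next_fold = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for idx in indices:
            folds[next_fold].append(idx)
            next_fold = (next_fold + 1) % k
    return folds

def train_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    folds = stratified_kfold(labels, k, seed)
    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation

# Toy example: 3 hypothetical classes with 10 pictures each.
toy_labels = ["gambiae"] * 10 + ["coluzzii"] * 10 + ["arabiensis"] * 10
splits = list(train_validation_splits(toy_labels, k=5))
```

In each of the five rounds, one classifier would be trained on the `train` indices and scored on the `validation` indices; the mean of the five scores then gives the cross-validated accuracy.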

Training of the convolutional neural network (CNN).
The original MobileNet 36, ResNet 37, and YOLOv2 38 CNN architectures were considered for automatic classification of the abovementioned dataset. Because our dataset is much smaller than those typically used in deep learning, thinner image recognition and classification architectures were developed to account for its reduced size. The first one is inspired by MobileNet, which takes advantage of depth-wise convolution 36. We propose to work with only one depth-wise convolution per layer of the CNN architecture to reduce the complexity and the number of extracted features. In addition, batch normalization was used to speed up and stabilize the training process 39.
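To illustrate why depth-wise convolution reduces model complexity, the sketch below compares the number of weights in a standard convolution with that of a depth-wise convolution followed by a 1×1 point-wise convolution, the factorization used in MobileNet. The layer sizes (3×3 kernel, 64 input channels, 128 output channels) are arbitrary illustrative values, not those of the networks trained here, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depth-wise convolution followed by a 1x1 point-wise convolution."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution mixing the channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable, round(standard / separable, 1))
```

For this hypothetical layer, the factorized form needs roughly eight times fewer weights, which is the kind of reduction that makes such architectures attractive for small datasets.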
In this first compact CNN architecture based on MobileNet, two interconnected layers, as in VGG 40, were applied; for YOLOv2, a DarkNet-19 architecture 38 was used. As this kind of architecture tends to overfit the training set (that is, its performance does not generalize to data other than the training dataset), we tested two reduced architectures, i.e., using 1 or 2 scales fewer than the original network. For clarity, we called them DarkNet-9 (8 convolution layers and one classification layer) and DarkNet-14 (13 convolution layers and one classification layer). We also reproduced the ResNet18 architecture 37 and trained it from random initialization. Even if this architecture seems too "deep" (and may lead to overfitting) compared to our other architectures, the intrinsic property of ResNet18, its residual connections, allows the training procedure to converge. Finally, a standard (shallow) approach was tested, based on extracting SURF descriptors (an efficient implementation of the classic SIFT descriptors), a Bag of Features (BoF) representation using a 4000-codeword dictionary, and an SVM with a standard polynomial kernel, similar to what was proposed in Sereno et al. 26.

Table 1. List of Anopheles species and description of samples included in the dataset. *Gambiae complex. **Nili complex. ***Claviger complex. ****Marshallii complex. $ Medical interest according to the WRBU database (https://wrbu.si.edu/vectorspecies?field_family_target_id=1194&title=&field_mt_products_tags_target_id=&field_pathogens_target_id=&field_geographic_locations_target_id=&items_per_page=30) and Wilkerson et al. 35. & ISO 3166-1 country code available at https://www.atlas-monde.net/codes-iso/. Origin: the sample's origin, W, wild; C, colony; N, number of pictures in the database.

For each task, we only use 1 fully connected layer with the softmax activation to predict the probability that an image belongs to the correct class. We train our networks using Stochastic Gradient Descent (SGD) with a learning rate of 10⁻² and a momentum of 0.9 for 30 epochs. The method was developed on a workstation with a quad-core CPU at 3.0 GHz and 16 GB of RAM. Information on the training options, accuracy and sensitivity, as well as the code, is available at https://github.com/marcensea/diptera-wips/commit/12f39ab500a3f820cfb817c67ef25c580942301d.
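The two training ingredients named above, a softmax output and SGD with momentum, can be sketched in plain Python as follows. This is an illustration only, not the published code: the learning rate is read here as 1e-2 and the momentum as 0.9, the function names are hypothetical, the momentum update shown is one common formulation, and the toy quadratic loss and its 200 update steps merely stand in for a real network and its training epochs.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_momentum_step(weights, grads, velocity, lr=1e-2, momentum=0.9):
    """One SGD update with momentum: v <- momentum * v - lr * g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy problem (assumption): minimize f(w) = sum(w_i^2), whose gradient is 2 * w.
weights = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(200):
    grads = [2.0 * w for w in weights]
    weights, velocity = sgd_momentum_step(weights, grads, velocity)
final_loss = sum(w * w for w in weights)

# Hypothetical class scores for one image and three classes.
probabilities = softmax([2.0, 1.0, 0.1])
```

The class with the highest softmax probability is taken as the prediction, and the full probability vector of each picture is what the clustering step operates on.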
From the class-membership probability matrix, a Euclidean distance matrix between pictures was computed (function dist in the R package stats). From this distance matrix, a hierarchical clustering of all photos of the test dataset was built using the average linkage method (function hclust in the R package stats) and plotted using functions of the R package ape.
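The clustering step above can be sketched in plain Python. This is a stand-in written for illustration, not the R `dist` and `hclust` calls used in the study: the function names and the toy probability rows are hypothetical, and the naive search over all cluster pairs is only practical for small inputs.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(points):
    """Agglomerative clustering with average linkage (UPGMA).

    Returns the list of merges as (cluster_a, cluster_b, distance), where each
    cluster is a tuple of row indices and the distance between two clusters is
    the mean of all pairwise distances between their members.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = sum(euclidean(points[i], points[j]) for i in a for j in b)
            d /= len(a) * len(b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges

# Toy class-probability rows for four pictures and three classes (hypothetical).
probs = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
]
merges = average_linkage(probs)
```

Pictures whose predicted probability vectors are similar are merged first, so pictures of the same species tend to end up in the same branch of the resulting dendrogram.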

Results
Wing interferential patterns of Anopheles wings according to genera, species, sex, and date of sampling. Sexual dimorphism. WIPs vary significantly among specimens belonging to different species but only moderately among specimens of the same species or between sexes (Fig. 3). WIPs were explored on the broadest available panel of Anopheles specimens, representing 3 of the 8 currently described subgenera. We previously documented the conservation of the interferential pattern on the wings of Glossina according to the position of the radial symmetry (intrados/extrados) and axial symmetry (left and right) 27. We also investigated the sexual dimorphism of WIPs in the studied samples (Fig. 2). Sexual dimorphism of WIPs is documented for numerous dipteran families, including Culicidae, Glossinidae, Muscidae, Calliphoridae, and Ceratopogonidae 41-43. Pictures of WIPs showed that, for the Anopheles specimens we examined, the sexual dimorphism is weak and difficult to delineate with the naked eye (Fig. 3). A more in-depth study would be necessary to investigate its presence.

Date of sampling.
Because the dataset includes specimens collected and identified as early as 1951, we checked the stability of WIPs according to the age of the collected specimens (Fig. 3). WIPs, when microscopically observable, appear unaffected by the conservation period, allowing us to enrich our dataset with samples from the IRD collection. However, in some older and/or heavily damaged samples, WIPs cannot always be revealed. This lack of WIPs affected 50% of the specimens collected before the 1980s and preserved in the collection (data not shown).

Training and classification.
We explored the accuracy of the trained classifier on the Anopheles dataset, using as negative samples datasets of Culicidae that do not belong to the Anopheles genus (non-Anopheles) and of Diptera that do not belong to the Culicidae family (non-Culicidae). We trained the CNN on this combination to improve the model's accuracy. The database initially contained a total of 843 WIP pictures of Anopheles spp.: 710 from species documented with more than ten pictures and 133 from species documented with fewer than ten. Our dataset contains photos of species acting as primary vectors of viruses or bacteria of medical interest (Table 1). Our database includes 25 Anopheles spp. out of the 140 with documented medical or veterinary interest. Overall, the WIPs of 43 Anopheles species were entered in our dataset. However, only the 20 species for which at least ten pictures are available were included in the species-level training process. The other specimens were used only to train the classifier at the genus (Anopheles) and subgenus (Anopheles, Cellia, Nyssorhynchus) taxonomic levels.
Classification at the genus level. Using this dataset, we first ascertained the accuracy of the process in discriminating the Anopheles genus Meigen, 1818, from other members of the Culicidae Meigen, 1818 family belonging to the Culicinae Meigen, 1818 subfamily. These specimens belonged to the genera Culex Linnaeus, 1758; Lutzia Theobald, 1903; and Aedes Meigen, 1818. Non-Culicidae samples belonging to the Psychodidae, Glossinidae, and Ceratopogonidae families were also included in the dataset to test the classification accuracy. The accuracy of the automatic classification process for Anopheles Meigen, 1818 is very high, exceeding 99% (Table 2). A single picture was misclassified as belonging to the Culex genus.
Classification at the subgenus level. In the second step, we investigated the capacity of our DL process to correctly identify specimens at the subgenus level. The training and testing dataset included a set of 833 pictures representative of 43 Anopheles species and three subgenera. Table 3 shows that the subgenus assignation accuracy is high for Cellia, moderate for Anopheles, and low for Nyssorhynchus. The Cellia subgenus is documented by the most species and pictures, followed by the Anopheles subgenus and then Nyssorhynchus. Therefore, the accuracy discrepancy, ranging from 38.8% to 96.6%, might be due to the low representativity of species and specimens for the Anopheles and Nyssorhynchus subgenera. In addition, the selected descriptors and the training process might need to be revised to train an accurate classifier for identifying Anopheles at the subgenus taxonomic level. These questions must be further addressed.
Classification at the species level. A circular dendrogram reflecting the proximity of each picture belonging to the Anopheles dataset was drawn (Fig. 4), revealing the presence of clusters. Some clusters match all the WIP pictures of a single species: An. mascarensis, cluster 2; An. darlingi, cluster 3; An. listeri, cluster 11; An. punctimacula, cluster 12; An. pharoensis, cluster 15. Other clusters include all members of one species plus some pictures of other related species: cluster 1, all An. funestus pictures plus one picture of An. barberellus; cluster 4, all An. obscurus plus one picture each of An. nili Theobald, 1904 and An. paludis; cluster 7, all An. arabiensis plus six pictures of other species; and cluster 14, all An. gambiae plus three pictures of An. coluzzii. The last category of clusters includes most, but not all, pictures of a species: cluster 5, mainly An. paludis; cluster 6, primarily An. maculipennis; cluster 9, mainly An. cinctus; cluster 10, mainly An. squamosus; cluster 13, mostly An. coluzzii; and cluster 8, gathering the An. multicolor photos with three pictures of An. cinereus. Notably, none of the 20 species under study was scattered into more than two clusters. Finally, the reliability of the DL model in accurately classifying WIP pictures of 19 Anopheles species was calculated, and the results are presented in Table 4.
The recorded accuracy varies from weak (50.00%) to perfect (100.00%). Perfect accuracy (100.00%) is achieved for ten species whose WIP pictures were entered in the dataset. An accuracy above 50% is recorded for three species, but the DL methods failed to assign 7 Anopheles species with an accuracy above 50% (Table 4). For most of the species whose assignation accuracy falls below 70%, only a small number of pictures is available for the test process (An. demeilloni, 2; An. maculipennis, 4; An. barberellus, 2; etc.). More than ten pictures per species might be a prerequisite for good accuracy with our process; this will be further investigated. Only 14 pictures of the test dataset were misclassified (Fig. 2B), and the computed specific recognition of Anopheles remains remarkable considering our dataset's species richness.
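The per-species figures discussed above can be computed as in the sketch below, which assumes that the accuracy for a species is the fraction of its test pictures assigned to the correct species. The function name and the toy labels are hypothetical and do not reproduce the values of Table 4.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of test pictures of each species that were assigned to that species."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Toy test set (hypothetical): one An. coluzzii picture is predicted as An. gambiae.
true_labels = ["gambiae"] * 4 + ["coluzzii"] * 4 + ["arabiensis"] * 2
predicted_labels = ["gambiae"] * 4 + ["coluzzii"] * 3 + ["gambiae"] + ["arabiensis"] * 2
accuracy = per_class_accuracy(true_labels, predicted_labels)
```

With only two test pictures for a species, this measure can only take the values 0%, 50%, or 100%, which is one reason why results for under-sampled species should be read with caution.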
The Anopheles genus encompasses numerous morphologically indistinguishable species grouped into species complexes, i.e., 'morphologically similar or identical natural populations that are reproductively isolated'. According to this definition, 27 species complexes are currently described for Anopheles. Our dataset gathers specimens from 4 complexes: the Claviger, Gambiae, Marshallii, and Nili complexes. Anopheles nili belongs to a complex of 4 species (An. nili; An. somalicus Rivola & Holstein, 1957; An. carnevalei Brunhes, Le Goff & Geoffroy, 1999; An. ovengensis Awono-Ambene, Kengne, Simard, Antonio-Nkondjio & Fontenille, 2004). Unfortunately, we cannot address the accuracy of the identification process for 3 of these complexes, because the species of the Marshallii and Claviger complexes were documented in our dataset with fewer than ten pictures, and the Nili complex was represented by only one species.
As early as 1968, morphological variations in An. nili s.l. populations suggested that An. nili is a complex of species, whose members were later identified 44,45. Our set of pictures of An. nili includes specimens dating back to 1966 and might therefore encompass species of the Nili complex that had not been described before evidence for the existence of this complex emerged. In addition to the few pictures available to train the classifier, other underlying factors might explain the modest identification accuracy we recorded. For other mispredicted pictures, the small number of samples available and the age of the specimens might have reduced the power of the prediction approach. Even if An. maculipennis can be misidentified as An. arabiensis (see Fig. 5), these two species are not sympatric in their natural environment.
The "Gambiae complex" 9, first described in 1960, is documented in our dataset by four of the nine currently described species, although fewer than ten pictures are present in the dataset for one of them. Nevertheless, our DL approach on WIPs achieves an identification accuracy of 100% for An. arabiensis and An. gambiae and 88% for An. coluzzii (Table 4). All specimens were collected from colonies, avoiding any misidentification ambiguity. Strikingly, when An. coluzzii is misclassified, it is assigned to An. gambiae but never to An. arabiensis, whereas An. gambiae and An. arabiensis are always correctly identified.
Among the photos of mispredicted specimens (Fig. 5), the sample of An. obscurus mispredicted as a non-Anopheles specimen is of interest, since this specimen bears wing characters of Anopheles: smooth and patchy areas on the wing costa and subcosta. This sample originated from Congo and was collected in 1988; the interferential pattern was still present but appeared slightly degraded during the preservation period. Such modification might have affected the recognition process, and it is documented that slight picture modifications (lens blur, etc.) can significantly alter the recognition accuracy of our process 27. The An. coluzzii specimen misidentified as An. gambiae presents morphological alterations, with damage on the wing; nevertheless, this did not prevent a correct classification at the genus and subgenus taxonomic levels.

Discussion
In this study, we provide evidence on the accuracy of WIPs combined with DL for identifying Anopheles specimens at various taxonomic levels: genus, subgenus, species, and species complexes. Our results reveal that WIPs generated at the surface of Anopheles wings are a suitable fingerprinting method to determine specimens' identity at the taxonomic levels of interest for entomological surveys and vector control follow-up.
Since the 2010s, WIPs (Wing Interference Patterns) have received significant attention for their potential as a diagnostic method for species identification in taxonomic and systematic studies 23,25,46. Transparent wings with a thin membrane, found mainly in small insects, allow the formation of a colored pattern via thin-film interference. In a dark and light-absorbing environment with incoming external light (sunshine, for example), conspicuous WIPs are displayed on the wing membranes. These WIPs vary significantly among specimens belonging to distinct species but only moderately between specimens of the same species or between sexes. The observed Newton color series is similar to that seen on a soap bubble and is directly related to the thickness of the wing membrane at any given point. Unlike the angle-dependent iridescence of a flat film, the structures in an insect's thin wing membrane act as diopters, ensuring that the WIPs appear essentially non-iridescent 23. The role played by WIPs in sexual selection in Drosophila melanogaster has been addressed, demonstrating that males with more vivid wings are more attractive to females than males with dull wings. These experimental results add a visual element to the mating tool array of Drosophila 47. The role of WIPs during courtship points to a function in insect speciation. This point is interesting and deserves further exploration to explain the assortative mating of the Gambiae complex members in their natural environment.

The genus Anopheles encompasses eight subgenera: Anopheles, Baimaia, Cellia, Christya, Kerteszia, Lophopodomyia, Nyssorhynchus, and Stethomyia. The largest cosmopolitan subgenera are Anopheles and Cellia. From a malaria transmission standpoint, a relatively small number of species of the Cellia subgenus (i.e., the Gambiae complex) are responsible for most of the world's malaria transmission; from a broader entomological survey point of view, more than 100 species of Anopheles are of medical and veterinary interest. With about 500 inventoried and validated species, accurately identifying Anopheles is challenging, even using published identification keys 48-52. The presence of species complexes further complicates surveys in areas where vector and non-vector species belonging to the same complex are sympatric. The two best-known examples of such complexes are the An. maculipennis complex, with at least nine species in Europe 45, and the Gambiae complex, with nine species in Africa. Besides diversity, microscopic observation is a time-consuming and challenging process that depends heavily on the skills and experience of dedicated public health personnel. In addition, the morphological characteristics of mosquitoes collected in the field may be degraded by discoloration or damage caused during capture and processing at the study site or during the freezing and drying preservation protocols.

Methodologies relying on genetic or biochemical criteria were tested to overcome such identification challenges. DNA barcoding employs short molecular sequence tags from standardized genomic regions for species identification. Currently, 16,948 records forming 378 clusters are available for the Anopheles genus (http://v4.boldsystems.org/index.php, request performed on 12/7/2022). DNA barcoding can complement the morphological assessment of specimens but has several drawbacks and needs to be better adapted to field entomological surveys 53,54. Biochemical markers include protein profiling using MALDI-TOF analysis or other biochemical characteristics of the sample, like the cuticle carbohydrate composition and chemical formula 55. MALDI-TOF profiling was first applied with relative success to a restricted number of Anopheles species (An. albimanus, An. minimus, An. freeborni, An. farauti, An. atroparvus, An. funestus), including members of the Gambiae complex (An. quadrimaculatus, An. merus, An. gambiae, An. arabiensis), using the heads and thoraces of female mosquitoes 56,57. This methodology was further applied to some neotropical anopheline vectors (An. albimanus, An. apimacula, An. aquasalis, An. darlingi, An. malefactor, An. nuneztovari, An. pseudopunctipennis, An. punctimacula) with an identification success between 78 and 100%, comparable to our accuracy rate 58. Nevertheless, few works were subsequently performed on Anopheles specimens with protein profiling 59. Altogether, these methodologies help to solve some taxonomic and ambiguous identification problems but are not well suited to entomological surveys because of their cost and their requirements in infrastructure, material, and trained personnel. Infrared spectroscopy (NIR and MIR) can detect changes in mosquito cuticles by quantifying the light absorbed 60. The discriminative capability of such methodology at the species level has yet to be thoroughly investigated and has so far been restricted to very few members of the Anopheles genus, i.e., An. gambiae s.s., An. coluzzii, and An. arabiensis, but the method appears well suited to age grading of populations 34,60-65 and, interestingly, to pathogen (Plasmodium) detection within the arthropod vector 66.

Here, we provide evidence of reliable Anopheles species identification using WIPs and the DL process. We identified some species with 100% accuracy, even among those belonging to the Gambiae complex of species 34,62. This accuracy is higher than that provided by MIR or NIR technology. We also provide information on the capability of this method to be successfully applied to field-collected samples and old specimens. In addition, our methodology allows specimens to be identified at various taxonomic levels and, even for damaged specimens, to be classified at the genus and/or subgenus levels. This is of interest for medical entomology purposes, knowing that species of medical or veterinary interest are gathered in four of the eight described subgenera. It might also be helpful for taxonomic studies involving old specimens.
Advances in deep learning (DL) have opened a new perspective for arthropod identification. This branch of machine learning has the versatility to be employed on various markers of use in entomology, including protein profiling and image analysis of morphological characteristics. The latter, which includes the typical morphological characteristics used to identify Culicidae specimens, can potentially be used in "citizen science" projects. Such community surveillance has been applied to mosquitoes 67,68, and a citizen science approach in conjunction with a deep learning method was developed to follow Aedes albopictus (Skuse, 1895) from pictures taken by citizens 69. For now, however, the accuracy of such methods has not been evaluated in areas with high Culicidae diversity or for Anopheles recognition. We can nonetheless anticipate that this process will suffer from the same limitation in identifying species belonging to the same complex. Although DL approaches have also been applied to the identification and age grading of Anopheles belonging to the Gambiae complex using MIR 34, their interest for taxonomic purposes has not been thoroughly probed.
Therefore, it will be of interest to further explore the capacity of WIPs coupled with DL to address challenges such as delineating geographically distinct populations, identifying sex and physiological state, and performing age grading in natural populations, if any such signal exists. We anticipate that this method can be applied to vectors of other arthropod-borne diseases. Assuming that deep learning analysis results in robust classification outcomes, it is worth evaluating, even qualitatively, whether the proposed approach could be scalable and usable in real-life conditions with regard to several essential criteria: cost (infrastructure, material, technically skilled personnel), computational resources, analysis time, sample destructiveness, and taxonomic level of the classification.

Figure 1. Schematic representation of the pipeline process developed for Anopheles identification using the Convolutional Neural Network approach. Example of classification output with the associated probability. The class of a given Anopheles WIPs image is predicted in two steps: (1) extracting hierarchical features (convolutional layers) and (2) classifying these features (fully connected layer and softmax layer). In the feature extractor part, feature maps generated by filters at each convolution layer are indicated. These feature maps are used for visualization by weighting them with channel-wise averaged gradients.

Figure 2. Wing interferential patterns for male and female specimens of some Anopheles species.

Table 2. Accuracy tests of the DL (Deep Learning) process for the Anopheles (Meigen, 1818) genus assignation. Accuracy values are in bold. Ac accuracy, N number of pictures.

Table 4. Accuracy tests of the deep learning process at the species level. C class number, Nb number of samples tested, NA not ascertained.