Comprehensive characterisation of Culicoides clastrieri and C. festivipennis (Diptera: Ceratopogonidae) according to morphological and morphometric characters using a multivariate approach and DNA barcode

Biting midges are widespread around the world and play an essential role in the epidemiology of over 100 veterinary and medical diseases. For taxonomists, it is difficult to correctly identify species because of affinities among cryptic species and species complexes. In this study, species boundaries were examined for C. clastrieri and C. festivipennis and compared with six other Culicoides species. The classifiers are partial least squares discriminant analysis (PLS-DA) and sparse partial least squares discriminant analysis (sPLS-DA).The performance of the proposed method was evaluated using four models: (i) geometric morphometrics applied to wings; (ii) morphological wing characters, (iii) "Full wing" (landmarks and morphological characters) and (iv) "Full model" (morphological characters—wing, head, abdomen, legs—and wing landmarks). Double cross-validation procedures were used to validate the predictive ability of PLS-DA and sPLS-DA models. The AUC (area under the ROC curve) and the balanced error rate showed that the sPLS-DA model performs better than the PLS-DA model. Our final sPLS-DA analysis on the full wing and full model, with nine and seven components respectively, managed to perfectly classify our specimens. The C. clastrieri and C. festivipennis sequences, containing both COI and 28S genes, revealed our markers’ weak discrimination power, with an intraspecific and interspecific divergence of 0.4% and 0.1% respectively. Moreover, C. clastrieri and C. festivipennis are grouped in the same clade. The morphology and wing patterns of C. clastrieri and C. festivipennis can be used to clearly distinguish them. Our study confirms C. clastrieri and C. festivipennis as two distinct species. Our results show that caution should be applied when relying solely on DNA barcodes for species identification or discovery.


Molecular analyses.
Results of molecular analyses. The sequences obtained are available in GenBank (Supplementary Information 1). Sequence alignments were 399 bp for COI and 587 bp for 28S including gaps. Phylogenetic analysis. Our molecular analysis ( Fig. 1) with both markers generated seven supported clusters, six of which were in agreement with the morphological determination (i.e. C. alazanicus, C. brunnicans, C. circumscriptus, C. furcillatus, C. nubeculous and C. pictipennis). However, one cluster (i.e. two species) corresponded to undistinguished C. clastrieri and C. festivipennis.
In addition, the COI mtDNA tree shows that C. furcillatus is the sister of the "C. clastrieri/festivipennis" clade. Indeed, C. pictipennis is the sister species of C. brunnicans while C. circumscriptus is positioned between the two clades.
Moreover, the 28S rDNA tree shows that C. pictipennis is the sister of the "C. clastrieri/festivipennis" clade. The other species are positioned in several places without a clade.
Finally, C. festivipennis and C. clastrieri-grouped in the same main clade-showed small interspecific distances (0.4 ± 0.2%); these were not identified as separate species based on DNA barcodes. We therefore decided to create a new group (C. clastrieri/festivipennis clade) based on interspecific distance. The overall mean genetic distance (K2P) computed for the different species of Culicoides was found to be 16.6 ± 1.4%. Interspecific K2P values for different (Table 1) species and taxa ranged from 27.3% (between C. furcillatus and C. nubeculosus; between C. circumscriptus-and C. furcillatus) to 17.2 ± 2.1% (between C. circumscriptus and the C. clastrieri/ festivipennis clade) for our samples. For the COI Genbank sequences, we observed approximatively the same proportion and the same species (Table 1). We remarked very little interspecific divergence between our sample of the C. clastrieri/festivipennis clade and the C. clastrieri/festivipennis Genbank clade (0.6 ± 0.4%). www.nature.com/scientificreports/ Analysis from 28S rDNA sequences did not show any intraspecific divergence whatever the taxa (0.000) with the exception of C. nubeculosus (0.1 ± 0.1%) and C. festivipennis/C.clastrieri (0.1 ± 0%). The overall mean genetic distance (K2P) computed for the different species of Culicoides was found to be 2.1 ± 0.03%. Interspecific K2P values for different species (Table 1) and taxa ranged from 1.2% (between C. circumscriptus and C. furcillatus; C. furcillatus and C. brunnicans, the main C. clastrieri/festivipennis clade and C. furcillatus) to 5.3 ± 0.9% (between C. circumscriptus and C. nubeculosus).
Principal component analyses. Principal component analysis (PCA) was used to observe possible grouping trends.
Firstly, we performed a first normed PCA using the "Wing landmarks" model. The first three axes accounted for 76%, 15% and 8% of the total variance, which suggests a weak structuration of the data. This was confirmed by a scatterplot of PCA axes 1 and 2 that was unable to separate the species (Fig. 3).
Secondly, we performed a first normed PCA on the "Wing morphological characters" model. The various specimens of each species are represented by a single point suggesting a close correlation of wing morphological characters. This model, without variance, is not validated and does not permit species separation.
We studied the "Full wing (landmarks and morphological, characters)" model through a normed PCA on raw data. C. clastrieri could be clearly separated from C. festivipennis. The first five axes accounted for 40%, 25%, 12%, 10% and 5% of the total variance. The scatterplot separated unambiguously and without overlap C. clastrieri-C. festivipennis on the one hand and the six species on the other hand (Fig. 3).
Finally, we performed a first normed PCA on the "Full model" (Morphological characters-wing, head, abdomen, legs-and wing landmarks). The first nine axes accounted for 26%, 23%, 22%, 10%, 8%., 4%, 3%, 2% and 1% of the total variance, which reveals good structuration of the data. This was confirmed by a scatterplot of PCA axes 1 and 2 that presents the same topology as the wing morphological model (Fig. 3).
This supports discrimination according to the species' wing pattern. Similarly, and some body pattern characters could be used to identify Culicoides from the clastrieri/festivipennis clade better and quicker. With that objective in mind, we performed analyses on three datasets: (1) "Wing landmarks" (11 landmarks); (2) "Full wing" (38 items) and (3) the "Full model" that includes 71 items.  www.nature.com/scientificreports/ Discriminant analyses. PLS-DA and sPLS-DA models were used in order to discriminate the extremes (i.e. the most sensitive and most robust groups) using the three datasets (species, models and components) as described. The accuracy and the balanced error rate (BER) for the two models were compared and are summarised in Supplementary Information 2 and Fig. 4. The tuning step of the number of components to select showed that 16 components were necessary to lower the BER (Fig. 4A,B) for the "Wing landmarks" data. The AUC values with 16 components are as follows: C. alazanicus (0.97, p < 0.001), C. brunnicans (0.98, p < 0.001), C. circumscriptus (1.00, p < 0.001), C. clastrieri (0.97, p < 0.001), C. festivipennis (0.89, p < 0.001), C. furcillatus (0.97, p < 0.001), C. nubeculosus (1.00, p < 0.001) and C. pictipennis (1.00, p < 0.001). After 16 components, the AUC values are approximately comparable (Fig. 4).
A perfect result would be an AUC of 1.0 obtained using eight components with PLS-DA ( Fig. 5) and seven with sPLS-DA using the "Full model" (Fig. 6). For the "Full wing" model, nine components with PLS-DA ( The most discriminating characters for the "Full wing" and "Full model" with PLS-DA and sPLS-DA are summarised in Supplementary Information 3 and 4 respectively.
With the PLS-DA classifier and the "Full model", we observe that a large number of characters are necessary to separate species. According to the components, 39

Discussion
The present integrative taxonomy study carried out on two closely-related and sympatric species, C. clastrieri and C. festivipennis, shows congruence between classical morphological identification and GM results. The molecular data revealed a joint C. clastrieri/festivipennis clade.
This paper reports a comprehensive evaluation of selected statistical classification techniques (PLS-DA and sPLS-DA) to discriminate species on the basis of four models: (i) "GM wings"; (ii) "morphological wing characters", (iii) "Full wing" model and (iv) "Full model". While these classifiers have been used in several scientific domains (particularly medicine) their performance had never previously been assessed on morphological and GM data for insects. sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of results; it is therefore a good alternative to other types of discriminant models 33 .
It is interesting to note that the sPLS-DA classifier was able to separate species in the "Full wing" model and the "Full model" with AUC values of 1.0. The character classifier based on 14 items performed better than that based on 39 items for C. clastrieri with the full model. For C. festivipennis, just one item is necessary to discriminate this species. For the "Full wing" model, the sPLS-DA classifier can characterise C. clastrieri with five items (landmarks) while only one item (morphological character) is needed for C. festivipennis.
Our study, combining both GM and morphological characters, allows very good discrimination of our specimens. The classifiers were tailored so as to reduce the number of items needed to characterise species.
The "Full wing" (landmarks and morphological, characters) model separates the species without error. The dichotomous keys to species include all morphological characters and are used to identify Culicoides fauna by biogeographical region 7 . Moreover, identification aids based on wing patterns have been published for the same regions 7 . To our knowledge, only one study based on wing patterns has produced identification keys for epidemiological studies 9 . However, the final identification is the species or species group. Our study, based on both the morphological characters of wings (27 items) and the application of GM (11 points), allows all the studied species to be successfully separated, including cryptic species (C. clastrieri and C. festivipennis). Our study relies on the standardisation of morphological characters (http://www.iikcu licoi des.net). Future investigations are needed to create a worldwide database and to combine the GM approach for identifying Culicoides species.
DNA barcoding based on the COI and 28S sequences discriminated all morphologically determined species except C. festivipennis and C. clastrieri, which are not considered as separate species using these analyses. The identical C. festivipennis/clastrieri sequences with both COI and 28S, the rDNA gene assumed to be well conserved and thus better able to separate the species 34 , and a rare incidence of an overlap in wing landmarks, may indicate ongoing hybridisation 24 . Regarding the consequence of our DNA analysis, we consider that we observed a fragment of the Culicoides genome. Wing shape development in biting midges is probably also influenced by several genes and their expression. A previous study based on an immuno-enzymology assay showed that C. clastrieri and C. festivipennis vary by only one enzymatic character 35 but they could not be distinguished by COI barcode in the studies of Ander et al. and Sarvašová et al. However, the authors of these studies did not cast doubt on their specific status, and Ander 24 actually argued for possible ongoing hybridisation without any statement on their lineage divergence.
The GM technique allows species to be compared, not described. In mosquitoes, GM has been successfully applied in many studies investigating intra-and interspecific variations, parasite detection, sexual dimorphism, plasticity and deviation, separation of laboratory strains and genetic information 36 . In biting midges, the intersexual morphology specimens infested by parasites were detected by GM 28 . With respect to interspecific variation, the cryptic species seem to share approximatively 20% of the wing skeleton [26][27][28][29][30] . According to Sarvasova 22 , we should consider C. clastrieri and C. festivipennis as two separate species. Future research should focus on the development of nuclear markers with a higher-level phylogenetic relationship.
In conclusion, our study describes novel modelling techniques to evaluate species delineations within the Culicoides genus. Eight species, including two cryptic species, are clearly discriminated using selected statistical classification techniques (PLS-DA and sPLS-DA). Our study confirms C. clastrieri and C. festivipennis as two distinct species. sPLS-DA is clearly competitive in terms of computational efficiency and can separate cryptic www.nature.com/scientificreports/ species with fewer items than the PLS-DA classifier. We therefore propose to use combined morphological characters with a GM approach on wings, visible under a stereomiscroscope, to separate Culicoides species.

Materials and methods
Specimens and identification. Our study was conducted on 134 specimens from eight Culicoides species.
All the specimens were collected in France using UV traps (see sampling details in Supplementary Information  1). The wings, head and abdomen (with six segments) of individual midges were mounted in Euparal solution on microscope slides for morphological identification 30 . The thorax and legs were used for DNA extraction 37 . Preliminary species identification of the specimens was based on the morphological characters and wing patterns described in the identification key of Delécolle 8 and IIKC 10 .
The amplification conditions for D1D2 were as follows: after an initial denaturation step at 94 °C for 3 min, followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at 58 °C for 90 s, and extension at 68 °C for 60 s, then a final extension at 68 °C for 10 min. For COI, the initial denaturation step at 95 °C for 15 min, then 5  Cleaned PCR products were sequenced by Genewiz, GmbH (www.GENEW IZ.com). PCR products were directly sequenced in both directions using the primers for DNA amplification. Sequences were corrected using the Pregap and Gap programs included in the Staden software package 40 . Alignments and nucleotide sequence diversity among the samples were obtained using Mega v6.0. with the Kimura-2 parameter 41 .  www.nature.com/scientificreports/ Additionally, the COI Genbank sequences of C. alazanicus, C. brunnicans, C. circumscriptus, C. clastrieri, C. festivipennis, C. furcillatus, C. nubeculosus and C. pictipennis were also included in our molecular analyses ( Supplementary Information 1).
Neighbour-joining (NJ) method (Kimura-2 parameter) analyses were performed for both markers with MEGA software version 7.0 42 . In order to exclude populations with the COI sequences, we divided up the species as follows: i) our specimens/species and ii) species-Genbank (Table 1). NJ trees of K2P distances were created (data not shown) to provide a graphic visualisation of clustering among different species 43 . Morphological characters. The list of morphological characters used are found on the web site http:// www.iikcu licoi des.net developed by Mathieu 10 . The raw dataset included 60 morphological characters (27 wings, 14 abdominal, 16 head and 3 leg characters) and eight species: C. alazanicus, C. brunnicans, C. circumscriptus, C. clastrieri, C. festivipennis, C. furcillatus, C. nubeculosus and C. pictipennis. For the statistical analysis, the morphological characteristics and species classification were coded as qualitative variables (see Supplementary  Information 5).
Geometric morphometric analysis. Digital images of the wings were obtained using an Olympus BX53 microscope equipped with an Olympus SC100 camera, under 10 X magnification. A set of 11 landmarks (Fig. 2) covering the wing surface was selected and recorded for each wing using the free CLIC software (https ://xyomclic.eu). Additionally, wings of species from the collection belonging to the Institut de Parasitologie et de Pathologie Tropicale de Strasbourg were included in our study ( Supplementary Information 1).

Statistical analyses.
Principal component analysis (PCA) was used to explore the correlation between variables and linear discrimination analyses (LDAs) used to predict individual species based on variable values.
PCA, PLS-DA and sPLS-DA were performed using the R package mixOmics (http://www.R-proje ct.org). PLS-DA is a supervised, multivariate modelling technique used to determine the variation within X (the morphological and landmark data), which is correlated with Y (the species). The sparse version of the technique, sPLS-DA, seeks to identify the best Kennard-Stone algorithm features that provide the best discrimination between two classes, ignoring all other features. sPLS-DA thus provides a framework for both feature selection and classification. To construct both PLS-DA and sPLS-DA, the dataset (Suppl. data) was divided into two subsets, one used for calibration (two thirds of the samples) and the other (the remaining third) used for external validation by the Kennard-Stone algorithm. The predictive ability of the final models was assessed using cross-validation. ROC analysis was used to determine optimal sensitivity and specificity to discriminate between species (with corresponding 95% confidence intervals). The area under the curve (AUC) showed the average prediction performances for the various decision thresholds. The closer the AUC is to 1, the more accurate the model 44  www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.