A growing part of the scientific community now sees each organism as a community of interacting species rather than as an independent entity. Insects are no exception. They host a variety of microbial symbionts living both inside and outside their host cells. These microorganisms are at least as numerous as the host cells, and may constitute up to 10% of the host's total mass1. The effects of symbionts on their insect hosts are potentially as diverse as their taxonomy, ranging from pathogenic to obligately mutualistic, with all possible intermediate relationships2. This diversity has recently attracted growing interest, but gaps and biases remain. For example, in Lepidoptera, research on symbiosis has mostly focused on the most charismatic groups of colourful diurnal butterflies3,4,5 and on species that are pests to human society6,7,8. In contrast, the rest of Lepidoptera (mainly moths), which encompasses no fewer than 130,000 species9 (80% of all Lepidoptera), has rarely been screened for associations with symbionts10.

High throughput sequencing technologies (HTS) now provide a relatively easy and cheap way to obtain large amounts of genetic data. The technologies used to generate genomic data are varied and applicable to a wide range of organisms. They have thereby revolutionized our access to genomic resources, and they continually expand and renew the scope of the questions we can address within the natural sciences. For example, sequencing material from a particular study organism, either entirely or partially, may result in a mix of host-specific DNA and DNA from other sources. These other sources can include ecto- and endosymbionts, food, opportunistic parasites and pathogens, among others. Such genomic data open up genomic analyses to broader targets, including the diversity of symbionts that might be associated with particular targeted hosts.

Here, we mine the data produced from whole genome sequencing of 47 moth species from the hyper-diverse family Erebidae (24,000 species) to (1) explore the potential diversity of symbionts associated with this megadiverse Lepidoptera family; and (2) evaluate the exploratory power of low-coverage genome sequencing approaches for recovering information on natural host-symbiont associations.


Metagenomic analysis

We identified the species Idia aemula, Luceria striata, Acantholipes circumdata and Oraesia excavata (RZ271, RZ42, RZ248, and RZ337) as infected by Wolbachia and by the Wolbachia-associated phage WO (Table 1), with between 66,978 and 208,044 of the reads identified as belonging to the symbiont. Additionally, the reads obtained from sample RZ13 (Gonitis involuta) were also found to include 954 Wolbachia reads, which is a higher number of reads than found for any of the clearly uninfected specimens, but considerably lower than for any of the four clearly infected specimens listed above. The mapping of the reads to two known Wolbachia reference genomes (wMel, GCF_000008025.1 and wPip, GCF_000073005.1) shows a relatively homogeneous coverage of the reference genomes (Fig. 1), with mean coverages between 10 and 40 times the reference genome (Table 2). In the case of sample RZ13, even though the coverage seemed homogeneously scattered across the reference genome, the mean coverage was lower than 1x (Table 2).

Table 1 The number of reads classified as originating from the host and various microorganisms.
Figure 1 Reads mapped to the wMel and wPip Wolbachia reference genomes. The coverage is shown on the vertical axis. The top graphs (yellow) correspond to the sample RZ337 (Oraesia excavata), followed by RZ271 (Idia aemula in green), RZ248 (Acantholipes circumdata in grey), RZ42 (Luceria striata in purple) and at the bottom RZ13 (Gonitis involuta in blue).

Table 2 Samples screened for Wolbachia genomes.

Both Kraken2 and MetaPhlAn2 analyses showed no to very few reads mapping to Cardinium, Hamiltonella or Spiroplasma bacteria, or to Microsporidian fungi, in any of the 47 datasets screened. In contrast, the specimens RZ103 and RZ111 (Rema costimacula and Platyjionia mediorufa) included considerably more reads from Sodalis bacteria (9108 and 4395, respectively), and from Arsenophonus bacteria (1336 and 662, respectively), than any other sample (maximum of 50 reads in any other sample). A closer look at the Kraken2 outputs from these two samples also revealed a possible infection with a Plautia stali symbiont (gammaproteobacteria; 3856 and 1914 reads, respectively), which was not detected in any of the other 45 samples. Additionally, the sample RZ30 (Creatonotos transiens) is the only one to show a relatively high number of reads mapping to Burkholderia bacteria (N = 1995). Finally, we identified a considerable number of reads from viruses of the Polydnaviridae family, and especially from bracoviruses, in three samples, Erebus ephesperis, Masca abactalis and Asota heliconia (RZ11, 1288 reads, RZ18, 1381 reads, and RZ44, 1384 reads). All other samples include fewer than 750 reads, and more often no reads, for these viruses.

All details of the screen for the common symbionts can be found in Table 1, while all results from the Kraken2 and MetaPhlAn2 analyses can be found in the supplementary material and GitHub repository.


We confidently add four moth species (i.e., Idia aemula, Luceria striata, Acantholipes circumdata and Oraesia excavata) to the list of species hosting the intracellular alpha-proteobacterial symbiont Wolbachia10, confirmed through two screening methods (i.e., Kraken2 and MetaPhlAn). With only 4 out of 47 species (8%) found infected, this represents a lower infection rate than the current literature suggests (i.e., 16–79% of the studied lepidopteran groups infected with Wolbachia11,12,13,14,15,16). The general penetrance of Wolbachia however varies significantly among species, and is often low within infected populations17. Thus, with only one sample screened per species, our results most likely underestimate the true infection rate within the Erebidae moths. Future broader screenings of different populations will provide more accurate natural infection rates for these species. Although microbial surveys in Calyptra thalictra18 and Lymantria dispar19,20 did not highlight Wolbachia infections in these species, a recent screening of diverse moth species from Thailand showed that two (22%; Olepa sp. and Creatonotos transiens) out of nine Erebidae species screened (i.e., Amata sp., Asota plana, Creatonotos transiens, Euplocia membliaria, Fodina contigua, Neochera inops, N. dominia, Olepa sp., Pareuchaetes pseudoinsulata) were infected by the bacterial symbiont21.
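
To make the uncertainty around a proportion such as 4 infected species out of 47 more concrete, the following minimal sketch computes a Wilson score interval. This is an illustration only: the choice of the Wilson interval, the 95% level and the function name `wilson_interval` are assumptions introduced here, not part of the analyses performed in this study.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval (assumed level).
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # 4 Wolbachia-positive species out of the 47 screened (counts from the text).
    low, high = wilson_interval(4, 47)
    print(f"observed: {4 / 47:.3f}, interval: {low:.3f} to {high:.3f}")
```

Such an interval only reflects sampling uncertainty across species. It does not correct for the within-species penetrance issue discussed above, which arises from screening a single specimen per species.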

Notably, we observe the presence of Wolbachia phage WO within the samples for which Wolbachia presence is strongly supported. The interaction of this bacteriophage with Wolbachia has been the focus of many evolutionary studies in recent years22,23,24,25,26. Previous research suggests that phage WO is associated with horizontal gene transfer in Wolbachia, and with genes that may affect the fitness of the bacterium27,28. These bacteriophages have been observed in practically all the Wolbachia genomes studied to date, with very few exceptions among obligate mutualistic strains22,29,30. In the sample RZ13, species Gonitis involuta, a relatively high number of reads mapped to Wolbachia (1 K reads), although significantly lower than in the other four species (29–144 K reads), and no reads were mapped to phage WO. In addition to the relatively lower sequencing depth compared to the other positive cases, several non-exclusive hypotheses may explain such a pattern. These reads might originate from (1) contamination with other genetic material alien to our sample, (2) the integration of Wolbachia genomic material (partially or entirely) in the host genome, (3) random errors in the identification of the reads as Wolbachia, (4) low quality genomic material or (5) a combination of the above-mentioned reasons. The overall screening results suggest that this sample was of low quality prior to sequencing. We however cannot rule out any of the other possibilities, and more studies are needed to fully confirm or reject the presence of Wolbachia in this species.

The two samples, Rema costimacula (RZ103) and Platyjionia mediorufa (RZ111), were of particular interest. Both the Kraken2 and the MetaPhlAn2 analyses suggest the presence of three gammaproteobacterial endosymbionts, namely Sodalis, Arsenophonus and ‘Plautia stali symbiont’, in both samples. Sodalis has been characterized from different insects, including tsetse flies31, seal louse32, pigeon louse33, louse flies34, aphids35, seed bug36, weevils37,38, stinkbugs39, bees40, and ants41, among others. To the best of our knowledge however, this is the first time the three symbionts are found in Lepidoptera (Duplouy and Hornett10). This suggests that Sodalis bacteria might affect a more diverse group of organisms than is currently known. We are however cautious with the interpretation of this result, as the simple discovery of bacteria in the genomic data does not inform us about the nature of their interactions with the hosts. Whether or not Sodalis and the moth species share a symbiotic relationship will only be confirmed via experimentation and testing of the partnership through the host generations. Contamination of those two samples prior to DNA extraction is always possible. However, the sequenced host genetic material did not include significant amounts of hemipteran DNA (or DNA from any other non-lepidopteran insect order), with comparably low numbers of reads (< 1500) mapped to hemipterans in all the sequenced genomes. This argues against DNA contamination by material from the previously confirmed hemipteran hosts of these three symbionts. It has been shown that the female brown-winged green bug, P. stali, smears excrement over the egg surface during oviposition. The nymphs acquire the symbionts right after hatching by ingesting the excrement42. Therefore, a possible contamination source could be any contact with such excrement or egg clusters.
Once again, studies of the symbionts in natural populations of these moth species are needed to fully resolve the true infection state of these species and the relationship with the bacteria.

The moth species Creatonotos transiens shows a potential partnership with the proteobacterium Burkholderia sp. Recently, Boonsit and Wiwatanaratanabutr21 found Wolbachia in 75% of the C. transiens samples they screened (N = 6/8). Their samples were collected from Thailand, while the C. transiens specimen we analysed in this study originated from Hong Kong. In Lepidoptera, Burkholderia are known from the microbiota associated with the moth Lymantria dispar43. However, similarly to the other symbionts presented above, these bacteria are also found in very diverse groups of organisms, from amoebae to Orthoptera, from humans to plants44,45,46,47. In the bean bug, Riptortus pedestris, studies have suggested that the bacteria can benefit their host by providing resistance to pesticides48. Although never tested, the presence of such Proteobacteria in moths could similarly enhance the host's ability to resist pesticides. If proven true, this could partially explain the global success of many pest moth species despite the development of various targeted control strategies.

Six genomes included high numbers of bracovirus reads: Erebus ephesperis (RZ11), Masca abactalis (RZ18), Nodaria verticalis (RZ180), Mecodina praecipua (RZ268), Idia aemula (RZ271) and Asota heliconia (RZ44). Bracoviruses are a known genus of mutualistic viruses with a complex life cycle. Integrated in the genome of a braconid parasitic wasp, the bracovirus is transcribed during oviposition in lepidopteran larvae49. The presence of this viral genetic material in adult moths might suggest an unsuccessful infection by the parasitoid, and the survival of the larvae carrying the parasitic viral particles. Another potential explanation is that the viral DNA is integrated into the lepidopteran genome, as it usually is in its common Hymenoptera host. Only studies simultaneously investigating parasitism success rate and tissue tropism of the bracoviruses in the Lepidoptera and Hymenoptera hosts will be able to inform on the nature of these interactions.

From a methodological point of view, the present study shows that an exploratory approach can successfully mine genomic data for potentially hidden associated microbial diversity. Our study was performed on shallow-coverage short-read genome data obtained using the Illumina platform. The original purpose of this sequencing effort was to study the phylogenomics of the host species50, but a similar approach to the one we have taken here can be applied to any publicly available genomic dataset. The popularity of genome-scale sequencing methods, such as the Illumina short-read approach, has created a wide, publicly open genomic resource that the research community can use to study questions outside the focus of the studies that generated the data. It is however important to also consider the limitations of such approaches. First, the quality and completeness of the reference datasets needed for programs like Kraken2 are bound to significantly affect the results. Second, incomplete and shallow genomes tend to yield false negatives when mined for many symbionts. In addition, the origin of the DNA used for the genome sequencing will affect any conclusion on the presence/absence or abundance of the symbionts detected and of those undetected. In our study, all the genomes used came from DNA extracted from legs; there is therefore a strong methodological bias against gut fauna, for example, although other studies have shown that some symbionts, such as Wolbachia, are also found in the haemolymph of arthropods51. Third, this kind of exploratory analysis of genomic material does not inform about the nature of the interaction between the organisms found in the genomic mix. Furthermore, in the majority of cases, this method also does not inform on the origin of the organisms. This is especially important as sample contamination has been a known problem since the appearance of molecular sequencing techniques. Finally, this method is not suitable for quantifying the organisms present.
Altogether, these limitations exemplify the exploratory nature of the approach we used in this study. At best, we provide grounds to suspect diverse symbiotic infections in different Erebidae moth species, whose presence and importance will only be fully confirmed via direct screening, and ecological and evolutionary studies of natural populations.


As we expected, our method detects various symbiotic partners in several Erebidae moth species, including Wolbachia and the bacteriophage WO in four species, Burkholderia in one other species, and Sodalis and Arsenophonus simultaneously in two species. Although symbiotic associations of Lepidoptera with Wolbachia are likely, similar long-term associations between the three other symbionts and the Lepidoptera have yet to be described. Similarly, we detect DNA material from bracoviruses that are currently only described as mutualistic symbionts of Hymenoptera. The true nature of these associations requires further experimental investigation. The detection of bracovirus DNA could for example suggest ecological interactions between moths and parasitoids, and the ability of the former to naturally resist parasitoid attack strategies. Altogether our study presents a method and produces material supporting testable hypotheses about the diversity and nature of symbiotic interactions in those particular Lepidoptera species. With the availability of open access metagenomics databases, this field promises extensive and exciting opportunities to explore potentially hidden symbiotic diversity.

Material and methods

Genome data

We used the data produced from the whole genome sequencing project of 47 Erebidae species (see50). The sampling information is shown in Table 1. This selection includes genomes representing the main described subfamilies and major lineages within the Erebidae family. The DNA was extracted from one or two legs of the selected samples. Extractions took place in the 2000s, over a decade ago, for the purpose of another study (see52). It is important to keep in mind that the genome sequencing approach that generated this dataset is not optimized to recover the symbiont diversity of these organisms; the diversity is therefore likely to be systematically underestimated.

Metagenomic analysis

The raw reads were quality checked with FASTQC v0.11.853. Reads containing ambiguous bases were removed from the dataset using Prinseq 0.20.454. Reads were then cleaned in Trimmomatic 0.3855: low quality bases were removed from the beginning (LEADING: 3) and end (TRAILING: 3) of each read, reads shorter than 30 bp were discarded, and read quality was evaluated with a sliding window approach. Quality was measured over sliding windows of 4 bp and had to be greater than PHRED 25 on average. Cleaned reads were assigned taxonomic labels with Kraken256 and MetaPhlAn 2.057. Kraken2 was run with a confidence threshold of 0.05 and an mpa-style output, using a custom database that contained the standard Kraken database, the RefSeq viral, bacterial and plasmid databases and all available Lepidoptera genomes from GenBank (Supplementary Table 1 contains a full list of taxa included). MetaPhlAn was run using the analysis type rel_ab_w_read_stats, which provides the relative abundance and an estimate of the number of reads originating from each clade. We visually screened the result for each sample, focusing on seven genera of vertically transmitted bacterial symbionts (i.e., Arsenophonus sp., Cardinium sp., Hamiltonella sp., Rickettsia sp., Sodalis sp., Spiroplasma sp. and Wolbachia sp.), one group of fungal symbionts (Microsporidia), and three types of viral symbionts (i.e., Wolbachia-phage WO, ichnovirus and bracovirus). This represents a non-exhaustive list of the maternally inherited symbionts found in diverse insect hosts, but covers all of those that have already been characterized within Lepidoptera10. We also checked for the presence of the gut bacteria Burkholderia sp., which are known to confer pesticide resistance to their host in the pest bean bug Riptortus pedestris (e.g., ‘can degrade an organophosphate pesticide, fenitrothion’)58.
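
The screening of the per-sample output for the seven bacterial genera listed above was done visually in this study. As an illustration only, a minimal sketch of how the same tally could be scripted from an mpa-style report is given below. It assumes that each report line holds a `|`-separated taxonomic path, a tab, and a read count, and that genus-level rows end in `g__<Genus>`. The function name `genus_read_counts` and the toy report lines are hypothetical and do not come from the study's data.

```python
from typing import Dict, Iterable

# The seven bacterial genera screened in this study.
TARGET_GENERA = (
    "Arsenophonus", "Cardinium", "Hamiltonella", "Rickettsia",
    "Sodalis", "Spiroplasma", "Wolbachia",
)


def genus_read_counts(
    report_lines: Iterable[str],
    targets: Iterable[str] = TARGET_GENERA,
) -> Dict[str, int]:
    """Tally reads per target genus from mpa-style report lines.

    Only rows whose last rank is a genus (g__) are used, so reads already
    included in a genus total are not counted again from species rows.
    """
    counts = {genus: 0 for genus in targets}
    for line in report_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        path, sep, value = line.rpartition("\t")
        if not sep:
            continue  # no tab-separated count on this line
        last_rank = path.split("|")[-1]
        if last_rank.startswith("g__"):
            genus = last_rank[3:]
            if genus in counts:
                counts[genus] += int(value)
    return counts


if __name__ == "__main__":
    # Toy lines for illustration only (not real results).
    toy_report = [
        "d__Bacteria\t500",
        "d__Bacteria|p__Proteobacteria|g__Wolbachia\t120",
        "d__Bacteria|p__Proteobacteria|g__Wolbachia|s__Wolbachia_pipientis\t100",
        "d__Bacteria|p__Proteobacteria|g__Sodalis\t7",
    ]
    for genus, n_reads in genus_read_counts(toy_report).items():
        print(genus, n_reads)
```

Restricting the tally to genus-level rows is a design choice of this sketch. Viral and fungal targets, which are not reported at genus rank in the same way, would need their own matching rule.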

To discriminate between true and false positives, a mapping analysis was carried out. For Wolbachia positive samples (list), cleaned reads were mapped to both the wMel (GCF_000008025.1) and wPip (GCF_000073005.1) genomes using bowtie2 v2.4.159 (sensitive local option). The resulting SAM files were converted to sorted BAM files with samtools v1.1060. Coverage information was obtained using samtools depth, and the resulting graphs were plotted with the ggplot package61 in R.
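
As a rough illustration of how mean coverage can be summarised from per-position depth, the sketch below reads the three-column, tab-separated text produced by samtools depth (reference name, position, depth). It is not the script used in this study. The function name `coverage_summary` is hypothetical, and the reference genome length is assumed to be supplied by the user, because positions with zero depth may be absent from the depth output unless all positions are requested.

```python
from typing import Iterable, Tuple


def coverage_summary(depth_lines: Iterable[str], genome_length: int) -> Tuple[float, float]:
    """Return (mean depth, fraction of positions covered) for one reference.

    depth_lines: lines of 'reference<TAB>position<TAB>depth'.
    genome_length: total reference length in bp (assumed known by the caller);
    dividing by it treats any position missing from the input as depth 0.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    total_depth = 0
    covered_positions = 0
    for line in depth_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        total_depth += depth
        if depth > 0:
            covered_positions += 1
    return total_depth / genome_length, covered_positions / genome_length


if __name__ == "__main__":
    # Toy depth lines for illustration only (not real results).
    toy_depth = ["ref\t1\t10", "ref\t2\t20", "ref\t4\t0"]
    mean_depth, breadth = coverage_summary(toy_depth, genome_length=10)
    print(f"mean depth: {mean_depth:.2f}x, breadth: {breadth:.2f}")
```

Reporting the covered fraction alongside the mean depth helps separate a sample with low but evenly scattered coverage from one where a few regions attract many reads.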