Introduction

Honey bees are central to the pollination of most flowering plants. They contribute more than $15 billion to the value of agricultural crops each year in the USA alone1. Wild bees are mostly solitary and also provide pollination services to up to 80% of flowering plants. A few important crops, such as tomatoes, eggplants, cranberries and blueberries, can be pollinated only by the buzz pollinators among these species. Extrapolated from the number of blueberries pollinated by each individual, the estimated value of each Habropoda laboriosa bee (south-eastern blueberry bee) is $20 (ref. 2). With as much as 40% of honey bees dying each year1, 3, these solitary bees are viewed as possible alternatives and should be targeted in future research.

H. laboriosa, found in the south-eastern region of the USA, is one of the earliest branching species in the family Apidae4, 5. This soil-dwelling species is oligolectic (a specialist pollinator) on blueberries (Genus Vaccinium)6. It is the evolutionarily closest solitary bee to the honey bees with a sequenced genome7. Similarly, Dufourea novaeangliae is a soil-dwelling solitary species found in the north-eastern USA, but it is an oligolege of pickerel weed (Pontederia cordata) 8, 9. It belongs to the family Halictidae and is phylogenetically the most distant bee species from the honey bees among all the sequenced species7. Owing to their unique phylogenetic placement and their solitary behaviour, these two species can be compared with the honey bees to analyse the evolution of eusociality.

The evolution of social complexity in insects is thought to be accompanied by changes in the olfactory machinery. Genes involved in the perception of chemical cues may also undergo selective evolution as the number of tasks that need communication among individuals of the same or another colony increases. The best gene family candidates to test this hypothesis are the olfactory/odorant receptors (ORs), for two reasons. First, they are known to undergo rapid birth-and-death evolution in response to the needs of each species10, 11. Second, they seem to have expanded in eusocial bees and ants compared to distant solitary insect orders12,13,14,15.

Analysis of OR evolution across a pair of closely related solitary/eusocial species has mostly been hindered by two problems: the non-availability of sequenced genomes, and gene annotation pipelines that miss a fair number of OR genes during gene prediction. Recently, the genomes of a few solitary bees were sequenced7. We addressed the second challenge by building a semi-automated computational pipeline (as described in our previous study on A. florea ORs)15. In this analysis, we introduced more distant queries and a target-focussed approach for searching the solitary bee genome sequences. The identified ORs were validated through domain searches, transmembrane helix (TMH) prediction and synteny analysis. This was followed by phylogenetic reconstruction of ORs from the two solitary bees, two honey bees, an ant and a wasp. Important subfamilies/clades of ORs identified from previous literature were inspected for possible unusual trends shown by solitary bees. Finally, we also investigated the presence of upstream cis-regulatory elements across lineages and OR subfamilies. Altogether, our analysis sheds light on the evolution of ORs and their putative regulatory elements in solitary bees and eusocial honey bees.

Results

Genome-wide survey (GWS) of ORs from two solitary bees

Our computational genome-wide survey for OR genes in the genomes of the solitary bees D. novaeangliae and H. laboriosa identified a total of 112 putative DnOrs and 151 putative HlOrs, respectively (Table 1, Supplementary Tables S1 and S2 and Supplementary Data S1 and S2).

Table 1 ORs identified through genome-wide survey.

In both species, more than 30 genes are annotated as ‘partial’ because they lack either terminus or have missing internal exons. This is due either to incomplete genomic scaffolds or to their involvement in forming an alternative isoform with a neighbouring gene model. As alternate gene models are difficult to confirm without transcriptome data, these gene models were retained as partial and were not considered pseudogenes unless they possess pseudogenizing elements such as frame-shifts or in-frame STOP codons. Despite such a large number of partial sequences, 80 to 90% of the total proteins from both genomes passed 7tm_6 (characteristic of Drosophila-like odorant receptors) validation, and the majority of them show six or more TMHs (Supplementary Fig. S1). More than 100 ORs were discovered at entirely new genic regions, where no gene had been annotated by the NCBI annotation pipeline. The majority of the remaining gene models differ from their overlapping NCBI gene counterparts; they were retained because they were better in terms of the presence of the 7tm_6 domain and transmembrane helices. Of the DnOrs and HlOrs, 11 and 19, respectively, were pseudogenes, distributed almost equally between complete and partial gene models. Upon annotation of DnOrs and HlOrs, we observed that bidirectional 1:1 orthologous relationships (as observed between AfOrs and AmOrs15) were rare between these two distantly related solitary bees, whereas 1:many or many:many relationships were more common.
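The partial/pseudogene distinction described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names, arguments and labels are hypothetical, and the sketch assumes the coding sequence is supplied in frame starting at its first base, with frame-shift detection done elsewhere and passed in as a flag.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_inframe_stop(cds, has_end=True):
    """Return True if a stop codon interrupts the reading frame.

    Assumption: `cds` starts in frame at its first base. When the model
    includes its true 3' end (`has_end`), the final codon is allowed to
    be the terminal stop and is therefore not checked.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing incomplete codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if has_end and codons:
        codons = codons[:-1]
    return any(codon in STOP_CODONS for codon in codons)


def classify_gene_model(cds, has_start, has_end,
                        missing_internal_exon=False, has_frameshift=False):
    """Return (completeness, is_pseudogene) for one gene model.

    Completeness and pseudogene status are kept independent, because
    pseudogenes occurred among both complete and partial gene models.
    """
    complete = has_start and has_end and not missing_internal_exon
    completeness = "complete" if complete else "partial"
    is_pseudogene = has_frameshift or has_inframe_stop(cds, has_end=has_end)
    return completeness, is_pseudogene


# Toy sequences, not real OR genes.
print(classify_gene_model("ATGGGGCCC", has_start=True, has_end=False))
print(classify_gene_model("ATGTAAGGGTGA", has_start=True, has_end=True))
```

Returning the two properties separately means that a model lacking a terminus is never called a pseudogene merely for being incomplete.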

Comparison of number of ORs across various insect orders

We compared the total number of OR genes with genome size across multiple insect orders (Fig. 1). The total number of ORs was correlated with genome size, with a Pearson’s correlation coefficient of 0.706, for insects from the orders Diptera, Lepidoptera, Hemiptera, Phthiraptera and Blattodea. Interestingly, most of the Hymenoptera and Coleoptera species possess a higher number of ORs than the species in the other orders.
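As an illustration of this comparison, the following Python sketch computes a Pearson correlation coefficient and a least-squares line, then tests whether a given species lies above that line. The helper names and all numbers in the example are hypothetical placeholders, not the data behind Fig. 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def lies_above(genome_mb, n_ors, slope, intercept):
    """True if a species has more ORs than the fitted line predicts."""
    return n_ors > slope * genome_mb + intercept


# Placeholder reference points (genome size in Mb, OR count).
ref_genome_mb = [100, 200, 300]
ref_or_counts = [50, 100, 150]
slope, intercept = fit_line(ref_genome_mb, ref_or_counts)
print(pearson_r(ref_genome_mb, ref_or_counts))
print(lies_above(250, 180, slope, intercept))
```

Fitting the line on the reference orders only, and then asking where other species fall relative to it, mirrors the way the figure treats Hymenoptera as a separate group.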

Figure 1
Comparison of total number of ORs and genome size for insects from various orders. Number of ORs and genome size in Mb are plotted for insects from the orders Diptera (circles), Lepidoptera (diamonds), Hemiptera (triangles), Phthiraptera (plus sign), Blattodea (cross sign), Coleoptera (star) and Hymenoptera (filled squares). The line shows the correlation between the two quantities for the first five orders (Pearson’s correlation coefficient = 0.706). Hymenopteran species are further divided into solitary (yellow), primitively eusocial (red) and advanced eusocial (brown) species. Note that most hymenopteran species lie above the line and do not follow any trend across degrees of eusociality.

Phylogenetic reconstruction of ORs

The phylogenetic tree of ORs from six hymenopteran species was divided into a total of 34 clades, including Orcos (Fig. 2a and b and Supplementary Fig. S3). We observed bootstrap support of more than 95 for most of the clades. The first 30 clades follow the subfamilies defined before13,14,15. The remaining clades contain NvOrs that were not considered in previous phylogenetic analyses but form their own clades (though less populated) and hence are treated as separate clades.

Figure 2
Phylogenetic reconstruction of ORs from solitary bees with other hymenopteran ORs. Phylogenetic reconstruction of ORs from D. novaeangliae, H. laboriosa, A. florea (dwarf Asian honey bee), A. mellifera (European honey bee), H. saltator (Indian jumping ant or Jerdon’s jumping ant) and N. vitripennis (parasitoid jewel wasp). (a) Phylogenetic tree of ORs - Branches are coloured according to the species. The clades are specified by surrounding colour strips around the phylogenetic tree. Description of these OR clades/subfamilies is given in (b). Clade X is further subdivided into three groups- Xa, Xb and Xrest and respective OR distribution is given at the bottom. Detailed phylogeny with OR names and bootstrap values can be found at Supplementary Fig. S3.

First 21 OR clades were identified in both the honey bees15. ORs from both the solitary bees also clustered with these clades. DnOrs were missing from the subgroup Xa (part of clade X or subfamily L), XVI (subfamily Z) and XVII (subfamily G). HlOrs corresponding to clades VIII (subfamily P) and XII (subfamily F) could not be identified. Additional two sequences each from the two species clustered with clade XXV (subfamily R) and XXVII (subfamily X).

In most of the clades, the number of ORs from the solitary bees is lower than that from the honey bees, as could be expected from the comparison of the total number of ORs. However, there were notable exceptions, such as clade VI (subfamily T) and clade XV (subfamily E). The number of HsOrs and NvOrs in clade XV is even greater. Clade X (subfamily L and all its subgroups) and clade XXI (subfamily J) show a gradual increase in the number of ORs from D. novaeangliae to A. mellifera, with almost no OR from solitary bees clustering with the other Xa group members. The overall number of ORs from solitary bees was also smaller in clade XVIII (subfamily H) than in the honey bees and the ant species under study. The possible implications of these observations are discussed later.

Syntenic regions

Many DnOrs and HlOrs displayed a syntenic order similar to that observed for AfOrs. We focussed on a particular stretch of 57 AfOrs present on one scaffold from A. florea and its homologous regions in the two solitary bees under study (Supplementary Fig. S2). We confirmed retention of a syntenic order similar to that observed before in AmOrs and AfOrs, but with fewer 1:1 reciprocal best hits (orthologous ORs) and more 1:many or many:many homologous hits in similar order on the three genomes. A recent analysis of ORs shows similar trends in corbiculate bees separated over broad divergence times16. A close comparison of the D. novaeangliae, H. laboriosa and A. florea scaffolds revealed an increase in the number of putative tandemly duplicated ORs in A. florea. In particular, two regions on the scaffold, consisting of AfOr4-15 and AfOr36-50, showed extensive tandem duplications. The short intergenic regions between the pairs a) DnOr3 and DnOr24like_1, b) DnOr35 and DnOr53/54, c) HlOr1 and HlOr16_2PF and d) HlOr35 to HlOr51 also support the conclusion that genes orthologous to AmOr4-15 and AfOr36-50 are likely absent in the two solitary bees.
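The distinction between 1:1 reciprocal best hits and 1:many relationships can be sketched in Python as follows. This is a generic illustration, not the procedure used in the study: it assumes the best hit of each gene in the other genome has already been determined (for example from a similarity search), and the gene identifiers are placeholders.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


def shared_best_hits(best_a_to_b):
    """Group genes of genome A that share the same best hit in genome B.

    Only targets hit by more than one query are kept; these point to
    1:many relationships, such as lineage-specific tandem duplicates.
    """
    groups = {}
    for a, b in best_a_to_b.items():
        groups.setdefault(b, []).append(a)
    return {b: sorted(qs) for b, qs in groups.items() if len(qs) > 1}


# Placeholder identifiers: three duplicated genes in genome A all point
# to one gene in genome B, and a fourth pair is a clean 1:1 ortholog.
a_to_b = {"A_or1": "B_or1", "A_or2": "B_or1", "A_or3": "B_or1", "A_or9": "B_or9"}
b_to_a = {"B_or1": "A_or2", "B_or9": "A_or9"}

print(reciprocal_best_hits(a_to_b, b_to_a))
print(shared_best_hits(a_to_b))
```

A tandemly expanded region shows up as several genes of one genome collapsing onto a single gene of the other, with at most one of them forming a reciprocal pair.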

Putative cis-regulatory regions of hymenopteran ORs

We identified 10 conserved motifs from unaligned 300 bp upstream regions of OR genes from six hymenopteran species using the expectation maximization algorithm implemented in MEME v4.11.217. We required an E-value better than 10^−10 and a total number of occurrences across all provided sequences of more than 40 (Fig. 3, Supplementary Table S3 and Supplementary Fig. S4). Comparison of these possible OR cis-regulatory motifs across species revealed a differential distribution of motifs between the bee lineage and the ant H. saltator (Table 2).
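The two preparatory steps described here, extracting a fixed-length upstream region and applying the motif cut-offs, can be sketched in Python. This is an illustrative sketch only: the function names and the coordinate convention (0-based, half-open, start < end on either strand) are assumptions, and the motif discovery itself is left to MEME.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(scaffold, start, end, strand, length=300):
    """Return up to `length` bp upstream of a gene, read 5'->3' on its strand.

    Assumed convention: the gene occupies scaffold[start:end] with
    start < end regardless of strand. Regions are truncated at scaffold
    ends rather than padded.
    """
    if strand == "+":
        return scaffold[max(0, start - length):start]
    if strand == "-":
        return reverse_complement(scaffold[end:end + length])
    raise ValueError("strand must be '+' or '-'")


def passes_cutoffs(evalue, n_sites, max_evalue=1e-10, min_sites=40):
    """Keep motifs with E-value better than 1e-10 and more than 40 sites."""
    return evalue < max_evalue and n_sites > min_sites


# Toy scaffold; real input would be a genome scaffold sequence.
scaffold = "AAACCCGGGTTT"
print(upstream_region(scaffold, 6, 9, "+", length=3))
print(upstream_region(scaffold, 3, 6, "-", length=6))
print(passes_cutoffs(1e-12, 41))
```

Handling the minus strand explicitly matters because the upstream region of a reverse-strand gene lies at higher scaffold coordinates and must be reverse-complemented before motif searching.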

Figure 3
Upstream conserved DNA elements of hymenopteran ORs. Ten upstream conserved motifs modelled using MEME for ORs from six hymenopteran species are shown here. Their E-values and number of occurrences are mentioned below each motif. More information can be found at Supplementary Table S3 and Supplementary Figs S4 and S5.

Table 2 Distribution of putative upstream regulatory elements of ORs across bee, ant and wasp lineages.

Motifs 1, 2 and 4 were more prevalent in bee species, whereas Motif 3 is more prevalent upstream of ORs from H. saltator. Motifs 6, 7, 8, 9 and 10 were also found to be more prevalent in the ant as compared to the other species. Motif 5 was more prevalent in the two solitary bees and the ant, but it was less abundant in honey bees. None of the motifs were highly prevalent in the wasp N. vitripennis. This could be attributed to our motif detection method, as it relies on the abundance of the motifs in all the given input sequences (in this case highly dominated by the bee ORs).

We further divided our dataset into the bee lineage and the ant lineage and calculated the distribution of motifs per clade (Supplementary Tables S4 and S5 and Supplementary Fig. S5). In bees, high percentage of ORs from clade IX (subfamily K) and clade X (subfamily L) contain Motifs 1, 2 and 4 upstream. Motif 2 was highly abundant and considerably abundant upstream of bee ORs from clade VIII (subfamily P) and clade XI (subfamily 9-exon) respectively. A large percentage of ORs from clade XVII (subfamily G) also show presence of Motif 10 upstream to them.

Motif 3 was abundant upstream of ant ORs from clade III (subfamily V), clade IV (subfamily U), clade VI (subfamily T) and clade X (subfamily L). Other than Motif 3, clade III ant OR genes also had a high percentage of Motifs 7 and 8 upstream of them. In contrast, clade IV ant ORs additionally had Motifs 6, 8, 9 and 10. Motifs 5 and 9 were also abundant upstream of ant ORs from clade VI. Motif 5 was abundant in a few nearby clades as well. Apart from these, a few clades contain very few OR genes, and hence a high percentage of motif occurrence upstream of them may not be biologically significant.

We examined whether any of these motifs are already known to be transcription factor binding sites (TFBS). Only Motif 1 had good similarity to the TFBS of a known vertebrate transcription factor, NRF1 (‘Nuclear Respiratory Factor 1’), with an E-value of 10^−4.

Discussion

All Hymenoptera possess high number of ORs irrespective of their degree of social complexity

The number of ORs in D. novaeangliae is the lowest among all the bees whose OR repertoires have been studied from fully sequenced genomes. In spite of that, the number of DnOrs (solitary halictid bee; 112 in total) is marginally larger than the number of ORs found in most other insect genomes. The number of HlOrs (solitary Apidae) is considerably higher (151 in total). Comparison of genome assembly quality across multiple well-studied bee genomes, an ant genome and a wasp genome shows that the overall assembly quality (N50) of D. novaeangliae is very good, second only to that of A. mellifera (Supplementary Table S6). The assembly quality of H. laboriosa is also good in comparison with other genomes (Supplementary Table S6). Hence there is a very low probability that ORs were completely missed because of assembly quality.

The solitary megachilid bee M. rotundata is reported to have a similar number of about 140 ORs (NCBI Gene database)18. The facultative primitively eusocial bee E. mexicana possesses 142 ORs16. Another primitively eusocial halictid bee, L. albipes, with a colony size of only 10 bees, has around 180 ORs, and the obligate primitively eusocial Apidae B. terrestris, with a colony size of about 100 workers, has 164 ORs in its genome19. The numbers of ORs in the last two primitively eusocial bees are similar to that found in the advanced eusocial honey bees (around 180)12, 14, 20, 21. Among bees, the obligate advanced eusocial stingless bee M. quadrifasciata possesses the highest number of ORs (196 ORs)16. The advanced eusocial ants, including the most basal species H. saltator, have more than 300 ORs13, 14, 22. Both solitary endoparasitoid wasps, N. vitripennis and M. mediator, possess more than 200 ORs. To the best of our knowledge, among all hymenopteran species with sequenced genomes, only the highly specialised fig wasp C. solmsi possesses fewer than 100 ORs14. Antennal transcriptome based OR studies fail to capture the entire OR repertoire, as shown in our previous paper15. To summarize, solitary, primitively eusocial and advanced eusocial hymenopteran insect species all have genomes of 200–300 Mb, but their OR repertoires vary widely, and our analyses show that there is no correlation between OR numbers and social life style (Fig. 1). However, there could be a correlation between social organization and the number of ORs in specific subfamilies involved in intra-specific communication.

9-exon subfamily/clade XI is equally large in solitary bee H. laboriosa and eusocial honey bees, whereas in solitary bee D. novaeangliae the repertoire size is only half

A subfamily of ORs called 9-exon (clade XI) has been hypothesised to be enlarged in eusocial species and suggested to be involved in nest-mate recognition via cuticular hydrocarbons (CHCs)23, 24. One argument was that these ORs are more highly expressed in the workers of eusocial species13,14,15, 25. For example, OR transcripts from the 9-exon-alpha group were shown to be enriched on the ventral surface of the antennae of workers in the clonal raider ant Ooceraea biroi 25. The workers touch (or antennate) nest mates with this region of their antennae. In honey bees there is a correlation between worker-specific olfactory sensilla26 and worker-enriched expression of 9-exon ORs (Supplementary Fig. S6)15. Indeed, HsOr271 strongly responds to 13,23-dimethyl-C37 (probably a fertility signal), whereas HsOr259-L2 responds to C37 (ref. 27). Thus, honey bee ORs from the 9-exon-alpha group, i.e. AmOr 122–139, 159, 172–177 and their homologs in other bee species, might be involved in contact-based nest-mate recognition. Furthermore, many eusocial insects also use saturated and unsaturated hydrocarbons as sex pheromones, which might have resulted in a higher variety of ORs recognizing such hydrocarbons in eusocial species as compared to their solitary relatives within a lineage28.

Contrary to the above theory, with more solitary species under study, we do not see increased 9-exon ORs in eusocial insects alone. The number of ORs in D. novaeangliae belonging to the 9-exon clade is almost half that of the other bees, but H. laboriosa has more ORs in this clade than any other well-studied bee. Both solitary species form aggregations (H. laboriosa being more gregarious), but there is no record of active aggregation recognition behaviour by either of the two6, 8. Similarly, the obligate primitively eusocial bumble bee possesses almost the same number of 9-exon ORs as the honey bees, but the obligate advanced eusocial stingless bee possesses only 26 ORs (Supplementary Table S6)16. Two facultative primitively eusocial orchid bees possess a number of ORs in this subfamily similar to that of D. novaeangliae (Supplementary Table S6)16. None of the ORs of the solitary wasp N. vitripennis clusters with the 9-exon-alpha group, but the species does possess 90 ORs that group with other 9-exon ORs. Reanalysing the clustering of ORs of another solitary parasitoid wasp, Microplitis mediator, with AmOrs, we found only 13 ORs in the 9-exon clade. Seven of these are male enriched (MmedOR3,4,5,7,9,19,26)29, 30. The question arises: why do these solitary species have so many putative CHC-sensing ORs?

There are two possibilities. First, not all 9-exon-alpha ORs may be CHC responders, as the evidence for this is mostly indirect25. In this scenario, the ORs belonging to this group may respond to a yet unidentified group of odorants. These odorants need not be linked to eusociality, but rather to other factors controlling the communication system that differ among these species. The second possibility, that they are indeed CHC responders, needs further analysis along the lines discussed next.

The CHC receptor repertoire of any species should depend on the complexity of its communication system, but this might not necessarily correlate with the degree of sociality. CHCs probably first evolved as a desiccation and parasite barrier, and later acquired a function as chemical signals for various communication purposes. CHC profiles vary widely between species as well as within species with respect to food, age, mating status, etc. Some parasitoid wasps use CHC profiles to identify hosts or prey, or use them for mimicry31, 32. Solitary insects can use CHCs as male attractants, which is probably reflected in the male-enriched expression of putative CHC ORs in Microplitis mediator. Overall, in this scenario as well, the complex chemical ecology of solitary and social insects, rather than their degree of eusociality, seems to drive the evolution of putative CHC receptors. Direct experimental identification of the cognate ligands of these receptors is needed to understand the function of these varying repertoire sizes across species.

Putative honey bee queen mandibular gland pheromone receptor OR group is not expanded in solitary bees

Insect lineages that have evolved unique chemical signals for specific behaviours may harbour lineage-specific OR clusters. Honey bee queen mandibular gland pheromones have been studied extensively33. Unlike in many other insect species, the major components of this mixture are keto-acids, alcohols and esters34. In all honey bee species studied so far, the mandibular gland pheromones are composed of the same components in different relative concentrations35, 36. AmOr11 was identified to bind 9-ODA, the major component of the queen mandibular gland pheromone37. AmOr11 belongs to subgroup a (Xa) of subfamily L/clade X, which contains A. mellifera and A. florea OR4 to 17 (ref. 15). In addition, several ORs from subgroup Xa show higher RNA expression levels in drones than in workers in A. mellifera and A. florea 15, 37.

Closer inspection of phylogenies published for ORs from other corbiculate bee species and ant species shows an interesting trend (Supplementary Table S6)14, 16. In subfamily L, the total number of ORs in obligate advanced eusocial honey bees, a stingless bee and ant species is higher than that in a bumble bee (obligate primitively eusocial) and orchid bees (solitary to primitively/weakly eusocial). For the two non-corbiculate ancestrally solitary bees studied here, the numbers of ORs in subfamily L are almost one third of those in honey bees (Figs 2 and 4). The absence of tandem duplication of a few of these ORs in solitary bees is supported by the synteny analysis. It also shows an increase in tandem duplication events of ORs in A. florea as compared to H. laboriosa, and in H. laboriosa as compared to D. novaeangliae (Supplementary Fig. S2). The solitary wasp N. vitripennis has the lowest number of ORs in this subfamily, none of which belongs to either the Xa or the Xb subgroup.

Figure 4
OR subfamily L with distribution of conserved upstream motifs 1 to 4. Phylogenetic tree of hymenopteran ORs from only subfamily L/clade X. Group Xa - putative pheromone receptor clade - is shown in green branches. Group Xb is shown in blue branches. The A. mellifera pheromone receptor AmOr11, for the major component of queen mandibular gland pheromone (9-ODA), is highlighted in magenta. The 4-methoxyphenylacetone receptor, HsOr55, is also shown in magenta. Motifs 1 to 4 are shown in concentric circles from centre to periphery in red, orange, cyan and purple, respectively. Note that putative pheromone receptors of the bee lineage possess only motif 1 upstream of them. Upstream DNA regions of bee ORs from group Xb possess motif 1, motif 2 and motif 4. Motif 1 is completely absent in a set of ORs (AmOr36-45,47,48 and corresponding homologs). On the other hand, motif 3 is exclusively present upstream of H. saltator (ant) ORs. More information can be found at Supplementary Figs S4 and S5.

The increase in the number of ORs in subgroup Xa is even sharper with increasing degree of eusociality. Moreover, of all solitary bee ORs analysed, only one HlOr belongs to the subgroup Xa of putative honey bee queen mandibular gland pheromone receptors (Figs 2 and 4). These findings agree well with the theory of selective expansion of clades responsible for the evolution of eusociality in Hymenoptera. Interestingly, a considerable number of ORs in Harpegnathos saltator also cluster with clade X, and a few of them also show male-enriched expression, but they form their own group away from the ORs of the bees. These ORs could be involved in recognition of ligands that are similar to the honey bee queen mandibular gland pheromones or other male-attracting sex pheromones. HsOr36 is one of the male-enriched ant ORs from the same subgroup. It has been shown to bind octacosane, a longer chain hydrocarbon, but currently there is no evidence for such CHCs as sex pheromones in H. saltator 27. Five other HsOrs from the same subgroup also displayed male-enriched expression with subthreshold (<30 spikes) responses to many CHCs27. Hence the cognate odours for these HsOrs possibly remain to be identified.

The other subgroup, Xb, also shows a large difference in the number of ORs between honey bees and solitary bees (Figs 2 and 4). This group contains the 4-methoxyphenylacetone receptor from H. saltator, HsOr55 (ref. 13). This odorant is a component of anise essential oil, which has been shown to have a repellent effect on mosquitoes38, 39 and a lethal effect on a few insect pests40. In contrast, anise is an attractant for bees and beetles, and is often used in honey bee behavioural experiments41, 42. HsOr59 has been shown to be stimulated by formic acid (alarm pheromone of formicine ants), citronellol, geraniol and 2,3-butanedione27.

Subfamily H, a subfamily with putative floral scent receptors, is enlarged in generalist flower visitors

AmOr151 and AmOr152 from clade XVIII (subfamily H) respond to linalool and other floral scents43, and most ORs from this clade are more highly expressed in workers than in drones15. Thus this clade has been recognized as a putative floral scent receptor clade, probably specialised on terpenoids. Interestingly, the number of DnOrs and HlOrs belonging to this clade is very small compared to both honey bees (Fig. 2). Both D. novaeangliae and H. laboriosa are specialist pollinators. H. laboriosa is oligolectic on blueberry (Genus Vaccinium) in some states of the USA44 and D. novaeangliae is an oligolege of pickerel weed (Pontederia cordata)8, 9. It is possible that these specialist species do not need a variety of floral scent receptors, and hence that this clade did not expand in them as it did in the honey bees (generalists). The ORs in this clade could be important for pollen and/or nectar scent detection.

Interestingly, AfOr155 was found to be more abundant in males than in females (in contrast to the expectation that typical floral scent receptors are enriched in workers), whereas two other AfOrs do show significant female-enriched expression15. HsOr210 is a distantly related worker-enriched ant OR from the same subfamily, but it gave a suprathreshold (>30 spikes) response to a C32 CHC. HsOr209 responds strongly to 2,3-butanedione27. In the light of these contrasting discoveries, it is imperative to deorphanize the other ORs from the clade experimentally.

Previously identified bee specific clade is expanded in solitary bees as well

Clade XXI (subfamily J) was previously identified to be expanded in honey bees and orchid bees as compared to ant or wasp species13, 15, 16, 45. The same was observed for a bumble bee, a stingless bee (both corbiculate bees) and a halictid bee19, 46. This study establishes that the clade is expanded in obligate solitary bees as well (Fig. 2) and points to the involvement of these ORs in a mechanism shared by all bees irrespective of their degree or plasticity of eusociality. Could this be an OR subfamily for non-terpenoid floral scents? Cognate ORs for aromatic and aliphatic odours indeed tend to cluster separately from terpenoid ORs in a phylogeny of moth ORs47. More experimental analysis in bees will be needed to discover the function of these ORs.

Other important phylogenetic clades

Clades VI (subfamily T) and XV (subfamily E) are expanded in solitary bees, but none of their cognate odorants is known. Other than Orco, the number of ORs from each bee is conserved in clades II (subfamily I - AmOr161 and its orthologs), V (subfamily Q - AmOr160 and its orthologs), VII (subfamily M - AmOr62 and its orthologs), XIII (subfamily B - AmOr119 and its orthologs), XIV (subfamily C - AmOr116 and its orthologs) and XIX (subfamily W - AmOr120 and its orthologs). These ORs are expressed at similar levels in both worker and drone antennae of A. florea (except AfOr120)15 and possibly represent more ancestral and important clades for bees; again, the possible functions of most of them are unknown.

Recently, many ORs were tested for their responsiveness to an array of CHCs27. HsOr188 from subfamily B was found to respond to a C20 alkane, a shorter hydrocarbon that is less abundant in a typical insect cuticle27. The homologous bee ORs AmOr119, AfOr119, DnOr119 and HlOr119 are highly likely to show affinity to the same ligand in both sexes, based on their high sequence identities, the conservation of the number of clade members across bees as well as other Hymenoptera, and similar levels of expression in males and females of A. florea.

In addition to the above, HsOr170 (subfamily V) responded to longer chain CHCs, and HsOr236 (subfamily E) was unique in responding to two even-numbered longer hydrocarbons found rarely in insect cuticles27. HsOr161 (subfamily V) displayed an excitatory response to ethyl acetate and an inhibitory response to the pheromone 3-methyl-1-butanol, citronellol, citral and geraniol.

Hymenopteran OR genes possess conserved upstream DNA elements that are species-lineage-specific and OR-subfamily specific

Analysis of cis-regulatory elements of insect ORs has been previously performed in only Drosophila to the best of our knowledge48,49,50. Since we are interested in finding conserved elements that are universal across Hymenoptera, we performed a search for possible cis regulatory elements across six hymenopteran species (Fig. 3).

We found that the distribution of motifs was highly dependent on the lineage of the species (bee or ant) (Table 2), as well as on the subfamily/clade identity of the downstream ORs (Supplementary Tables S4 and S5), but the motifs do not show exactly the same evolutionary pattern as the downstream ORs. One motif, called Motif 1, was conserved at −50 to −150 upstream of the translation start site in almost all bee ORs from subfamily L. It was the only motif found upstream of ORs from the putative honey bee queen mandibular gland pheromone receptor group (Xa) of subfamily L. Detailed analysis of the subfamily shows a gradual decrease in the abundance of Motif 1 upstream of bee ORs from group Xa to Xrest to Xb (Fig. 4), accompanied by an increase in the abundance of Motifs 2 and 4. A subset of ORs from Xb, AmOr36-48 (except 46) and their homologues in other bees, seem to have replaced Motif 1 with Motif 2 in almost exactly the same upstream position. The Motif 1 PSSM allows for many substitutions, and hence it was found upstream of as many as 4000 genes (including OR genes) across the four bees. It may therefore seem to be a ubiquitous DNA element that occurs upstream of genes because of their high GC content, but closer inspection showed that the exact 5′-ACGCAAGCGC-3′ sequence was found upstream of a total of 37 ORs and only about an equal number of other genes from the four bees. This is a substantial enrichment upstream of the OR gene family compared with any other. Similarly, 5′-GCGCAAGCGC-3′, 5′-GCGCAAGCGT-3′ and 5′-GCGCAAGCTC-3′ are enriched upstream of ORs as compared to other genes. Overall, the 5′-[A/G]CGCAAGCG[C/T]-3′ sequence specifically seems to be more enriched upstream of ORs than of other genes.
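A degenerate sequence such as 5′-[A/G]CGCAAGCG[C/T]-3′ can be searched for with a regular expression, and a simple fold enrichment can then be computed between two gene sets. The Python sketch below is illustrative only: the function names are hypothetical, it scans the given strand only (an assumption), and the sequences and counts in the example are toy values, not results from the study.

```python
import re

# [A/G]CGCAAGCG[C/T] written as a regular expression character-class pattern.
MOTIF1_CORE = re.compile(r"[AG]CGCAAGCG[CT]")


def count_with_motif(upstream_seqs):
    """Number of upstream sequences containing the core at least once."""
    return sum(1 for s in upstream_seqs if MOTIF1_CORE.search(s.upper()))


def fold_enrichment(hits_a, total_a, hits_b, total_b):
    """Ratio of motif-containing fractions in set A versus set B."""
    if total_a == 0 or total_b == 0 or hits_b == 0:
        raise ValueError("totals and background hits must be non-zero")
    return (hits_a / total_a) / (hits_b / total_b)


# Toy upstream sequences (not real data).
toy_upstream = [
    "TTACGCAAGCGCTT",  # contains ACGCAAGCGC
    "ggGCGCAAGCGTaa",  # contains GCGCAAGCGT (case-insensitive)
    "GCGCAAGCTC",      # differs at the ninth position, so no match
    "AAAAAAAAAA",
]
print(count_with_motif(toy_upstream))
print(fold_enrichment(2, 10, 1, 100))
```

Note that the third toy sequence, 5′-GCGCAAGCTC-3′, is one of the variants reported as enriched but falls outside the stricter consensus, so a consensus-only scan would not count it.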

We compared our 10 upstream DNA motifs with known TFBS to find any known transcription factors that might regulate these ORs. Motif 1 bears substantial similarity to the TFBS of the vertebrate transcription factor NRF1 - Nuclear Respiratory Factor 1 (central palindromic region - 5′-CGCATGCG-3′). The known ortholog of NRF1 in Drosophila is Ewg or ‘Erect Wing’, which is responsible for muscle as well as neural development, but there is no direct annotation for regulation of olfaction51. However, it is known to regulate the specification and maintenance of photoreceptor subtype R8 in Drosophila 52. We propose that the similarly robust monoallelic expression of one or a few ORs per olfactory sensory neuron might be regulated through Ewg or another similar transcription factor that recognises Motif 1 (at least for subfamily L of ORs). Interestingly, Motif 1 was also found upstream of genes coding for transcription regulators involved in neuronal development and differentiation, including ‘acj6 - abnormal chemosensory jump 6’, a POU-domain transcription factor known to regulate odour specificities in a set of neurons49, 53, 54. Is Motif 1 a TFBS for a yet unknown master regulator of olfactory sensory neuron type determination? More experimental analysis is needed to support this theory.

A detailed analysis of the evolution of ORs has been performed across species that are at the two extremes of the social complexity scale. The H. laboriosa (Apidae) lineage diverged from the honey bee (Apidae) lineage more than 80 million years ago and D. novaeangliae (Halictidae) diverged from honey bees around 120 million years ago7. As these bees are more closely related to each other than to the wasps (which were previously the only obligate solitary species available for comparison), the comparison of the number of ORs across subfamilies is more meaningful and certain patterns can be derived. We identified the OR gene set from the solitary bees ancestral to two independent events of eusocial development, D. novaeangliae (112 DnOrs) and H. laboriosa (151 HlOrs). The entire OR repertoire does not show considerable expansion in eusocial insects. Instead, insects from the order Hymenoptera tend to have larger OR repertoires that cannot be entirely explained by their genome sizes alone. However, a subset of OR subfamilies that respond to queen/female sex pheromones may show a trend that correlates with the sociality status of the species. Examples of such clades are 9-exon (putative CHC receptors) and L (contains putative honey bee queen mandibular gland pheromone receptors), and such a trend was indeed followed in the latter case. Additionally, subfamily H of putative floral scent receptors is not expanded in either solitary bee, possibly due to their specialist nature. In contrast, subfamily J, which was previously found to be expanded in the primitively to advanced eusocial bees, is also expanded in both solitary bees, indicating its contribution to bee-specific olfactory requirements that are yet to be unearthed.
We also found an array of conserved upstream elements for OR genes. Their distribution is specific to species lineage and OR subfamily, but it does not exactly mirror the evolution of the OR proteins themselves. These likely cis-regulatory elements and their combinations may control expression of hymenopteran ORs; for example, Motif 1 is likely to govern expression of multiple bee ORs from subfamily L and possibly of other olfactory genes.

Methods

Genome-wide survey (GWS) for OR genes from D. novaeangliae and H. laboriosa

We identified OR genes from the genomes of the two solitary bees using semi-automated manual curation of sequence homology-based searches. The query dataset for the search was built using previously curated OR protein sequences from closely as well as distantly related species of the insect order Hymenoptera. These include ORs from A. mellifera 12, 20, 21, A. florea 15, B. impatiens 19, M. rotundata (from NCBI Gene database), L. albipes 14, C. biroi 55, N. vitripennis 56, M. mediator (from NCBI Gene database) and C. cinctus 57. Fragmented proteins (with lengths smaller than 100 amino acids) and extended erroneous proteins (with lengths longer than 600 amino acids) were removed. ORs with a 7tm_6 domain (characteristic of Drosophila-like odorant receptors) were retained, as identified by batch CD-search58, 59 with an E-value cut-off of 0.01. These 1249 curated OR protein queries were used to search against genomes of D. novaeangliae Version 1.17 and H. laboriosa version 1.27 (both downloaded from NCBI) using Exonerate Version 2.2.0 with BLOSUM62 matrix and maximum intron length of 200060.
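The length-based filtering of query proteins can be sketched as follows. This is an illustrative example only: the function names are hypothetical, the thresholds are those stated above (sequences shorter than 100 or longer than 600 amino acids are removed), and the subsequent 7tm_6 domain check by CD-search is not reproduced here.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first token of the header as the identifier.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_queries(records, min_len=100, max_len=600):
    """Drop fragments (< min_len aa) and over-extended models (> max_len aa)."""
    return [(name, seq) for name, seq in records
            if min_len <= len(seq) <= max_len]


if __name__ == "__main__":
    toy = [">q1 fragment", "M" * 99, ">q2", "M" * 350, ">q3", "M" * 601]
    kept = filter_queries(read_fasta(toy))
    print([name for name, _ in kept])
```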

The best-scoring Exonerate alignment for every unique location on the genomic scaffolds was selected and compared with the NCBI annotations, and the corresponding putative protein sequences were extracted using in-house Perl scripts. In rare cases, our gene models were modified with the help of NCBI gene annotations to exclude pseudogenizing elements or to obtain better START and STOP positions. Additional Exonerate searches were performed to complete partial gene predictions, using parameters such as a maximum intron size of 10000 and the PAM250 matrix. TBLASTN61 was also used in some cases with PAM250 and BLOSUM45 matrices, unmasking of repeat-rich regions and relaxed gap-introduction and gap-extension penalties to ensure completeness of the gene model. Every gene model was manually checked for the presence of START and STOP codons, correct intron-exon boundaries and similarity with the existing gene models from NCBI annotation release 10062, 63. Gene models were further stitched/modified wherever needed and OR protein sequences were corrected. Gene models that still possessed frame-shifts with respect to the most identical query sequence, or intermittent STOP codons, even after manual curation were declared pseudogenes. ORs obtained through this genome-wide survey were annotated according to their orthology with AmOrs. Perfect best bidirectional BLASTP64 hits were named ‘DnOr/HlOr’ followed by the respective ‘OR type/number’ from A. mellifera. If the hits were not bidirectional, the respective OR type/number was suffixed with ‘like’. If multiple sequences possessed the highest identity with a single AmOr sequence, they were suffixed with ‘_’ and an incremental number. Hypothetical proteins from pseudogenes were created by introducing ‘X’ in place of the STOP codon or frame-shift mutation and their names were suffixed with ‘P’. Sequences with only the N- or C-terminus were suffixed with ‘N’ or ‘C’ respectively.
If both termini were missing or the protein was present in multiple fragments stitched together, it was suffixed ‘F’, ‘N_C’, ‘N_F’ or ‘F_C’. In gene models with intermittent missing amino acids, ‘Z’s were introduced in the sequence. Sequences very distantly related to AmOrs as well as AfOrs were assigned the new numbers 180 and 181. The numbers of DnOrs and HlOrs were compared with ORs across multiple insect orders collected from literature sources (or rarely from NCBI Gene database): Diptera65,66,67,68,69,70, Lepidoptera71,72,73, Hemiptera74, Phthiraptera75, Blattodea76, Coleoptera77 and Hymenoptera12,13,14,15, 19, 56, 78.
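The orthology-based naming rules can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the procedure actually scripted for this study: the function and variable names are hypothetical, the exact formatting of the ‘like’ and ‘_n’ suffixes is assumed, and the pseudogene and fragment suffixes (‘P’, ‘N’, ‘C’, ‘F’) are not handled.

```python
from collections import defaultdict


def name_ors(prefix, forward_best, reverse_best):
    """Assign names from best BLASTP hits against A. mellifera ORs.

    forward_best: new gene id -> AmOr number of its best hit (e.g. "11").
    reverse_best: AmOr number -> new gene id that is its best hit.
    The way suffixes are concatenated is an assumption of this sketch.
    """
    names = {}
    for gene, am_number in forward_best.items():
        if reverse_best.get(am_number) == gene:
            # Perfect best bidirectional hit: plain orthologue name.
            names[gene] = f"{prefix}{am_number}"
        else:
            # One-directional hit only: mark as 'like'.
            names[gene] = f"{prefix}{am_number}like"

    # Several genes sharing one name receive incremental '_n' suffixes.
    genes_by_name = defaultdict(list)
    for gene, name in names.items():
        genes_by_name[name].append(gene)
    for name, genes in genes_by_name.items():
        if len(genes) > 1:
            for index, gene in enumerate(sorted(genes), start=1):
                names[gene] = f"{name}_{index}"
    return names


if __name__ == "__main__":
    forward = {"g1": "11", "g2": "11", "g3": "11", "g4": "2"}
    reverse = {"11": "g1", "2": "g4"}
    print(name_ors("DnOr", forward, reverse))
```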

Validation of OR gene models

Final OR protein sets from both the solitary bees as well as from A. mellifera - AmOrs, A. florea - AfOrs, H. saltator - HsOrs and N. vitripennis - NvOrs were subjected to transmembrane helix (TMH) prediction using HMMTOP79, 80, TMHMM81, 82 and PolyPhobius83, 84. A consensus TMH prediction was derived for each amino acid of all sequences based on the support of at least two out of three methods85 and was compared across these datasets. Similarly, sequence domain searches against Pfam86 and CDD59 were performed for the same datasets and compared.
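The per-residue consensus rule (support from at least two of three predictors) amounts to a majority vote, which can be sketched as follows. The input encoding is an assumption made for illustration: each predictor's output is represented as a string of the same length as the protein, with ‘M’ marking membrane residues; parsing of the actual HMMTOP, TMHMM and PolyPhobius output formats is not shown.

```python
import re


def consensus_tmh(pred_a, pred_b, pred_c, min_support=2):
    """Per-residue consensus: 'M' where at least `min_support` methods agree."""
    if not (len(pred_a) == len(pred_b) == len(pred_c)):
        raise ValueError("all predictions must cover the same sequence length")
    return "".join(
        "M" if (a == "M") + (b == "M") + (c == "M") >= min_support else "-"
        for a, b, c in zip(pred_a, pred_b, pred_c)
    )


def count_helices(consensus):
    """Number of contiguous membrane segments in a consensus string."""
    return len(re.findall(r"M+", consensus))


if __name__ == "__main__":
    hmmtop = "--MMMM----MMM--"       # toy per-residue labels, not real output
    tmhmm = "---MMMM---MM---"
    polyphobius = "----------MMMM-"
    consensus = consensus_tmh(hmmtop, tmhmm, polyphobius)
    print(consensus, count_helices(consensus))
```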

Analysis of syntenic regions in the three bee species

Synteny of OR genes was explored for D. novaeangliae, H. laboriosa and A. florea. As most of the sequences show perfect orthology between A. mellifera and A. florea 15, only A. florea was chosen from the two honey bees. FASTA sequences of scaffolds containing OR genes were extracted along with their corresponding annotation files in GFF format. These were analysed for syntenic blocks across genomes using SyMAP v4.2 (Synteny Mapping and Analysis Program)87, 88 with default parameters. BLAT was run internally by SyMAP with default parameters (-minScore = 30, -minIdentity = 70). The default parameters for defining syntenic regions include Top N = 2 (retain the top two hits for every sequence region as well as all hits with a score of at least 80% of the second hit) and Min Dots = 7 (minimum number of anchors required to define a syntenic block). MCScanX (adjusted MCScan algorithm for detection of synteny and collinearity)89 was also used to inspect syntenic OR gene-containing regions and tandem duplications within the genomes, with default parameters as follows: −b = 0 (calculate both intra- and inter-species collinear blocks), −k = 50 (MATCH_SCORE; final score = MATCH_SCORE + NUM_GAPS*GAP_PENALTY), −s = 5 (MATCH_SIZE, the number of genes required to call a collinear block) and −e = 1e-05 (E-value cutoff). BLASTP of ORs from the three species was performed against themselves (E-value of 1e-05 or less) and only the best non-identical hits were provided as input to MCScanX along with the combined GFF file derived from all three species. The largest OR gene syntenic region, consisting of A. florea scaffold NW_003789703.1 (~Chromosome 2), H. laboriosa scaffold NW_017100842.1 and D. novaeangliae scaffold NW_015373891.1, was critically examined for the presence of previously undetected genes.
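Selecting the best non-identical BLASTP hit per query before running MCScanX can be sketched as below. The sketch assumes standard 12-column tabular BLAST output and interprets ‘non-identical’ as excluding hits of a sequence to itself; both are assumptions, as are the function name and the toy records.

```python
def best_non_identical_hits(blast_lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the top non-self hit.

    Expects tab-separated lines with the 12 standard tabular BLAST columns:
    query, subject, identity, length, mismatches, gap opens, query start/end,
    subject start/end, E-value, bit score.
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query == subject or evalue > max_evalue:
            continue  # skip self hits and hits above the E-value cut-off
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    toy = [
        "HlOr1\tHlOr1\t100.0\t400\t0\t0\t1\t400\t1\t400\t0.0\t800",
        "HlOr1\tAfOr1\t85.0\t400\t60\t0\t1\t400\t1\t400\t1e-150\t650",
        "HlOr1\tDnOr1\t70.0\t400\t120\t0\t1\t400\t1\t400\t1e-100\t500",
    ]
    print(best_non_identical_hits(toy))
```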

Phylogenetic reconstruction of ORs from six hymenopteran species

OR protein sequences from the six hymenopteran species mentioned above were collected and all sequences shorter than 200 amino acids were removed. The remaining 1240 sequences (176 AmOrs, 171 AfOrs, 92 DnOrs, 123 HlOrs, 377 HsOrs, 301 NvOrs) were aligned using the MAFFT v7.123b E-INS-i strategy with the JTT200 matrix and 1000 iterations90. The resulting alignment was trimmed using the trimAl91 ‘automated1’ option. A maximum likelihood phylogenetic tree was reconstructed from the reduced alignment (192 alignment positions) using RAxML v7.4.292 with the PROTCATJTTF model, 100 rapid bootstraps and six olfactory receptor-coreceptor sequences as outgroup. The resulting tree was provided as a guide tree for the second iteration of the alignment using MAFFT. This alignment was again trimmed, this time using the trimAl option ‘gappyout’, which retained considerably more (401) alignment positions. A second round of phylogenetic reconstruction was performed on the refined alignment using RAxML with similar parameters. The tree was visualized using iTOL v393 and was subdivided into 34 subfamilies/clades with the help of the existing hymenopteran OR tree13,14,15.
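For orientation, the first round of alignment, trimming and tree building could be assembled as command lines along the following lines. This is a sketch under assumptions rather than the exact invocation used in this study: file names, run name, random seeds and the executable name `raxmlHPC` are hypothetical, the mapping of ‘100 rapid bootstraps’ to `-f a -x … -# 100` is an assumption, and the guide-tree step of the second round is not shown. The commands are only built as argument lists, not executed.

```python
def mafft_einsi_command(fasta_in):
    """MAFFT E-INS-i (--genafpair) with JTT 200 and 1000 refinement cycles.

    MAFFT writes the alignment to standard output, so the caller is
    expected to redirect it to a file.
    """
    return ["mafft", "--genafpair", "--maxiterate", "1000",
            "--jtt", "200", fasta_in]


def trimal_command(alignment_in, alignment_out, mode="automated1"):
    """trimAl with one of its automated trimming modes."""
    if mode not in ("automated1", "gappyout"):
        raise ValueError("mode must be 'automated1' or 'gappyout'")
    return ["trimal", "-in", alignment_in, "-out", alignment_out, f"-{mode}"]


def raxml_command(alignment, run_name, outgroups, seed=12345, bootstraps=100):
    """RAxML rapid bootstrapping plus ML search with the PROTCATJTTF model.

    The seed value is a placeholder; outgroup names are joined with commas.
    """
    return ["raxmlHPC", "-f", "a", "-m", "PROTCATJTTF",
            "-x", str(seed), "-p", str(seed), "-#", str(bootstraps),
            "-s", alignment, "-n", run_name, "-o", ",".join(outgroups)]


if __name__ == "__main__":
    # Hypothetical file and outgroup names, for illustration only.
    print(" ".join(mafft_einsi_command("ors_min200aa.fasta")))
    print(" ".join(trimal_command("ors.aln", "ors.trimmed.aln")))
    print(" ".join(raxml_command("ors.trimmed.aln", "round1", ["Orco_1", "Orco_2"])))
```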

Analysis of putative cis-regulatory regions of hymenopteran ORs

Gene locus information was collected for DnOrs (this study), HlOrs (this study), AfOrs15, AmOrs12, 20, 21, HsOrs13 and NvOrs56. The 300-nucleotide upstream regions of all these OR genes were extracted using a Perl script, and only those longer than 100 nucleotides were retained. These were subjected to motif identification using MEME v4.11.217, 94, searching for a maximum of 10 motifs of width 6 to 10 with zero or one occurrence per sequence and an E-value cut-off of 1e-05. The motifs were mapped onto the OR protein phylogenetic tree using iTOL and compared for their distribution across species and across phylogenetic subfamilies/clades. All the motifs were scanned against various existing TF databases using the TOMTOM module95 of the MEME suite. To check whether Motif 1 is present upstream of OR genes only, 300-nucleotide upstream regions were collected for all the genes in the six genomes mentioned earlier and submitted to FIMO96, a module of the MEME suite. A separate phylogenetic tree of ORs from clade X only was built using a non-guided, manually trimmed alignment, and the distribution of Motif 1 to Motif 4 was mapped around this tree for better understanding of the evolution of these motifs.
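The extraction of upstream regions can be illustrated with the following sketch, which is written in Python for illustration and is not the Perl script used in this study. It assumes 1-based inclusive coordinates (as in GFF) for the coding region, takes up to 300 nucleotides 5′ of the start on the gene's own strand, truncates at scaffold ends, and keeps only regions longer than 100 nucleotides. Function and parameter names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(scaffold_seq, start, end, strand, flank=300, min_len=100):
    """Return the upstream region of a gene, or None if it is too short.

    start/end are 1-based inclusive coordinates of the coding region on the
    scaffold. For '-' strand genes the region lies after `end` and is
    reverse-complemented so that it reads 5'->3' towards the start codon.
    """
    if strand == "+":
        lo = max(0, start - 1 - flank)
        region = scaffold_seq[lo:start - 1]
    elif strand == "-":
        hi = min(len(scaffold_seq), end + flank)
        region = reverse_complement(scaffold_seq[end:hi])
    else:
        raise ValueError("strand must be '+' or '-'")
    # Only regions longer than `min_len` nucleotides are retained.
    return region if len(region) > min_len else None


if __name__ == "__main__":
    scaffold = "A" * 50 + "C" * 300 + "ATGAAA" + "G" * 400  # toy scaffold
    print(len(upstream_region(scaffold, 351, 356, "+")))
```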

All data generated or analysed during this study are included in this published article (and its Supplementary Information files).