Introduction

Moonlighting genes are genes that encode proteins having multiple distinct and often unrelated functions1. Paradoxically, these proteins often have a cytosolic location as well as being found on the exterior of the cell. To our knowledge, no secretion-system partners have been identified for these proteins. In the 30 years since glyceraldehyde 3-phosphate dehydrogenase (GAPDH) was found to have a secondary role on the cell surface of pathogenic streptococci2, many other examples of moonlighting have been uncovered. Such examples include gene products with well-known cytosolic roles that somehow end up on the surface of the cell, or are excreted into culture media. The manually curated MoonProt database now lists over 300 such genes, spanning host organisms that range from bacteria to yeast, protists, archeons, plants, and mammals. Many fundamental questions remain unanswered: How do these genes acquire multiple functions? How do they gain access to the exterior of the cell, in the absence of secretion-system partners? Why do some metabolic enzymes get secreted while many others do not? And how is it that the same proteins (e.g. GAPDH, enolase, DnaK, GroEL, Ef-Tu, superoxide dismutase) serve in moonlighting roles across diverse hosts? Because the same proteins are often found in moonlighting roles across phyla, it seems likely that the phenomenon is made possible by processes that are fundamental to all life. Of note is that many genes involved in moonlighting are ancient, highly conserved genes, again pointing to underlying processes that are fundamental—perhaps even primordial, in some sense. In the present study, we aim for a top-down bioinformatics investigation of moonlighting genes, in which we look for high-level clues and pan-genomic causes and effects.

High-level overview

A key characteristic of cellular life is encapsulation: cells have an inside, and an outside, with durable structures separating the two. One way of looking at it is that the cell embodies an entropy gradient, with a high-entropy aqueous environment in the center, and a low-entropy (which is to say, highly structured) envelope, encompassing a membrane and structural components, at the periphery. The membrane components of the cell are largely composed of proteins containing non-polar amino acids; whereas by contrast, water-soluble proteins (such as those present at the center of the cell) have mostly polar amino acids on their surface. The genetic code offers a convenient (and universal) mechanism for specifying polar versus non-polar amino acids: a purine at base two of a codon virtually guarantees the selection of a polar amino acid, while a pyrimidine in the second base tends to guarantee a non-polar amino acid. This suggests a primordial genetic code that may (arguably, at least) have been a binary code, allowing either for polar or non-polar amino acids, based on the use of purines or pyrimidines in codons. (For discussion of this possibility, see Trifonov3). Whether RNA or DNA, the primordial genetic material may have been single-stranded, in which case transcription to mRNA (if it occurred) could happen in one direction only, namely in a 3′-to-5′ manner. However, with the arrival of double-stranded nucleic acids, transcription could occur in either of two directions. Due to complementarity, a message that encodes polar amino acids in one direction would naturally tend to encode non-polar amino acids in the other direction. A scenario can be imagined in the early days of double-stranded genetic material, namely the days before promoters, repressors, Shine Dalgarno sequences or other specialized sequence organizations such as UTRs/non-coding regions: at that point in time, transcription may have occurred bidirectionally, with water-soluble proteins produced in one direction and proteins rich in hydrophobic amino acids produced in the other direction, a situation that leads quite naturally to the production of membrane proteins and encapsulation of hydrophilic proteins within membranes (i.e., cellular life). Moonlighting is largely an issue involving “inside versus outside.” Therefore, it is only natural to wonder if clues to the phenomenon might involve questions of hydrophobic amino acid usage, and/or cell wall and cell membrane construction. We consider this and other questions in formulating bioinformatic techniques designed to discover moonlighting genes.

Materials and methods

Our model organisms include Streptococcus pneumoniae NCTC11032 (G + C content 40.6%), Escherichia coli NCTC11775 (G + C 51.7%), and Mycobacterium tuberculosis H37Rv (G + C 65.9%). RefSeq genomes were downloaded from NCBI’s repository. A total of 25 moonlighting genes were curated from the MoonProt database. These are genes for which ample evidence exists of moonlighting activity (Table 1). Many of these genes exist in more than one isoform. For our enrichment experiments, we count each isoform separately. For example, in M. tuberculosis, cysteine desulfurase exists as genes csd and iscS; superoxide dismutase exists as SodA and sodC; and so on. Altogether, counting all isoforms of all genes, M. tuberculosis contains 35 moonlighting-protein genes; E. coli was found to have 31; S. pneumoniae has 20.

Table 1 Targeted moonlighting genes.

The genes selected for this study have in common the characteristic that all are known to produce proteins having a cytosolic location as well as an extra-cytosolic location (either on the surface of the cell, or excreted into culture medium). Thus, they qualify under the rubric that has been called “Excretion of Cytosolic Proteins,” or ECP4.

We designed an enrichment assay in which we first score every gene on the basis of a metric, then sort genes by their scores, then obtain the top-scoring 20% of all genes. Within that top cut, we look for moonlighting genes, and other functional categories of genes. We then calculate fold-enrichment numbers, and compute an expectation value for each one based on cumulative hypergeometric probability. (The code for the enrichment analysis as well as for the hypergeometric-probability analysis is freely available at https://github.com/kasmanethomas/moonlighting/tree/main). We tried various “cur sizes” from 5 to 30% and consistently found enrichments at all cutpoints. The fold enrichments tended to be higher at smaller sample sizes. We settled on 20% as a cut size that would be appropriately inclusive yet not overly broad. The somewhat lower fold enrichments seen at this cut size mean that the numbers are properly conservative.

The metrics we use involve tallying the number (as a percent) of codons meeting a certain description: for example, one metric tallies the percentage of codons that match the pattern RNY, where ‘R’ is any purine, ‘N’ is any base, and ‘Y’ is any pyrimidine. Another metric we use involves obtaining Shannon entropies for purines/pyrimidines in bases one and three of all of a gene’s codons; these two entropies are then used to construct a 2D vector. Likewise we obtain the G + C entropies of bases one and three for a gene’s codons; these two entropies form a 2D vector. A metric is derived from the dot product of the two 2D vectors. (The motivation behind this metric is discussed in “Results”).

The code for calculating metrics and doing the enrichment assays consists of native JavaScript code created by the authors (see the Github repository at https://github.com/kasmanethomas/moonlighting for code listings). Our code conforms to ECMAScript2015 and we tested it in Google Chrome Version 107.0.5304.121 (x86_64).

Results

Our investigation revealed that two common factors exist for all moonlighting genes: first, they tend to be physically located near genes for enzymes involved in cell wall, cell membrane, or cell envelope construction; and second, they tend to encode, in antisense, small proteins that contain a high percentage of non-polar amino acids. We began by looking at where moonlighting genes occur on the genome of each biological model and found that they tend to co-locate with genes involved in cell wall and cell membrane construction. We then characterized moonlighting genes with respect to codon purine bias—and found that moonlighting genes have higher-than-average purine bias in both forward and backward (reverse complement) directions. Next, we developed enrichment assays based on these codon characteristics. The results of those assays were consistent with the idea that antisense open reading frames might exist in moonlighting genes. Accordingly, we searched for antisense ORFs in moonlighting genes. We found that not only do such ORFs exist, they often contain predicted transmembrane domains.

Location of moonlighting genes

In order to get an idea of the local environment in which moonlighting genes “operate,” we looked at their proximity to other genes. We asked: what are their neighbors? We can make a simple experiment of sampling for all genes that lie within plus or minus a certain distance (say five genes) of moonlighting genes, being careful to remove duplicate hits. The set of all nearest-neighbors within five genes of a moonlighting gene was tested. After the removal of duplicates, in M. tuberculosis H37Rv we found 298 genes (N = 298) with enrichment characteristics as shown in Table 2. Note: Similar results were found with a proximity radius of three as well as with ten. A radius of five was chosen because at lower ranges, the result set was comparatively sparse, containing only 136 genes, whereas at higher ranges, fold-enrichments tended to be low, with higher E-values. The most informative result-set was obtained at a radius of five.

Table 2 Enrichment: neighbor genes (N = 304) within 5 genes of moonlighting genes in M. tuberculosis H37Rv.

Notice that moonlighting genes tend to co-locate with cell-wall biogenesis genes. However, the most important numerical result is the large number of “hypothetical protein” genes found (Table 2). Almost 30% of the search-result set is composed of hypothetical-protein genes. It turns out, a much more informative picture can be seen once the “hypothetical protein” genes are no longer diluting our results. If we filter out the hypothetical-protein genes on the basis that they may be obscuring hidden results, and count only genes for which a function has been assigned, the enrichments look as shown in Table 3.

Table 3 Enrichment: neighbor genes (N = 210) within 5 genes of moonlighting genes in M. tuberculosis H37Rv, with “hypothetical protein” genes filtered.

Notice that genes involved in cell wall biogenesis, secretion, or inner membrane function are at the top of the list. However, the list now also includes many other important categories, including outer membrane, plasma membrane, and genes specifically involved in peptidoglycan synthesis. Similar results are obtained in E. coli (Table 4) and Streptococcus pneumoniae (please see Table 5).

Table 4 Enrichment: neighbor genes (N = 294) within 5 genes of moonlighting genes in E. coli NCTC11775, with “hypothetical protein” genes filtered.
Table 5 Enrichment: neighbor genes (N = 176) within 5 genes of moonlighting genes in S. pneumoniae NCTC11032, with “hypothetical protein” genes filtered.

Interestingly, all three organisms show enrichments for tRNA ligases. The MoonProt database lists 15 tRNA ligases (mostly from eukaryotes) as having moonlighting functions. Note that the subject of the moonlighting-gene “local environment” continues in the “Discussion”, where it is suggested that the proximity of these genes to membrane and cell wall building genes is far from coincidental.

Codon characteristics in moonlighting genes

In attempting to understand moonlighting genes, we began with what is arguably the single most basic and meaningful bioinformatic metric for studying any gene(s), which is the purine bias of base one of codons (hereinafter called R1). The significance of R1 (purine content, base 1) is that it is the single most reliable statistical indicator of open-reading-frame status. According to Ponce de Leon et al.5: “It is the only sufficiently robust signal for assisting in gene searches and annotations within genome investigations.” The reason is simple: Most genes, in most organisms, across all domains of life, have an average R1 value of 0.6 or more. That is, base 1 of codons in protein-coding genes is either adenine or guanine 60% of the time. While various explanations have been offered for this so-called “purine bias,” the simplest hypothesis is that the DA-da-da-DA-da-da cadence of this signal provides an easy way for the ribosome to detect and maintain frame alignment during translation. Figure 1 shows purine percent for each of the three bases of codons, versus CDS-genome G + C content, for N = 159 bacterial genomes. It can readily be seen that, for all organisms, the purine content of base one is significantly higher than for the other two codon bases.

Figure 1
figure 1

Purine content (A + G) of bases one, two, and three of codons for N = 159 bacterial genomes. Each dot represents a complete genome. See Supplement A for details.

In M. tuberculosis, the CDS-genome-wide mean value of R1 for all codons was found to be 0.60515 ± 0.05462. The R1 values for all moonlighting genes are graphed (Fig. 2). One can notice that a majority (24/35) of moonlighting genes show above-average R1 values.

Figure 2
figure 2

Purine content (R1) of base one of codons for N = 35 Moonlighting Genes in M. tuberculosis H37Rv. The genome-wide average of 0.60515 ± 0.05462 (for all 3906 CDS genes) is depicted in black.

The unusually high R1 values for moonlighting genes caused us to wonder if R1 might also be high in the opposite direction, on the opposite strand of DNA. Figure 3 shows the base-one purine content for anti-codons of the same genes (that is, codons in the reverse-complement of the message strand). Somewhat surprisingly, 25/35 genes show above-average purine bias in reverse-complement codons.

Figure 3
figure 3

Reverse-complement purine content (R1) of base one of anti-codons for N = 35 Moonlighting Genes in M. tuberculosis H37Rv. The genome-wide average of 0.53089, 0.05213 ± 0.05213 (for all 3906 CDS genes) is depicted in black.

An enrichment assay based on codon metrics

The forward and reverse high R1 values of Figs. 2 and 3 suggest a strategy for obtaining enrichments of moonlighting genes: obtain R1FORWARD and R1REVERSE-COMPLEMENT for every gene in the CDS genome, add the two together, and sort all genes by that metric. Then take the top 20% of genes and see how many are moonlighting genes. When we did this, we found enrichments as shown in Table 6.

Table 6 Enrichment for moonlighting genes in M. tuberculosis H37Rv using R1-forward plus R1-rc (reverse complement) as a metric.

The top 20% of genes sorted by using the metric contain 14 of 35 moonlighting genes, for a 2.00-fold enrichment, at cumulative hypergeometric odds of 0.005. Next, we considered whether a metric summing R1-forward and R1-reverse would simply be equivalent to tallying the percent of codons that match the pattern RNY, where R is any purine, N is any base, and Y is any pyrimidine. Our analysis shows that the two metrics are not the same and they give slightly different results. The RNY metric is more effective (Table 7).

Table 7 Enrichment for Moonlighting Genes in M. tuberculosis H37Rv using RNY (purine-any base-pyrimidine) as a codon metric.

The enrichment for moonlighting genes in M. tuberculosis shows a value of 2.29-fold at an expectation of zero. This means moonlighting genes, more than other genes, contain codons matching the RNY pattern, a pattern that is inherently bidirectional (since the anticodon of RNY is also RNY). We naturally wondered if this result is limited to M. tuberculosis, or might apply generally, to other organisms. Thus, Tables 8 and 9 show the results for E. coli NCTC11775 and Streptococcus pneumoniae NCTC11032.

Table 8 Enrichment for moonlighting genes in E. coli NCTC11775 using RNY (purine-any base-pyrimidine) as a codon metric.
Table 9 Enrichment for moonlighting genes in S. pneumoniae NCTC11032 using RNY (purine-any base-pyrimidine) as a codon metric.

The RNY enrichment technique was effective in all three organisms. Also, notably, the gene functional categories were in good agreement across organisms; for example, ribosomal protein genes are generally enriched by this technique.

Enhancement of enrichment

Two main questions arise in regard to the above results: (1) Could our enrichment technique be refined or improved in some way? (2) Why does the technique work at all? The presence of high purine bias in forward and backward directions suggests the potential for reverse transcription, and translation of antisense RNA, in these genes. We decided to pursue, as an Ansatz, the hypothesis that antisense open reading frames (asORFs) might exist in moonlighting genes. This led us to consider ways in which information running in two directions in the same gene could feasibly coexist. Our consideration was that the RY (purine/pyrimidine) axis might encode information differently than the SW (GC vs. AT) axis. Each axis is capable of encoding one bit’s worth of information. We wondered if degeneracy in bases one and three of codons might allow the “peaceful coexistence” of information on these axes, such that RY information going in one direction can effectively be superimposed on SW information going the other direction.

To test the above idea, a new metric was devised as follows:

  1. 1.

    For each gene, obtain the Shannon entropy of the RY signal in base one of codons. That is, find the average purine frequency (and pyrimidine frequency) for base one, and use it to calculate entropy in the standard way, as:

    $$entropy = f \times \left( \frac{1}{f} \right)$$
  2. 2.

    Do the same for base three.

  3. 3.

    Use the entropy values for base one and three to form a vector, [HRY1, HRY3].

  4. 4.

    Obtain the Shannon entropy of the SW signal (where ‘S’ means G or C and ‘W’ means A or T) in base one, and also in base three; and form a vector, [HSW1, HSW3].

  5. 5.

    Normalize the vectors so obtained.

  6. 6.

    Calculate their dot product. Use this as the basis of a metric.

The dot product of two vectors measures how much the vectors differ, directionally, because the dot product of normalized vectors is the cosine of the angle between them. A large difference is expected for the RY and SW vectors in the case of moonlighting genes. We expect a large angle and a small cosine, hence the score value for genes is computed according to 1−cosine. Enrichment values for M. tuberculosis are shown in Table 10, where genes are ranked by this new metric.

Table 10 Enrichment for moonlighting genes in M. tuberculosis H37Rv using a metric based on the dot product of RY and SW entropy vectors (see text for “Discussion”).

Results shown in Table 10 provide several new additional functional categories. The top four—including moonlighting genes—have fold-enrichments exceeding 2.0 and expectation values of zero. Further enrichment occurs when the RNY metric is combined with the entropy-dot-product-based metric. By simply summing the two metrics together (to produce a new metric), we were able to find 20 out of 35 moonlighting genes in Mycobacterium tuberculosis H37Rv, for a fold-enrichment of 2.86 at expectation zero. This same metric yields a 2.58-fold enrichment (E = 0) for moonlighting genes in E. coli, and a 2.25-fold enrichment in S. pneumoniae (at E = 0.009).

Antisense translation products: theoretical considerations

Based on the above results, which are consistent with our Ansatz (which says that antisense ORFs might exist in moonlighting genes), we decided to look for open reading frames in the reverse complements of moonlighting genes in our three model organisms. Until now, we have assumed naively, based on purine bias in reading frame zero, that antisense products will exist in frame zero on the complement strand. (We consider that there are three possible reading frames: zero, + 1, and + 2. These frames can exist on either strand, relative to the 5’ terminus of the strand.) But is this really a reasonable expectation? On purely theoretical grounds, we consider that there are three possible reading frames in the reverse direction, with different implications for overlap of codon information. Forward and reverse reading frames can overlap in the following ways (Fig. 4):

Figure 4
figure 4

Possible reading frame alignments in forward and reverse strands. Arrows designate the reading direction; the number “1 2 3” indicate codon base positions. ‘R’ means any purine; ‘n’ means any base. The ‘Rnn’ pattern is representative of ~ 60% of codons in any organism, across all domains of life. The top configuration (A) shows both strands in reading frame zero. In this configuration, base 2 of codons and anticodons align. In the middle configuration (B), the top strand is in reading frame + 1 (relative to the beginning of that strand) while the bottom strand is in reading frame zero. In this alignment, base 3 of codons occur opposite base 3 of anticodons. But notice that base 2 occurs opposite a purine, which means base 2 will be a pyrimidine. In the bottom configuration (C), the top strand is in reading frame + 2 (or equivalently, − 1), which means base one of codons will occur opposite base 1 of anticodons. This is not a feasible alignment if both codon and anticodon have a purine in the first-base position. See text for further “Discussion”.

In Fig. 4, complementary strands of DNA are shown adjacent each other, with codon bases numbered 1–2–3 on the bottom strand (which reads left to right, in this depiction) and anticodon bases numbered 3–2–1 on the top strand (which reads right to left). The symbol ‘R’ represents a purine; ‘n’ is any base. Arrows represent reading directions. At the top of the diagram, in the section labeled ‘A’, strands are oriented in “2 over 2” fashion: base 2 of the codon is opposite base 2 of its anticodon. This is the orientation that occurs if the translation reading frame is zero (the default) for each strand. The middle portion of the diagram, labeled ‘B’, shows codon/anticodon orientation when the top strand is in reading frame + 1. In this case, base 3 of one codon overlaps base 3 of the other; a so-called “3-over-3” orientation. The lowermost pair of strands (labeled ‘C’ in the diagram) shows the situation where the bottom strand is (as usual) in reading frame zero but the top strand is in reading frame + 2 (or, equivalently, − 1). This puts base one of the codon opposite base one of the anticodon (“1 over 1” configuration).

Clearly, in a gene that has overlapping ORFs, the “1 over 1” configuration is not feasible if each ORF has high purine bias, because a purine can never occur opposite to another purine in DNA. Therefore, we expect a + 2 antisense reading frame to be a rare occurrence. It could exist, but only if one ORF has low purine bias and the other strand has high purine bias; or if both strands have 50% purine bias.

The top configuration (2 over 2), where both strands are in reading frame zero, is at least tenable, since high purine content in base one of either strand can be offset by correspondingly high pyrimidine content in base three. This is feasible since base three is mostly degenerate; the requirement for a pyrimidine in base three comes at little cost. Nevertheless, “2 over 2” means that if base two is predominantly purines in one ORF, it must be predominantly pyrimidines in the other ORF. Thus, if the bottom strand encodes a hydrophilic polypeptide, the top strand will likely encode a hydrophobic one—a membrane protein.

The middle configuration (3 over 3), which has the top strand in reading frame + 1, will accommodate high purine bias in both strands simultaneously iff base two of each codon is a pyrimidine. Crucially, this means each ORF must encode a polypeptide with high non-polar amino acid content. (The genetic code is arranged in such a way that a pyrimidine in base two is very likely to mean a non-polar amino acid). This, in turn, means both translation products are likely to be membrane proteins.

To summarize, we expect that if an antisense ORF exists in a gene, it will only rarely be in reading frame + 2; it may be in frame zero; and it is reasonably likely to be in the + 1 frame, particularly if membrane proteins are the translation products.

Theory validation

The following question arises from the above logical deductions: How can we look for antisense ORFs? Where will they begin? The trivial answer is, they begin with a start codon; and they meet the requirement of an ORF, which is to say:

  1. 1.

    The start codon should be preceded by something resembling a Shine Dalgarno sequence.

  2. 2.

    The start codon should be ATG or perhaps GTG.

  3. 3.

    The start codon should be followed by some (suitably large) number of codons having significant purine bias in base one.

  4. 4.

    And the sequence of codons should end in a stop codon (TAA, TAG, or TGA).

A search for such structures can be performed by obtaining the reverse complement of a gene and matching it against a regular expression, such as:

$$/.\left\{ {{14}} \right\}\left( {\left[ {{\text{GA}}} \right]{\text{TG}}} \right)\left( {...} \right)\left\{ {{1},} \right\}?\left( {? = \left( {{\text{TAA}}\left| {{\text{TAG}}} \right|{\text{TGA}}} \right)} \right).../{\text{g}}$$

This expression allows a search for any 14 bases, followed by ATG or GTG, followed by 1 or more triplets that do not include a stop codon, followed by three bases. (The ‘g’ at the end simply means to search globally and report multiple results). The 14-base leader can be checked for Shine Dalgarno motifs (or a proxy measure, such as purine percentage) so as to filter low-quality hits. In order to obtain the translatable portion of a hit, the first 14 bases can simply be removed (or ignored). Hits might contain any number of codons; it is up to the user to set a suitable minimum limit. While the above regular expression produces good results, we find, in practice, that better results are obtained using:

$$/.\left\{ {{12}} \right\}\left( {..\left[ {{\text{ATG}}} \right]{\text{TGA}}..} \right)\left( {...} \right)\left\{ {{1},} \right\}?\left( {? = \left( {{\text{TAA}}\left| {{\text{TAG}}} \right|{\text{TGA}}} \right)} \right).../{\text{g}}$$

This expression searches for the combination start/stop motifs ATGA, TTGA, and GTGA, which are common in leaderless genes of all three of our model organisms (unpublished data). The first three bases of the motif constitute a start codon; the last three bases constitute the “opal” stop codon, TGA. When this motif occurs in leaderless genes, translation halts at the TGA, then begins again after a – 1 frameshift.

When we looked for putative antisense ORFs in the moonlighting genes of our three model organisms using the start/stop-motif regex, we found hits in almost every gene (see Table 11). Most of the hits (90/142 = 63.4%) were in the + 1 reading frame, as predicted.

Table 11 Summary: putative antisense ORFs in moonlighting genes of M. tuberculosis, E. coli, and S. pneumoniae.

While values of R1 (average purine content, base 1 of codons) are seemingly quite low in the hits, this is largely due to the abnormally low R1 values of the reading-frame + 2 hits, which drag the averages down. When we look at R1 values by reading frame (Table 12), we see that R1 values for reading frames zero and + 1 are reasonably high. Most likely, the reading-frame + 2 hits can be considered false positives. Some false positives also likely occur in reading-frames zero and + 1.

Table 12 Summary: average R1 (purine bias, base 1) of putative antisense ORFs in moonlighting genes of M. tuberculosis, E. coli, and S. pneumoniae.

An analysis of Y2 (pyrimidine content, base 2 of codons) shows that Y2 is significantly higher in reading-frame + 1 hits than in other reading frames (Table 13), in agreement with our prediction (see previous section) and suggests that any putative antisense ORFs that exist in frame + 1 likely encode membrane proteins. Overall, the results in Tables 11, 12, and 13 are in strong agreement with the theoretical predictions of the previous section.

Table 13 Summary: average Y2 (pyrimidine content, base 2) of putative antisense ORFs in moonlighting genes of M. tuberculosis, E. coli, and S. pneumoniae.

Putative antisense ORFs contain transmembrane domains

We attempted to gain further insight into whether the translation products of putative antisense ORFs that occur in moonlighting genes might, in fact, encode membrane proteins. To do this, we searched for antisense-direction hits using regular expressions based on:

$$/.\left\{ {{14}} \right\}\left( {{\text{ATG}}} \right)\left( {...} \right)\left\{ {{1},} \right\}?\left( {? = \left( {{\text{TAA}}\left| {{\text{TAG}}} \right|{\text{TGA}}} \right)} \right).../{\text{g}}$$

The above expression uses ATG as the presumptive start codon. However, we also used expressions containing start codons TTG, GTG, and CTG, as well as alternate start codons ATA, ATT, ATC, and TAC, based on an examination of the start codons in the annotated genes of our three model organisms. The three model organisms use all of these start codons (according to the annotated RefSeq genomes). We conducted separate regex searches using each start codon. Once putative ORFs were located, we translated the ORFs in silico and submitted the resulting polypeptide sequences to the Consensus Constrained TOPology (CCTOP) server app at http://cctop.ttk.hu/. CCTOP is a web application that takes a consensus-based approach to predicting transmembrane topologies. Using 10 different topology prediction methods, CCTOP incorporates previously determined structural and topology information into a probabilistic Hidden Markov Model. Its reliability (in terms of reduced false positives and false negatives) has been demonstrated to be better overall than HMMTOP, Phobius, or any other single transmembrane prediction technology used alone. Details can be found at the CCTOP website or in Dobson et al.6.

CCTOP predicted transmembrane domains in seven antisense ORFs from moonlighting genes of M. tuberculosis (Table 14). Some of the ORFs were overlapping. For example, glutamine synthetase subunit A1 (glnA1) was found to contain a putative antisense ORF with start codons ATG and TTG at offsets 613 and 625. Likewise, rpoB was found to contain an antisense ORF with three closely spaced start codons.

Table 14 Antisense ORFs containing predicted transmembrane domains, moonlighting genes of M. tuberculosis.

Fifteen moonlighting genes of E. coli yielded 49 asORFs with predicted transmembrane domains (Table 15). The genes were: aceB, dnaK, eno, fusA, gapA, glcB, gpmB, gpmM, murI, pfkA, pfkB, pgi, rpoB, sodC, and tuf. Again, as with Mycobacterium tuberculosis H37Rv, many of the hits overlap: for example, in fusA, start codons for what appears to be a single antisense ORF exist at offsets 682, 697, and 709. Notably, rpoB contains at least four closely spaced “start codons” inside a putative antisense ORF that spans 2892 bases in total length.

Table 15 Antisense ORFs containing predicted transmembrane domains, moonlighting genes of E. coli.

Ten moonlighting genes of Streptococcus pneumoniae were found to have putative asORFs with predicted transmembrane domains (Table 16). Six of the ten genes have asORFs containing multiple putative start codons.

Table 16 Antisense ORFs containing predicted transmembrane domains, moonlighting genes of S. pneumoniae.

Interestingly, all three organisms scored a transmembrane prediction for antisense ORFs in rpoB (DNA-directed RNA polymerase subunit beta) as well as fba (fructose-bisphosphate aldolase) and dnaK (chaperone DnaK). Notably, of the 89 total transmembrane-domain predictions that occurred across the three model organisms, 49 (55.1%) are in reading frame + 1, with only 8 in the + 2 reading frame. Most of the putative antisense ORFs in Tables 14, 15, and 16 are small (with median lengths ranging from 65 amino acids in Streptococcus to 92 in E. coli). This is not unexpected, given recent work7 showing that genes encoding small (< 100 AA) proteins may account for 16% (± 9%) of all proteins in bacteria, with many of such genes involved in membrane proteins, and some encoded in antisense8. However, not all transmembrane-containing asORFs are small. The largest putative transmembrane-containing antisense ORF that we found in E. coli is 2892 bases long and occurs in rpoB.

Discussion

In this study, we looked at forward and reverse-complement purine bias in base one of codons (and anticodons), and we found that most moonlighting genes (not only in M. tuberculosis, but also in E. coli and Streptococcus pneumoniae) score well above the CDS-genome average for forward and reverse purine bias. Based on this finding, we adopted a provisional assumption that antisense translation products might be encoded in moonlighting genes. This led us to hypothesize that codons meeting the pattern RNY (purine, any base, pyrimidine) might exist in relative abundance in moonlighting genes. And indeed, we found that by scoring all of an organism’s genes according to “RNY content,” we could enrich for moonlighting genes: we obtained fold-enrichments of 2.29–2.58, at hypergeometric expectations of zero to 0.002, in the three model organisms. We reasoned that if information is being encoded bidirectionally in at least some portions of moonlighting genes, a consequence of this would be that degenerate codon bases (base 1, base 3) would need to accommodate the informational load in such a way that information is essentially multiplexed. To test this possibility, we came up with a heuristic based on the idea that codon information can be encoded differentially along RY and SW axes. (RY refers to the IUPAC ambiguity axis of purine versus pyrmidine; SW refers to the axis of GC versus AT.) We assessed base-1 Shannon entropy in the RY axis, and base-3 Shannon entropy in that axis, for each gene; using these numbers, we created a vector [HRY1, HRY3]. In like manner, we created a vector [HSW1, HSW3] for the SW entropies of bases 1 and 3. We then calculated the dot product of the vectors so created, and used a metric of 1−dotProduct to sort genes. This metric produced a substantial enrichment for moonlighting genes in M. tuberculosis, and the combination of the RNY metric plus the dot-product metric produced significant moonlighting-gene enrichments in all three organisms.

Encouraged by these findings, we looked for antisense open reading frames (asORFs) in moonlighting genes, and found 142 of them in 81/86 moonlighting genes in the three model organisms. We predicted, on purely theoretical grounds, that most antisense transcripts would be in the + 1 reading frame; and indeed this turned out to be the case (90 of 142 asORFs were + 1). We also predicted asORFs in the + 1 frame would contain mostly pyrimidines in base two of codons. This was also the case. The average Y2 (pyrimidine, base 2) content of asORFs in reading frame + 1 ranged from 0.6309 to 0.7046. Since a codon with pyrimidine in base 2 usually specifies a non-polar amino acid, we anticipated that moonlighting-gene asORFs might encode membrane proteins. When we checked the translation products of the putative asORFs for transmembrane domains, 89 such protein products were predicted (by the CCTOP prediction server at https://cctop.ttk.hu/) to contain transmembrane domains.

Based on these findings, and based on our finding that moonlighting genes tend to co-locate with genes involved in cell wall or cell membrane construction (see the section “Location of Moonlighting Genes” further above), we propose the model for moonlighting shown in Fig. 5, which we call the THX1138 Model, named after the protagonist of George Lucas’s first motion picture (“THX1138”), in which the hero—trapped in a subterranean dystopia—escapes to the surface of the planet, where he sees sunlight for the first time.

Figure 5
figure 5

A possible scenario involving translation of a moonlighting gene. Transcription occurs bidirectionally (see green orbs, above, representing RNA polymerase). “Wrong-way” RNAPs, at the ends of the moonlighting gene, produce antisense products containing membrane-friendly polypeptides that associate with the membrane, possibly via transertion (but possibly via some other mechanism). The membrane-friendly antisense products provide firm anchors to the DNA. Intensive cell wall construction is underway in the area; this means there are gaps in the wall. The moonlighting gene, anchored at each end by transertion tethers, is held in close physical proximity to the open section of wall. A newly produced moonlighting protein, when it detaches from its ribosome, easily passes through the gap in the cell wall. It may, in fact, have nowhere else to go. See text for further “Discussion”.

Bidirectional transcription of the moonlighting gene causes nascent antisense proteins to be produced, which stick to the membrane. (This hypothesis is based on our finding that the antisense proteins in question often contain putative transmembrane domains). In growth phase, translation is transcriptionally coupled, so that if the nascent antisense protein(s) sticks to the membrane as it is being manufactured, the DNA is essentially tethered to the membrane. Genes immediately upstream and downstream of the moonlighting gene may also have antisense transertion tethers, formed through the same mechanism. (Evidence for transertion tethering of the kind mentioned here has been documented for a number of bacterial species; see the review by Woldringh8). We hypothesize that anchoring of DNA to the membrane in this fashion is (possibly) a widespread phenomenon, perhaps involving hundreds of genes. (This view is consistent with recent research on small proteins in bacteria9, many of which have ORFs that exist in overlapping reading frames10, often in antisense11). In the area between the ends of the gene, intensive cell wall construction may be occurring, and we hypothesize that there are areas where sizable (~ 10–30 nm) gaps exist—areas where bridging of the gap by transertionally tethered DNA may, in fact, be an essential structural reinforcement to prevent the weakened wall from opening up catastrophically. Meanwhile, gaps in the under-construction wall are large enough to allow whole proteins to pass through unrestricted, pushed out forcibly under turgor pressure.

Regardless of whether antisense proteins are produced, the escape of moonlighting proteins to the surface of the cell can be explained rather simply by the fact that production of moonlighting proteins occurs in close proximity to areas of intensive cell wall and cell membrane construction. A straightforward leakage hypothesis is not only warranted, but compelling—and easily explains why no secretion systems have ever been implicated in ECP release. “Non-classical secretion” is simply propitious leakage. More complicated explanations should not be pursued until simple ones have been ruled out (Occam’s Razor). Parsimony dictates that we should entertain a leakage theory before others; the burden of proof is on those who insist on more complex explanations. Given the 5–30 atmospheres of turgor pressure that exist inside a bacterial cell12, gene products produced near a “hole in the wall” might very well be forced, violently, through the hole, like a passenger blown out an airliner window after explosive decompression at 30,000 feet.

The THX1138 Model (propitious leakage aided by forced proximity via the action of membrane-bound polysomes) explains a number of aspects of “moonlighting” that have managed to elude explication for 30 years:

  1. 1.

    It explains why functionally unrelated enzymes (glycolytic enzymes, chaperones, elongation factors, superoxide dismutases, etc.) are involved. They have something in common: their antisense products are rich enough in non-polar amino acids to stick to the membrane.

  2. 2.

    It explains how ECP-type moonlighting genes achieve excretion: Aided by turgor pressure, they are squeezed out through holes in the under-construction membrane/wall complex, as a consequence of being produced in exactly the right location for this to happen, with non-polar antisense proteins anchoring the gene to the membrane in particularly porous areas.

  3. 3.

    It explains why some proteins, from the same operon, are excreted while others are not. Some glycolysis genes, for example, might produce antisense products that stick to the cell membrane, while others have no antisense products at all (or products that are too short, or too polar, to serve in the “stick to the membrane” role).

  4. 4.

    It explains why Boël et al.13 were able, by modifying the 3’ end of the GAPDH gene in streptococcus, to prevent excretion of GAPDH. Modifying the 3’ end of the gene could easily change the membrane-binding properties of a gene’s antisense product. It could reduce or eliminate the translatability of antisense product(s).

  5. 5.

    It explains why moonlighting genes are located next to membrane and cell wall construction genes: they (or rather, their antisense products) play a role in holding the membrane together while it is being built.

Our theory can be seen as encompassing two hypotheses: one is that leakage of moonlighting proteins occurs past areas of active cell wall/membrane construction. The other is that such leakage (if it occurs) is facilitated by membrane-friendly antisense proteins, which (during co-transcriptional translation) essentially tether moonlighting genes to the cell wall/membrane. While it is possible that leakage of moonlighting proteins past areas of active cell-wall/membrane construction may be occurring without help from antisense-protein-related tethering of DNA to the inner membrane, we believe the tethers (if they exist) may, in fact, be essential in “holding the door open.”

Evolutionary implications

Hundreds of genes co-enrich with moonlighting genes when a scoring metric is used that assumes bidirectionality of transcription and translation. Based on our enrichment experiments, it appears likely that many genes encode information on both DNA strands (in at least some sections). This situation is, of course, made possible by the degeneracy of the genetic code. Degeneracy is also what allows most point mutations to remain “neutral” (via synonymous codons). But in a region of bidirectional information flow, neutrality is necessarily reduced. In a configuration where codons and anticodons are offset by one base, with anticodons in reading frame + 1, neutrality is preserved in base 3, since in this configuration base 3 of codons will overlap base 3 of anticodons (Fig. 4). However, in other alignments, base 3 will overlap an information-rich (degeneracy-poor) base, and synonymous mutations will necessarily be rarer. For large regions of applicable genes, there may not be such a thing as a “neutral mutation.” From theoretical considerations, we can confidently predict that a + 1 offset of the antisense ORF will be the most neutrality-conserving alignment, since it puts base 3 of codons in alignment with base 3 of anticodons. For organisms with relatively high codon Shannon entropies, which is to say organisms having an average G + C content close to 50%, this is the alignment that gives the most protection against non-synonymous mutations. Organisms with significantly higher or lower GC content will have more “informational headroom” in their codons (because of GC or AT redundancy) and may thus be able to tolerate antisense reading frame offsets of zero or + 2. On this basis, we would predict that very-low-GC organisms might not have asORFs that are biased in favor of the + 1 antisense offset; other offsets will also be utilized. Even so, genes with significant two-way information content will (regardless of ORF framing) be less tolerant of point mutations and can therefore be expected to evolve slowly, appearing as “highly conserved genes” undergoing purifying selection. Kimura’s original “neutral theory”14 did not focus on point mutations directly; it merely posited that most “mutations” have little to no effect on phenotypes. Nevertheless the existence of bidirectional information in at least some genes constitutes an important footnote to any discussion of mutation-based evolution. Synonymous versus non-synonymous mutation rates, transversion/transition ratios, and other dynamics will need to be considered carefully in light of the bidirectionality (or non-bidirectionality) of various regions of genes. The demands of bidirectional evolution may place unusual constraints on codon composition.

Predictions of the model

Because our model allows us to construct metrics that consistently enrich for moonlighting genes, it allows for prediction of moonlighting functionality in genes that have not yet received such assignments. In Table 17, we present nine such predictions, representing genes that consistently co-enrich with moonlighting genes in all three of our model organisms. We expect some or all of these nine genes to be “moonlighters” of the ECP type. These are genes that consistently occur in our enrichment experiments, and do so across all three model organisms.

Table 17 Predicted moonlighting proteins.

While it is obviously impossible for us to predict what the secondary function of any of these genes might be, nevertheless we would, at a minimum, expect the gene products in question to exist extra-cellularly (either on the surface of cells, or in the culture supernatant), via the “propitious leakage” mechanism.

Limitations of the current study

One important limitation of our study is that we did not attempt to look for antisense ORFs that span gene boundaries. We would expect some such ORFs to exist, since roughly 15% of genes in each of our model organisms are leaderless (adjoining; in some cases, overlapping) genes that may be transcribed polycistronically. We also did not attempt a comprehensive search for intra-gene promoters in the antisense strands of moonlighting genes. Antisense intra-gene promoters are present in about 11% of M. tuberculosis genes (based on our unpublished data), and we believe these promoters may play a role in modulating the expression of asORFs. Upstream antisense promoters might also exist in the neighbors of moonlighting genes. This is an area for further research.

Our enrichments for moonlighting genes, though moderately successful (with fold-enrichments of 2.0–3.0), did not “find” all moonlighting genes. So it’s fair to ask, why not? Why didn’t all moonlighters enrich? We believe there are several possible answers. First, our metrics did not take into account effects that might involve antisense ORFs that cross gene boundaries (as mentioned above), and it is possible such effects could be important. Moonlighting is, after all, we believe, a hitchhiking phenomenon, arising from the tendency of certain chaperones, metabolic genes, and others to “ride the coat-tails” of cell-wall synthesis genes in the course of many syntenic crossover events and/or other gene relocation events. It would make sense if moonlighting is related not only to antisense products arising from moonlighting genes themselves, but nearby neighbor genes as well. It may also be that some moonlighting genes have silent secretion partners in the form of hypothetical proteins. This could be particularly true for M. tuberculosis, where enrichment stats were generally weaker than for E. coli or S. pneumonia. In M. tuberculosis, more than in the other two organisms, moonlighting genes tend to cluster near hypothetical protein genes, as a result of that organism having 26.9% hypothetical protein genes versus 5.5% for E. coli and 9.7% for Streptococcus But the answer to the question “Why didn’t all moonlighting genes enrich?” might be simpler still. Our enrichment probes were effective in enriching many other categories of genes (e.g. genes for ribosomal proteins, cell wall biogenesis, fatty acid synthesis, transporters, permeases, and others), suggesting that antisense ORFs with the potential to produce small membrane proteins might be extremely common, involving perhaps ~ 20% of all genes. Our techniques enriched all of those genes, making dilution of our moonlighting harvest inevitable.

More generally, a limitation of the current study is that we did not undertake wet-lab investigations to determine if antisense ORFs are actually translated, nor did we search the ribosome-profiling data online to see if any of these ORFs have been uncovered in high-throughput profiling. We identified antisense ORFs using what we feel is industry standard ORF-calling logic, but we are unable to make (and do not make) any claim, one way or the other, on whether these ORFs are, in fact, translated in vivo. We invite other researchers to pursue this area further.

Conclusions

In this study, we found that moonlighting genes of three model organisms (Mycobacterium tuberculosis H37Rv, Escherichia coli NCTC11775, and Streptococcus pneumoniae NCTC11032) tend to co-locate, on the genome, near genes involved in cell wall biogenesis, secretion, and inner or outer membrane synthesis. We were able to create a simple bioinformatics probe that quantifies the potential for reverse transcription and antisense translation, and found that such a probe allows us to discover moonlighting genes by means of a straightforward enrichment assay technique. Based on theoretical considerations, we predicted that if any antisense open reading frames existed in moonlighting genes, they would most likely exist in reading frames zero or + 1; frame + 2 would probably not be well utilized. We also predicted that any ORFs found to be in reading frame + 1 would encode proteins with a high percentage of nonpolar amino acids. We were able to validate these predictions. When we looked for antisense ORFs in moonlighting genes, we found 142 putative ORFs across the three model organisms, 90 of which were in reading frame + 1. Moreover, the 90 antisense ORFs of reading frame + 1 had comparatively high non-polar amino acid content. When we checked the putative translation products of antisense ORFs using the CCTOP transmembrane prediction server, we found that seven translation products of M. tuberculosis, fifteen products from E. coli, and ten products from S. pneumoniae contained predicted transmembrane domains. Most of the remaining products are expected to be membrane proteins based on high nonpolar amino acid content. Based on these findings, we presented a model that proposes a role for antisense membrane proteins in binding moonlighting genes to the bacterial inner membrane, in areas of active cell wall construction. Because moonlighting proteins are produced in “forced proximity” to porous areas of new cell wall, and because turgor pressure in bacterial cells is extreme (5–30 atmospheres), escape of moonlighting proteins to the exterior of the cell, through gaps in the wall, is (we believe) unavoidable. Our model allows us to predict that certain proteins (which have not yet been found to be moonlighters) will likely have an extracellular role. Thus, we made specific predictions for nine genes: rpoC, gyrA, gyrB, LigA, typA, ptsP, infB, purH, and aspS.