Introduction

Transcription, which occurs in all living organisms, is a basic mechanism for maintaining and changing cellular states by switching the expression of various genes on and off. Eukaryotes, which contain an RNA polymerase II (pol II) transcription system for messenger RNA synthesis, possess multiple initiation factors called general transcription factors (GTFs)1. The TATA-binding protein (TBP) is a GTF that binds to an upstream sequence of a target gene, the so-called TATA-box, after which another GTF, TFIIB, binds to TBP and its adjacent sequence, the B recognition element (BRE)2,3,4. The TBP-TFIIB complex on the promoter region recruits pol II and additional GTFs, such as TFIIF, TFIIE and TFIIH4. Since TBP and TFIIB/TFIIB-like factors are present in all three eukaryotic RNA polymerase systems (TBP and Rrn7 in the pol I system, TBP and TFIIB in the pol II system and TBP and Brf1 in the pol III system)5 and are conserved among archaea and eukaryotes6,7, the common ancestors of archaea and eukaryotes must also have possessed TBP and TFIIB in their transcription apparatus. Therefore, elucidation of their molecular evolution should provide critical information about the initial form of the pol II-type transcription apparatus. Moreover, the evolutionary analysis of transcription initiation factors and transcription systems would shed light on the initial forms of cellular life and its evolution.

Although previous studies have reported that the DNA-binding domain of TBP exhibits sequence similarity to that of eubacterial transcription initiation factors (sigma factors)8, their tertiary structures were later shown to be different9,10, suggesting that the two domains have distinct origins. TFIIB was then suggested to have functional similarity to the sigma factors11, but their amino acid sequences and tertiary structures were different11. So far, therefore, the evolutionary origins of TBP and TFIIB remain elusive. In general, to elucidate the evolution of a particular gene family, a phylogenetic analysis is utilized to estimate the evolutionary distances from the most recent common ancestor (MRCA) to present offspring genes in the gene family12 (Fig. 1A,B). First, in the evolutionary analysis, the evolutionary distance for every pair of sequences is calculated (Fig. 1A). Then the calculated evolutionary distances are used to prepare a phylogenetic tree (Fig. 1B). Since all sequences are equivalent on the phylogenetic tree, it is impossible to choose the MRCA among nodes in the phylogenetic tree. There are no reasonable measures for selecting one specific node as the MRCA at this point. However, once we define an outer group, which is usually an isolated clade in the phylogenetic tree, the node nearest to the outer group can reasonably be considered the MRCA of the remaining genes. For example, if the clade containing Mj (Sa, Pw and Mj) is chosen as an outer group (the left panel of Fig. 1C), the MRCA of Sc, At and Hs can be determined (the black spot in the left panel of Fig. 1C). When the clade containing Sc (At, Hs and Sc) is chosen as an outer group (the right panel of Fig. 1C), the MRCA of Sa, Mj and Pw can be determined on the phylogenetic tree (the black spot in the right panel of Fig. 1C). In this way, the phylogenetic analysis suggests the evolutionary changes of the family genes. 
However, a problem arises from this approach13: it is impossible to estimate evolutionary distances from the MRCA to all family genes, because some of the family members must be set aside as an outer group. To obtain the MRCA information for both archaea and eukaryotes, outer group gene information such as eubacterial counterparts is required. To date, however, no TBP counterparts have been identified in eubacteria. Moreover, even if eubacterial counterparts were discovered, we would still need to utilize these counterparts as an outer group. In any case, it is impossible to obtain MRCA information for all known TBP genes. This is an unsolved dilemma in phylogenetic analysis. To overcome this limitation, a novel indicator that circumvents the requirement for an outer group is needed.

Figure 1

Schematic representation of the present evolutionary distance (d) and the distance between the first and second repeats (dDR).

(A) Schematic representation of evolutionary distances. The evolutionary distances between two genes in the gene family are shown as d1–d6. (B) Calculated evolutionary distances d1–d6 are utilized to prepare an unrooted phylogenetic tree. (C) Schematic drawing of an unrooted phylogenetic tree of archaeal and eukaryotic genes. The position of the MRCA for both archaeal and eukaryotic genes cannot be determined on the unrooted phylogenetic tree. When archaeal genes are considered as an outer group (distal relative genes), the MRCA for eukaryotic genes can be placed (left panel). When eukaryotic genes are considered as an outer group, the MRCA for archaeal genes can be placed (right panel). (D) Relationship of gene duplication, accumulated mutations and dDR. The EA-gene (middle panel) is generated by a gene duplication of a prototype gene (upper panel). The dDR value of the EA-gene (t = 0) is zero due to two identical nucleotide sequences in the direct repeat. The dDR value of the present offspring gene (lower panel) can be utilized as an indicator of the evolutionary distance between the EA-gene and the present offspring gene. (E) Relationship between the phylogenetic tree and dDR defined in this study. d is the path length “between two distinct genes” via their MRCA (e.g., the red line in the right panel)12. On the other hand, dDR is the path length “between the first and second repeats in one gene” via the hypothetical EA-gene (e.g., the red line in the left panel). Therefore, d reflects the evolutionary distance between the present gene and one of the ancestral genes, but dDR could be a reasonable indicator of the evolutionary distance between a present gene and its EA-gene. (F) Schematic drawing of the relationship between the phylogenetic tree and dDR. The usual phylogenetic tree is prepared based on the evolutionary distances (d) of two genes in the gene family. On the other hand, dDR is a reasonable indicator of the evolutionary distances between the EA-gene and each of the present offspring genes.

Here, we present a new analytical method to estimate evolutionary distances from a common earliest ancestral-gene (EA-gene) to all present offspring genes without using an outer group for the evolutionary analysis of TBP and TFIIB. Our analysis utilizes direct repeat sequences found in TBP and TFIIB as an evolutionary measure of the degree of similarity between the present offspring genes and their ancestors billions of years ago. Considering that the gene duplication is the beginning of a gene with a direct repeat, the evolutionary distance between the first and second repeats can be used as a new indicator of the evolutionary distance between ancestral and present offspring genes. Using this indicator, our analysis suggests the evolutionary changes of TBP and TFIIB. We also provide the first data on the evolutionary development of the transcription apparatus using this indicator.

Results

A new indicator for determining the distance from an ancestor

As described in the Introduction, a new indicator is needed to measure the evolutionary distances between an ancestral gene and the whole set of present offspring genes of a particular gene family. Since methods utilizing an outer group cannot overcome the limitation of the phylogenetic analysis, any potential indicators must utilize information embedded in gene sequences as an evolutionary measure. TBP and TFIIB are found in archaea and eukaryotes and no counterparts have been found in eubacteria.

We noticed that information in the direct repeat sequences present in both TBP and TFIIB could be utilized for analyzing the evolution of the genes without an outer group. Generation of a direct repeat can be considered the starting point of the molecular evolution of the direct repeat-containing genes. It should be noted that the first and second repeats are identical at the time of their generation (middle panel in Fig. 1D). In this study, we designate a direct repeat-containing gene that was generated just after gene duplication as the earliest ancestral-gene (EA-gene) in order to distinguish the EA-gene from other types of ancestral genes, including the MRCA. The EA-gene can be considered the first-appearing common ancestor of a direct repeat-containing gene.

Importantly, the nucleotide sequences of the first and second repeats, which were once the same, have diverged by mutations during evolution. Therefore, a sequence comparison between the first and second repeats allows us to evaluate the evolutionary distance between the EA-gene and its offspring gene(s) (lower panel in Fig. 1D). Here, we would like to propose that the evolutionary distance between the first and the second repeats can be utilized as an indicator of the evolutionary distance between the EA-gene and its offspring gene(s); the newly defined evolutionary indicator is designated as distance between Direct Repeat (dDR). Obviously, dDR is zero (dDR = 0) for the EA-gene, because nucleotide sequences of the first and second repeats were identical at that time. Thereafter, dDR gradually increased due to accumulated mutations in the course of evolution.

We can now analyze the molecular evolution using dDR and we designate this novel method the “dDR analysis”. It must be noted that the new indicator dDR is essentially different from the general evolutionary distance (d) (Fig. 1E). The dDR analysis does not utilize an outer group and thus dDR could be a reasonable indicator of the evolutionary distance between a present offspring gene and its EA-gene (Fig. 1F).
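The definition of dDR above can be illustrated with a minimal sketch. It compares the two aligned repeats of one gene and converts the fraction of differing sites into a corrected distance. Note the assumptions: the Jukes-Cantor correction is used here only as a simple stand-in (the Methods state that the distances in this study were calculated with the maximum composite likelihood method in MEGA), and the function names `p_distance`, `jukes_cantor` and `ddr_sketch` are hypothetical.

```python
import math


def p_distance(repeat1: str, repeat2: str) -> float:
    """Fraction of differing sites between two aligned repeats.

    Columns with a gap ('-') in either repeat are skipped.
    """
    if len(repeat1) != len(repeat2):
        raise ValueError("repeats must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(repeat1.upper(), repeat2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance (substitutions per site) from a p-distance.

    Assumption: a stand-in model, not the method used in the study.
    """
    if p >= 0.75:
        raise ValueError("p-distance of 0.75 or more is saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ddr_sketch(repeat1: str, repeat2: str) -> float:
    """Distance between the first and second repeats of a single gene."""
    return jukes_cantor(p_distance(repeat1, repeat2))
```

Under this sketch, two identical repeats give a distance of zero, which corresponds to the EA-gene at t = 0, and the value grows as the repeats diverge.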

Two criteria for the new indicator

The appropriate calculation of dDR needs to satisfy two criteria. First, gene duplication of the target gene should occur only once during the evolution of family genes. If an offspring gene has a second gene duplication, the accumulated mutations of the offspring gene will be lost at that time point; in such a case it can be considered that another EA-gene is generated among the family genes. When comparing dDR, the EA-gene should be the same among the target genes, because the EA-gene is defined as the origin of the evolution of the family genes. Therefore, we must examine whether or not the gene duplication of a target gene has occurred only once.

The method reported by Gogarten et al.14 and Iwabe et al.15 can be used to make this determination. In the case of direct repeat genes, a phylogenetic tree is prepared by using the first and second repeats of the target genes. The branch pattern of the obtained phylogenetic tree provides information on gene duplication in the evolution of the target genes6. A phylogenetic tree of genes generated by single gene duplication shows two distinct clades that consist of the first and second repeats, respectively. If gene duplication occurred more than once, each clade in the phylogenetic tree would consist of a mixture of the first and second repeats.
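The branch-pattern check described above can be sketched as follows. The tree is written as nested tuples of leaf labels, and each label is assumed to end in `_R1` (first repeat) or `_R2` (second repeat); both conventions are hypothetical and chosen only for illustration. The function reports whether a single branch separates all first repeats from all second repeats, which is the pattern expected for a single gene duplication.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree node."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def repeats_form_two_clades(tree, is_first_repeat):
    """True if one branch splits all first repeats from all second repeats."""
    all_leaves = leaves(tree)
    first = {name for name in all_leaves if is_first_repeat(name)}
    second = all_leaves - first
    if not first or not second:
        return False

    def any_subtree_matches(node):
        # A branch separating the two groups shows up as a subtree whose
        # leaf set is exactly one of the groups.
        under = leaves(node)
        if under == first or under == second:
            return True
        if isinstance(node, str):
            return False
        return any(any_subtree_matches(child) for child in node)

    return any_subtree_matches(tree)


# Hypothetical toy trees, not taken from Fig. 2.
single_duplication = ((("Mj_R1", "Pw_R1"), "Hs_R1"),
                      (("Mj_R2", "Pw_R2"), "Hs_R2"))
mixed_clades = ((("Mj_R1", "Mj_R2"), "Hs_R1"),
                (("Pw_R1", "Pw_R2"), "Hs_R2"))
```

Because the arbitrary root of an unrooted tree does not change which leaf sets lie on either side of a branch, the check does not depend on where the tree happens to be rooted in its written form.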

The second criterion is that the direct repeats in the target genes should have a moderate conservation ratio. Our indicator cannot be applied to genes that show a low conservation ratio between the first and second repeats. Since nucleotide sequences are composed of four nucleotides, two random nucleotide sequences are expected to share a sequence identity of approximately 25%, provided that there is no bias in nucleotide substitutions during evolution. Therefore, when the conservation ratio of the direct repeats is close to 25%, we cannot distinguish significant homology from random noise.
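The 25% baseline can be checked with a small simulation. This is a sketch under stated assumptions: the four nucleotides are drawn uniformly and independently, and the sequence length (270) and number of trials are arbitrary illustrative values, not parameters from this study.

```python
import random


def identity(seq1, seq2):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)


def random_identities(length, trials, seed=0):
    """Identities between pairs of uniformly random nucleotide sequences."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        s1 = rng.choices("ACGT", k=length)
        s2 = rng.choices("ACGT", k=length)
        values.append(identity(s1, s2))
    return values


values = random_identities(length=270, trials=200)
mean_identity = sum(values) / len(values)
```

The mean identity of such random pairs settles near 0.25, and individual pairs scatter around that value, which is why an observed identity close to 25% cannot be told apart from chance.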

Evolutionary distances to the EA-gene of TBP

Prior to calculation of the dDR values for TBP genes, we examined whether or not TBP genes satisfied the two aforementioned criteria. TBP has a direct repeat with approximately 180 amino acids in the C-terminal core region8,16. Accordingly, the tertiary structure of the corresponding region of TBP has a pseudo-two-fold axis9. Therefore, the EA-gene of TBP seems to be generated by gene duplication of a prototype gene. We examined the nucleotide sequences of TBP from 34 species (Fig. 2A) and found that the first and second repeats of the TBP genes were clearly divided into two distinct clades (Fig. 2B), indicating that the first criterion for our analysis was satisfied for the TBP genes.

Figure 2

Direct repeats present within TBP and TFIIB are derived from their EA-genes generated by single gene duplication.

(A,D) Phylogenetic trees drawn with the nucleotide sequences of the conserved core region of TBP (A) and TFIIB (D) from 34 species. Abbreviations of species names are given in the footnote of Table 1. (B,E) Phylogenetic trees drawn with the nucleotide sequences of the first and second repeats of TBP (B) and TFIIB (E) from 34 species. Red and cyan indicate, respectively, the first and second repeats of the TBP (B) and TFIIB (E) genes. (C,F) The dDR values of TBP (C) and TFIIB (F) are shown on their phylogenetic trees using red-blue coloring by the dDR values.

Next, we examined whether the 34 TBP genes satisfied the second criterion. The first and second repeats of the 34 TBP genes were compared on the basis of a nucleotide sequence alignment (Supplementary Fig. S1). Since the sequence identities between the first and second repeats of these TBP genes ranged from 39.9% to 61.5%, the second criterion for our analysis was also satisfied (Table 1A).

Table 1 Identity between the first and second repeats and dDR of the TBP (A) and TFIIB (B) genes.

Since all 34 TBP genes satisfied the two criteria, the dDR value for each TBP gene was calculated (Table 1A). Since the dDR values represent the distances from the EA-gene, all of the present TBP genes could be ordered according to their similarities to the EA-gene. The results also showed that archaeal TBP genes have dDR values (0.488–0.933) lower than those of eukaryotic genes (1.00–1.22) (Table 1A). Among all the examined TBP sequences, the TBP gene from M. jannaschii (hereafter Mj TBP) exhibited the lowest dDR (0.488). Species in the same clade as M. jannaschii (Methanococcus maripaludis, Methanocaldococcus fervens, Methanocaldococcus infernus and Methanothermococcus thermolithotrophicus) also have TBP genes with low dDR values (0.516, 0.524, 0.524 and 0.560, respectively) (Fig. 2C). These results indicate that Mj TBP and its close relatives are more similar to the EA-gene than the other TBP genes are.

Evolutionary distances to the EA-gene of TFIIB

The molecular evolution of TFIIB was also analyzed according to the dDR values. TFIIB contains a direct repeat with approximately 190 amino acids in the C-terminal core region17,18. Accordingly, the tertiary structure of the corresponding region of TFIIB has a tandem cyclin fold19,20 (Supplementary Fig. S2). The same analysis14,15 for TFIIB genes in the present 34 species (Fig. 2D) indicated that the first and second repeats of the TFIIB genes are clearly divided into two distinct clades (Fig. 2E) and thus that the first criterion for our analysis was satisfied for the TFIIB genes. Since the sequence identities between the first and second repeats of these TFIIB genes ranged from 32.6% to 51.4%, the second criterion for the dDR based analysis was also satisfied (Table 1B).

The calculation of the dDR values for the TFIIB genes also allowed the present TFIIB genes to be ordered according to their similarities to the EA-gene, revealing that archaeal TFIIB genes have low dDR (0.784–1.26) and eukaryotic ones have high dDR (1.28–1.95) (Table 1B), as observed in the case of the TBP genes (Table 1A). Among these genes, the TFIIB gene from M. maripaludis (hereafter MmTFIIB) showed the smallest dDR among the 34 TFIIB genes (0.784). M. maripaludis and M. jannaschii are branched into the same clade and, like TBP, species in this clade (M. fervens, M. thermolithotrophicus, M. jannaschii, M. infernus) have TFIIB genes with low dDR values (0.820, 0.828, 0.869 and 0.918, respectively) (Fig. 2F).

Introduction of a new evolutionary indicator, dDR, thus enabled us to estimate the evolutionary distances between the EA-gene and present offspring genes for TBP and TFIIB without setting an outer group. Moreover, the combination of dDR and a phylogenetic tree allowed us to determine the clade containing genes most similar to the EA-gene at a glance (Fig. 2C,F).

Analysis of evolutionary change of the amino-acid composition

Since dDR reflects the evolutionary distance between the EA-gene and its present offspring gene(s), we prepared a list ranking the 34 TBP genes according to their similarities to the EA-gene (Table 1A). This list was utilized to predict evolutionary changes of the amino-acid composition of TBP (Supplementary Table S1A). The dDR values for the TBP genes show a significant correlation with the numbers of Asp, Glu, Arg, Phe and Ser residues; the dDR value and the numbers of Asp and Glu residues have a strong negative correlation (r = −0.77 and −0.76, respectively) and the numbers of Arg, Phe and Ser residues show a strong positive correlation (r = 0.76, 0.82 and 0.71, respectively) (Fig. 3A).
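The correlation coefficients reported above are Pearson coefficients between the dDR value of each gene and its count of a given residue. A minimal sketch of that calculation is given below; the function name `pearson_r` is hypothetical and no values from Table 1 or Supplementary Table S1 are reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

In use, `xs` would hold the dDR values of the 34 genes and `ys` the number of, for example, Asp residues in the corresponding proteins; a value near −1 or +1 indicates a strong negative or positive linear relationship.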

Figure 3

Correlation between dDR and the amino-acid compositions of TBP.

(A) Correlations between dDR and the number of the specific amino-acid residues for TBP (Asp (D), Glu (E), Arg (R), Phe (F) and Ser (S)). The best fitting lines are shown in red. The correlation coefficient (r) and p-value (p) are shown in each graph. (B) Sphere models of TBP molecules from M. jannaschii (Mj), S. acidocaldarius (Sa), P. woesei (Pw), S. cerevisiae (Sc), A. thaliana (At) and H. sapiens (Hs). The upper and lower panels show front and back views, respectively. Asp and Glu residues are shown in red, Arg residues are shown in blue, Phe residues are shown in cyan and Ser residues are shown in yellow. The dDR value of each molecule is also shown. Black curved lines indicate the DNA binding surface of TBP. Dotted curved lines indicate the TFIIB binding surface of TBP.

These results are consistent with the fact that, while the core region of all TBPs has a hydrophobic DNA-binding surface surrounded by positively charged residues, the surface properties of the other core regions show significant difference among species. TBP has a larger number of acidic residues than basic residues in M. jannaschii (34 acidic and 22 basic residues)21, an almost equal number of both charged residues in Sulfolobus acidocaldarius (22 acidic and 26 basic residues)22 and Pyrococcus woesei (26 acidic and 25 basic residues)23 and a smaller number of acidic residues than basic residues in Saccharomyces cerevisiae (15 acidic and 27 basic residues)24, Arabidopsis thaliana (15 acidic and 29 basic residues)25 and Homo sapiens (13 acidic and 29 basic residues)26 (Fig. 3B).

The strong correlation between dDR and the number of Asp, Glu, Arg, Phe and Ser residues suggested that the observed differences of amino-acid composition of TBP proteins might be correlated with the evolutionary changes of TBP genes. In addition, assuming that the relationship between dDR and the amino-acid composition is linear throughout the evolution, we can make another quite informative prediction (see Discussion).

In the case of TFIIB, dDR and evolutionary changes of the amino-acid composition were also compared (Supplementary Table S1B). The number of Arg residues has a strong negative correlation with dDR (r = −0.74) and the number of Gln residues has a strong positive correlation (r = 0.75) (Fig. 4A and Supplementary Table S1B). However, the numbers of Asp, Glu, Phe and Ser residues showed no correlation with dDR in TFIIB. This is consistent with the fact that the surface property of the core region of TFIIB does not show large differences between P. woesei27 and H. sapiens20 (Fig. 4B).

Figure 4

Correlation between dDR and the amino-acid compositions of TFIIB.

(A) Correlations between dDR and the number of the specific amino-acid residues for TFIIB (Arg (R) and Gln (Q)). The best fitting lines are shown in red. The correlation coefficient (r) and p-value (p) are shown in each graph. (B) Sphere models of TFIIB molecules from P. woesei (Pw) and H. sapiens (Hs). The upper and lower panels show front and back views, respectively. Arg residues are shown in blue and Gln residues are shown in green. The dDR value of each molecule is also shown. Black curved lines indicate the DNA binding surface of TFIIB. Dotted curved lines indicate the TBP binding surfaces of TFIIB.

Evolutionary relationship between TBP and TFIIB

In order to analyze the evolutionary relationship between TBP and TFIIB, their dDR values were compared. As a result, a significant correlation was found between their dDR values (r = 0.75, p < 0.01) (Fig. 5A). The slope (coefficient) of the best fitting line is approximately 1.0. Assuming that the comparison of the dDR values between TBP and TFIIB is evolutionarily valid, mutations in the TBP and TFIIB genes seem to have accumulated at a nearly equal rate when the evolutionary distance is measured with dDR (Fig. 5A, see Discussion).

Figure 5

Evolutionary relationship between TBP and TFIIB.

(A,B) Evolutionary correlation between the TBP and TFIIB genes analyzed by dDR (A) and dMj (B). The best fitting lines are shown in red. The correlation coefficient (r) and p-value (p) are shown. (C,D) Correlation between dDR and dMj for TBP (C) and TFIIB (D). The correlation coefficient (r) and p-value (p) for each fitting are shown. Several plots near the x-axis deviate from the best fitting lines. These are plots for close relatives of M. jannaschii. Since the starting point of dMj-calculation is the M. jannaschii gene, dMj decreases quickly in the close relatives of M. jannaschii. However, dDR does not decrease in the close relatives of M. jannaschii, because the starting point of dDR-calculation is the EA-gene. This is the reason for the observed deviations from the best fitting line in the dMj plots.

We next analyzed the relationship between dDR and the general evolutionary distance (d) to assess the consistency between the two values. In this analysis, evolutionary distances between M. jannaschii and each of the other species (dMj) were utilized (Supplementary Table S2), because our dDR analyses suggested that the TBP and TFIIB genes from M. jannaschii were among the genes most similar to their EA-genes (Table 1).

First, we examined the evolutionary correlation between the TBP and TFIIB genes using the dMj values. As shown in Fig. 5B, the dMj values of the TBP genes showed a good correlation with those of the TFIIB genes (r = 0.82, p < 0.01), with the coefficient of the best fitting line being approximately 1.0. These are essentially identical to the results obtained using the dDR values (Fig. 5A). In addition, there was a strong correlation between dDR and dMj in both the TBP and TFIIB genes (r = 0.78, p < 0.01 for TBP; r = 0.88, p < 0.01 for TFIIB) (Fig. 5C,D). These results showed that the analysis using our indicator dDR was consistent with that using dMj, suggesting that dDR can also be utilized as an indicator of the molecular evolution.

It is of note that dDR and dMj are calculated in different ways. The dMj value is calculated by comparing two different genes and is regarded as a “molecular clock”28. On the other hand, dDR is calculated by comparing the first and second repeats of a single gene and can be designated as a molecular clock inside a gene (i.e., an internal molecular clock). Because the two values are derived from distinct nucleotide-sequence comparisons, it seems unlikely that their strong correlation (r = 0.78 for TBP and r = 0.88 for TFIIB) is merely coincidental and thus dDR would appear to be a reasonable indicator for analyzing molecular evolution.

Discussion

In this work, we introduced a new indicator of molecular evolution, dDR, to estimate the evolutionary distance between the EA-gene and its present offspring gene(s) (Figs 1 and 2 and Supplementary Fig. S3). The dDR analysis can be applied to any genes which contain an intramolecular direct repeat and satisfy the two aforementioned criteria (see “Two criteria for the new indicator”). Since no outer group is required to calculate the dDR values, the dilemma described in the Introduction can be overcome (summarized in Fig. 6A). As a result, we were able to obtain novel information about the molecular evolution of TBP and TFIIB present in both archaea and eukaryotes.

Figure 6

Schematic representation of the dDR analysis and its application.

(A) Comparison between the present method (d) and our novel analysis (dDR). (B) Schematic representation of the evolutionary development of TBP. The dDR values of TBP are indicated on the phylogenetic trees using a color continuum from red (low dDR) to blue (high dDR). (C) Our analysis implies that the TBP gene was generated after the emergence of the TFIIB gene. This is the first time that the emerging order of TBP and TFIIB genes has been reported in the study of molecular evolution and our results should thus provide novel insights into the evolutionary development of the transcription apparatus and other systems. The order of emergence of other general transcription factors, such as TAFs, TFIIA, TFIIE, TFIIF, TFIIH, remains unknown.

First, the dDR analysis enabled us to predict the species whose TBP and TFIIB genes are most similar to their EA-genes (Fig. 6B, Table 1). In previous studies, the ancestral transcription system has been discussed under the hypothesis that the archaea maintain their early molecular system. This assumption is based on several lines of collateral evidence: for example, the components of the archaeal transcription system are simpler than those of the eukaryotic system. Additionally, the current evolutionary method using eubacterial RNA polymerase (RNAP) as an outer group shows that the subunits of archaeal RNAP are more similar to their MRCA than those of eukaryotic pol II29,30. In this study, our dDR analysis has succeeded for the first time in providing direct and quantitative evidence that this hypothesis is reasonable from the viewpoint of the molecular evolution of TBP and TFIIB (Table 1). Our finding is also consistent with a recent study which found that the archaeal root sits within the methanogens13.

Second, the dDR values of the TBP and TFIIB genes from various species can also be utilized to predict the amino-acid composition of their EA-proteins (Figs 3 and 4). Assuming that the relationship between dDR and the amino-acid composition is linear during evolution and the best fitting line can be extrapolated to the y-axis, the EA-protein of TBP may contain approximately 19 Asp, 31 Glu, no Arg, no Phe and 4 Ser residues (Fig. 3A), suggesting that the EA-protein of TBP has a more acidic molecular property than that of Mj TBP. In the same way, the EA-protein of TFIIB might have a more basic molecular property than that of MmTFIIB (Fig. 4A). It may also be suggested that the number of acidic residues in TBPs and basic residues in TFIIB will decline in the distant future.
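The extrapolation described above amounts to fitting a straight line to (dDR, residue count) pairs and reading off its value at dDR = 0, the point that corresponds to the EA-gene. The sketch below uses ordinary least squares. The helper names are hypothetical, and flooring a negative intercept at zero is an assumption made here because a residue count cannot be negative; the source does not state how such cases were handled.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolated_count(ddr_values, residue_counts):
    """Predicted residue count at dDR = 0, floored at zero (assumption)."""
    _, intercept = fit_line(ddr_values, residue_counts)
    return max(0.0, intercept)
```

The prediction is only as good as the linearity assumption stated in the text, since dDR = 0 lies outside the range of the observed values.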

Although we previously reported that the surface charge distribution of TBP molecules shows significant differences21, until now it has been difficult to explain the phenomenon in relation to their molecular evolution. The dDR analysis provides an explanation. The TBP gene seems to have been generated from its prototype gene by duplication and the EA-protein of TBP seems to have been an acidic molecule and then to have decreased its acidic residues, probably to improve its interaction(s) with other factors, such as DNA and TBP-interacting proteins. On the other hand, the EA-protein of TFIIB seems to have been a basic molecule and then to have decreased its Arg residues in the course of its evolution, also probably to improve its interaction with other factors, such as TBP, whose Arg residues were increased over the course of its own evolution. These changes in the numbers of Arg residues may have contributed to a reduction in the electrostatic repulsion between TBP and TFIIB.

Interestingly, the number of Phe residues in TBP and that of Gln residues in TFIIB have a positive correlation with dDR. Since these two residues have only two codons, their numbers in a molecule are likely to be decreased by random mutations. These observations suggest that the mutations accumulated on these residues are not random but rather under a specific selective pressure. There might be a mechanism underlying the observed evolutionary tendencies and the relationships of the amino-acid compositions in TBP and TFIIB (Supplementary Table S1). Further analyses may uncover this mechanism.

When the evolutionary development of archaeal and eukaryotic transcription apparatus is considered, the following question is immediately raised: how was the TBP-TFIIB system established? Three possibilities can be considered for the evolutionary emergence of the TBP-TFIIB system: (i) TBP was generated first, (ii) TFIIB was generated first, or (iii) both were generated at the same time (Fig. 6C). The dDR analysis allows us to envisage the evolutionary development of the archaeal and eukaryotic transcription apparatus. Under the assumptions that the dDR values of TBP and TFIIB can be compared directly, the relationship between the dDR values of TBP and TFIIB is linear throughout evolution and the best fitting line can be extrapolated to the y-axis, a surprising result was obtained (Fig. 5). Since the coefficient of the best fitting line in Fig. 5A is approximately 1.0 and mutations in the TBP and TFIIB genes are accumulated at a nearly similar rate (Fig. 5A), the positive y-intercept of the best fitting line implies that mutations had already been accumulated on the TFIIB gene when the TBP gene was generated. This in turn suggests that the TFIIB gene was generated before the TBP gene.
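The argument above can be written out as a short worked relation. The symbols a (slope) and b (y-intercept) are introduced here for illustration, and placing TBP on the x-axis and TFIIB on the y-axis is inferred from the statement about the positive y-intercept rather than stated explicitly in the text.

```latex
% Best fitting line in Fig. 5A (a: slope, b: y-intercept; notation assumed)
d_{DR}^{\mathrm{TFIIB}} = a \, d_{DR}^{\mathrm{TBP}} + b,
\qquad a \approx 1, \quad b > 0

% At the time the EA-gene of TBP was generated, d_{DR}^{TBP} = 0, so
\left. d_{DR}^{\mathrm{TFIIB}} \right|_{d_{DR}^{\mathrm{TBP}} = 0} = b > 0
```

With a slope near 1, both genes accumulate dDR at about the same rate, so a positive b means that the two repeats of TFIIB had already diverged by about b when the two repeats of TBP were still identical.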

Considering that TFIIB is a pol II interacting-factor and forms a complex with pol II11, functional modulation of pol II might be evolutionarily initiated by direct interaction with TFIIB. This is one possible hypothesis to explain the development of the early transcription apparatus and its regulation. It is interesting that the crystal structures of the pol II-TFIIB complex11,29,30 and eubacterial RNAP-sigma complex10,31,32,33 suggested a functional relationship between TFIIB and sigma factor. Pol II/RNAP and its interacting factor(s) such as TFIIB and sigma factor would be earlier forms of the transcription apparatus in the evolution, as the dDR analysis suggested. On the other hand, TBP does not form a stable complex with pol II11 but forms a complex with various GTFs for transcriptional activation34,35,36. The ability of TBP to interact with other GTFs may have been acquired in earlier archaea and eukaryotes along with the change of the surface properties of TBP. Since the environment surrounding eubacteria and archaea is known to affect the amino-acid compositions of their molecules37, a large difference in the environment between earlier archaea and eukaryotes could have facilitated the change in the surface properties of eukaryotic TBP, leading to the association of other GTFs to establish a more complicated regulatory system of transcription.

This hypothesis could not be proven from the present method(s) of molecular evolution, because by these methods the order of evolutionary emergence of different genes is only estimated from their different distribution among various species12. For instance, genes encoding transcription enzymes exist in all three domains, but TBP genes do not exist in eubacteria6,7, suggesting that the emergence of transcription enzyme genes occurred earlier than the emergence of TBP genes. However, this idea could not be applied to the relationship between TBP and TFIIB, since their distributions among species are the same; both TBP and TFIIB only exist in archaea and eukaryotes. Provided that several conditions can be assumed, the dDR analysis could predict the order of emergence of the TBP and TFIIB genes for the first time. Further development of a quantitative measurement of molecular evolution would provide detailed insights into the evolutionary development of the transcription apparatus at the system level. We believe that our approach may ultimately lead to a new field of molecular evolution, which might be called “mechanistic and system evolution”.

Methods

Sequence analyses

Evolutionary distances (dDR and d) were calculated using the nucleotide sequences of the TBP and TFIIB genes from 34 species (Table 1). All sequences were derived from the NCBI database. Multiple nucleotide-sequence alignments of direct repeats were performed as follows. First, a multiple amino-acid-sequence alignment of the core region was generated with the amino-acid-sequence data of the 34 species (Table 1) by using ClustalW238 and the alignment was manually improved according to the results of BLAST239. Then, the amino-acid sequences of the first and second repeats of TBP and TFIIB were aligned using the amino-acid-sequence alignment of the core region on the basis of the tertiary structural superposition of the first and second repeats of TBP and TFIIB. The tertiary structural superposition was obtained from the DALI software40. Then, the nucleotide-sequence alignments were prepared on the basis of the amino-acid-sequence alignments for the first and second repeats of TBP and TFIIB (Supplementary Figs S1 and S2). Evolutionary distances and Newick formats were calculated by the program MEGA5.05 by using a maximum composite likelihood method with default parameters41. Nodes of phylogenetic trees in Fig. 2A,C,D and F were swapped by NJplot42 according to the dDR value. Phylogenetic trees were drawn by the program Unrooted42. The molecular graphics were prepared by PyMOL (http://www.pymol.org).
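The step in which nucleotide-sequence alignments are prepared from the amino-acid-sequence alignments can be sketched as codon threading: each aligned residue is replaced by its codon and each gap by three gap characters. The function name `thread_codons` is hypothetical, and the sketch assumes that the coding sequence contains exactly one codon per residue with no stop codon; it is not the authors' tool.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Map an ungapped coding sequence onto a gapped protein alignment row.

    Each residue consumes one codon from `cds`; each '-' becomes '---'.
    """
    residues = sum(ch != "-" for ch in aligned_protein)
    if len(cds) != 3 * residues:
        raise ValueError("coding sequence length must be 3 x number of residues")
    codons = []
    position = 0
    for ch in aligned_protein:
        if ch == "-":
            codons.append("---")
        else:
            codons.append(cds[position:position + 3])
            position += 3
    return "".join(codons)
```

Applying the function to the first and second repeat of each gene gives nucleotide rows that keep the reading frame and the gap positions of the protein alignment, which is what a codon-level distance calculation needs.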

Additional Information

How to cite this article: Adachi, N. et al. Uncovering ancient transcription systems with a novel evolutionary indicator. Sci. Rep. 6, 27922; doi: 10.1038/srep27922 (2016).