Introduction

A vast number of enzymes rely on metal cofactors for catalysis and/or redox conversions. Different types of such “metalloenzymes” appear to have evolutionary roots which reach more or less deeply into the distant past of life on Earth. The varying depths of evolutionary pedigrees likely reflect constraints imposed by paleogeochemistry such as abundances and solubilities of metals in specific environments and eons of our planet's history1. Iron-sulphur cluster-containing ferredoxins, for example, probably go back to the very origin of life2. Copper enzymes, by contrast, are argued to have emerged only after the oxygenation of the environment some 2.5 billion years ago, owing to the insolubility of copper under anaerobic conditions3.

Whereas the details of iron's and copper's involvement in numerous biological reactions have been studied for more than a century, the precise role of other metals, although recognized as vital trace elements in enzyme catalysis, has been elucidated only recently. Molybdenum (Mo) has during the last 2 decades been shown to constitute an essential cofactor in at least 3 distinct enzyme superfamilies4, the most widespread of which is the so-called Complex Iron-Sulfur Molybdoenzyme (CISM) superfamily of molybdo-pterin containing enzymes. Incidentally, this denomination ignores the fact that a few members of the family use tungsten (W) instead of molybdenum in their active sites. In the periodic table of elements, tungsten lies directly below molybdenum in the d-block and is thus expected to feature chemical properties related to those of Mo.

As for the case of copper, chemists and geochemists have argued that molybdenum is unlikely to have played a biological role prior to the advent of oxygenic photosynthesis5. This is in conflict with the molecular phylogeny of the CISM-superfamily which we present below.

Results

Individual enzymes from the CISM superfamily catalyze an astonishing variety of reactions with diverse substrates, some of which will be discussed in more detail below. Furthermore, many individual subfamilies comprise representatives from both Bacteria and Archaea. Setting out to reconstruct the composite phylogeny of a large part of the superfamily therefore implies dealing with strongly diverging sequences, as well as a high chance for the presence of several paralogs in the same species. Obtaining inferences from reconstructed phylogenies of enzyme superfamilies presents two major obstacles6,7,8,9. (1) The reliability of conventional multiple alignment algorithms frequently breaks down when strongly divergent sequences are analyzed. (2) The presence of several paralogs has led to erroneous annotations in the databases, blurring the evolutionary message suggested by the trees.

We have in the past turned to assisting multiple alignments by 3D structure comparisons6 which in many cases substantially improved alignment quality. The annotation problem was addressed by taking into account all available phylogenetic markers such as operon organization or functionally important amino acid residues7 to arrive at as reliable as possible enzyme identifications. The case of the CISM enzymes exacerbates the mentioned adversities for phylogenetic reconstruction due to their extraordinary variability in enzyme function and wide species distribution. Extensive resorting to methods remedying these problems therefore proved crucial for arriving at sufficiently reliable phylogenetic tree topologies.

Redressing annotations and multiple alignments

Amino acid sequences of a representative member from each enzyme family for which a 3D structure is available were used as queries for BLAST searches on completed and draft genomes in the NCBI-database. The cases of ethylbenzene dehydrogenase and acetylene hydratase were, despite the presence of 3D structures, not included in this analysis since only very few members of the respective families have been identified so far. For the sake of balanced weights of individual enzyme subtrees, the Nap and Nas families of nitrate reductases were omitted. The composite tree of the nitrate reductase enzymes was published previously9 and inclusion of Nap and Nas does not alter the topology of the Nar subtree9. In the absence of a 3D structure for a given family, i.e. arsenate reductase (Arr) for the ensemble of considered subfamilies, the member best characterized with respect to biochemistry and enzymatics was chosen as query.

Obtained hits were analyzed with respect to their genomic context, i.e. the ordering of flanking genes (coding for further subunits of the enzyme complexes) in corresponding gene clusters (as previously described7), the nature of residues known to play critical roles in the respective catalytic reactions of the various enzyme families and structural idiosyncrasies of individual families6. In all retrieved cases, phylogenetic clustering coincided with these three additional phylogenetic markers. Many retrieved sequences turned out to be incorrectly annotated (see Table_sequences.xls in Supplementary Information).

When using automated multiple alignment algorithms, positions of roots within subtrees varied greatly with the set of sequences considered due to sequence-sample-dependent fluctuations in alignments. A close inspection of these alignments revealed a significant lack of robustness for nearly half of the total sequence length. The presence of extensive insertions/deletions (indels) characterizing individual families obviously hampers reliable alignments between subfamilies. To overcome this problem, we have resorted to structural alignments of the catalytic Molybdenum-subunit based on available 3D structures. X-ray structures have in fact been reported for members of 5 out of 6 of the enzyme families dealt with in this work, i.e. nitrate reductase (pdb entry 1Q16), arsenite oxidase (1G8J), polysulfide reductase (2VPZ), dimethylsulfoxide (DMSO) reductase (4DMR) and formate dehydrogenase (1KQF). As described previously for the case of the Rieske protein superfamily6, Mo-subunits were structurally superimposed using DeepView.

The severity of discrepancies between the structural approach and conventional alignment methods will be exemplified by a pair of enzymes, nitrate reductase (Nar) and polysulfide reductase (Psr). Figure 1A shows the 3D structural alignment of these two enzymes with Nar in purple and Psr in blue and green. The green structural elements in Psr denote the parts of the sequence where automated and 3D alignments yield equivalent results. Fig. 1B schematically indicates the deviations between sequence-only (Clustal and T-Coffee) and the structure-based alignments for the Nar/Psr pair. Among the automated alignment algorithms tested (Clustal, Muscle, T-Coffee and Mafft), T-Coffee performed slightly better than the other methods (see Fig. 1B) while still aligning more than half of the sequence incorrectly. The structural alignment approach thus indeed yields results substantially different from those suggested by the automated alignment algorithms in sequence areas which do not feature very high sequence homology, that is, in the major part of the protein.

Figure 1

(A) Structural superposition of the catalytic Molybdenum subunits from respiratory nitrate reductase (Nar, light and dark purple) and polysulfide reductase (Psr, blue and green). (B) Schematic sequence alignments between Nar and Psr. The second and third lines represent the alignments obtained fully automatically by ClustalX and T-Coffee whereas the bottom-line alignment is based on the structural information gathered from (A) and further processed as described above. In both (A) and (B), light purple stretches indicate parts of the Nar polypeptide chain for which no structural equivalents exist in Psr. The same applies for the light blue sequence stretches in Psr. The green stretches denote the part of the sequences for which Clustal- or T-Coffee- and structure-based alignments coincide.
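The degree to which an automated alignment reproduces a structure-based one can be expressed as the fraction of residue pairs on which both alignments agree. The following is a minimal sketch of such a measure, not the procedure used for Fig. 1B; the function names and the toy sequences are hypothetical, and the structure-based alignment is simply assumed to be the reference.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs that share a column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def agreement(test_rows, reference_rows):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference = aligned_pairs(*reference_rows)
    if not reference:
        return 0.0
    return len(reference & aligned_pairs(*test_rows)) / len(reference)


# Toy example (hypothetical sequences, not Nar/Psr data).
structure_based = ("ACDE-FG", "AC-EHFG")
automated = ("ACDEFG", "ACEHFG")
score = agreement(automated, structure_based)
```

In this toy case four of the five residue pairs of the reference alignment are recovered, so the score is 0.8.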

The multiple sequence alignment of stretches for which the structural approach has shown structural equivalence was subsequently refined using ClustalX and Seaview. The family for which no X-ray structure has so far been reported, i.e. Arr, was first individually submitted to a multiple alignment procedure and the obtained alignment result subsequently profile-aligned to the bulk of structurally aligned sequences.

In summary, if remedying erroneous annotations was omitted, “messy” trees deviating from species phylogenies, very much like the one we obtain for the Arr enzyme, were obtained for the whole superfamily, misleadingly indicating extensive lateral gene transfer. Exclusive reliance on sequence-based multiple alignments, on the other hand, resulted in roots sliding into either the archaeal or the bacterial domains in several subfamilies (e.g. in Nar), which would falsely suggest a post-LUCA emergence of the corresponding enzyme followed by inter-domain transfer.

These two observations may rationalize the divergence between the conclusions presented in a recent purely bioinformatic study of evolutionary histories of almost 4000 gene families10 and our results to be presented below. This analysis10 suggested a late appearance of Mo-based enzymes and in particular Nar (and consequently the denitrifying pathway). This latter conclusion is in conflict with molecular phylogenies of other enzymes involved in this pathway8,9 and appears unlikely in the light of several paleogeochemical data11,12,13 which suggest large scale abiotic generation and hence abundance of nitrogen oxides, the substrates of the denitrifying pathway, in the Hadean and early Archaean. Moreover, another bioinformatic survey of enzyme superfamilies also indicated the presence of the CISM-superfamily significantly before the Archaean/Proterozoic transition14. However, this latter study does not allow conclusions to be drawn as to the absence or presence of these enzymes in LUCA.

In this context, we furthermore need to note that the vast majority of previously published phylogenies on the CISM superfamily suffer from the methodological flaw that in calculating trees the possibility of multiple substitutions was not taken into account. This approach would be valid for relatively closely related sequences but is certainly erroneous when members from different subfamilies in both prokaryotic domains are compared to each other. The resulting trees suffer from weak resolution in deep branching and therefore preclude the kind of analysis we present in this work.

Composite phylogenetic tree of CISM superfamily enzymes

The phylogenetic tree encompassing several CISM subfamilies based on the 3D-structure assisted multiple alignment and taking multiple substitutions into account is shown in Fig. 2. Whereas specific subfamilies (see below) indeed appear to have originated only at a later stage, the majority of subfamilies show distinctive features indicating their presence in the last universal common ancestor (LUCA), that is, well before the rise of O2 in the environment1. Figure 2A shows a schematic representation of the composite phylogeny of several CISM-enzymes and the detailed tree is available as Supplementary Information.

Figure 2

(A) 3D structure-based NJ-phylogenetic tree (see Supplementary Information) of the CISM-enzymes nitrate reductase (Nar), DMSO/TMAO reductase (Dms/Dor/Tor), arsenate reductase (Arr), formate dehydrogenase (Fdh), arsenite oxidase (Aro) and polysulfide reductase (Psr). Violet and orange denote eury- and crenarchaeal branches, dark green, cyan and light green stand for Proteobacteria, Firmicutes and other Bacteria, respectively. Open and closed dots indicate bootstrap values for the deep branchings exceeding 70 and 90%, respectively. In 3D structures, the Mo-subunit is in violet. (B) Redox potential range of relevant substrates and corresponding enzymes.

It is noteworthy that a given clade is tacitly assumed to correspond to an enzyme subfamily with a unique function. This inference is supported for the Nar and the Fdh clades by experimental evidence showing that archaeal and bacterial representatives perform the same chemical reaction. The presence of clade-specific particularities of substrate binding-site residues in all clades as mentioned above further corroborates the one-clade/one-enzyme notion.

Discussion

The clades representing DMSO/TMAO reductase (Dms/Dor/Tor) and arsenate reductase (Arr) contain only Bacteria and furthermore strongly diverge from 16S rRNA-based species trees. They therefore likely are late-emerging enzymes distributed predominantly via horizontal gene transfer. The clades corresponding to formate dehydrogenase (Fdh), polysulfide reductase (Psr), arsenite oxidase (Aro) and nitrate reductase (Nar) by and large resemble species trees, feature a prominent Archaea/Bacteria cleavage and their roots fall in between the archaeal and bacterial subtrees. The combined occurrence of these features strongly suggests that the enzymes making up these clades were present in LUCA. The structural unit of the CISM protein thus appears to have served multiple purposes for life, especially in energy harvesting, right from its very beginnings. These results allow inferences on the available geochemical energy substrates. CISM enzymes in LUCA likely performed energy conversion through the reduction of carbon dioxide, polysulfide or nitrate as well as through the oxidation of arsenite. Reduction of CO2 and sulfur with H2 as electron donor would be viable bioenergetic pathways in the geochemical setting of the early Archaean and have indeed been put forward as ancestral bioenergetic mechanisms15,16. Bioenergetic oxidation of arsenite in the early Archaean has been proposed previously. The latter process requires sufficiently oxidizing electron acceptors which may have been nitrogen oxides (ref. 8 and references cited therein), a scenario supported by the pre-LUCA character of Nar (Fig. 2), a key enzyme of the denitrification pathway. The ensemble of pre-LUCA bioenergetic CISM enzymes in fact suggests that LUCA's energy conserving processes tapped into electrochemical disequilibria approaching 1 V (Fig. 2B) and furthermore bears witness to the astonishing redox versatility of the Mo-cofactor.
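To give a feeling for the energetic scale of an electrochemical disequilibrium approaching 1 V, the standard relation ΔG = −nFΔE converts a redox potential difference into a molar free energy. The sketch below is purely illustrative; the choice of a 2-electron transfer (n = 2) is our assumption for the example, and the function name is hypothetical.

```python
FARADAY = 96485.332  # Faraday constant in C/mol


def gibbs_energy_kj_per_mol(delta_e_volts, n_electrons=2):
    """Free energy change (kJ/mol) for n electrons crossing a potential span.

    delta_e_volts is the acceptor potential minus the donor potential;
    a positive span gives a negative (exergonic) free energy change.
    """
    return -n_electrons * FARADAY * delta_e_volts / 1000.0


# A 2-electron transfer across a span of 1 V (assumed values).
delta_g = gibbs_energy_kj_per_mol(1.0)
```

Under these assumptions the span corresponds to roughly −193 kJ per mole of 2-electron donor oxidized.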

This redox versatility in part certainly arises from the fact that molybdenum and tungsten are 2-electron redox compounds, that is, they can shuttle between the +4/+5 and the +5/+6 redox couples. Moreover, it is precisely this capacity of Mo and W for 2-electron transitions which allows these two elements to perform energetically challenging redox conversions. As detailed in references 17 and 18, several 2-electron compounds under certain circumstances feature so-called crossed-over individual redox transitions, which allow them to redox bifurcate electrons with one of the two reducing equivalents going seemingly uphill towards very low potential electron acceptors18. Specific members of the CISM family have been shown to feature crossed-over redox potentials19. It is tempting to hypothesize that the possibility to redox bifurcate electrons may play a crucial role in energetically challenging redox reactions such as the reduction of CO2 to formate in aceto- and methanogens. Both these reactions indeed rely on members of the CISM superfamily.
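The notion of crossed-over potentials can be made concrete with the textbook thermodynamics of a generic 2-electron compound. In the sketch below, e_first denotes the potential of the first one-electron reduction (for Mo, the +6/+5 couple) and e_second that of the second (the +5/+4 couple); the transitions are crossed over when e_second exceeds e_first, in which case the one-electron intermediate is unstable and acts as a strong reductant. The numerical potentials are hypothetical and do not refer to any measured enzyme; the function names are ours.

```python
import math

# RT/F in volts at 25 degrees C (298.15 K).
RT_OVER_F = 8.314462618 * 298.15 / 96485.33212


def is_crossed_over(e_first, e_second):
    """True when the second one-electron reduction is easier than the first."""
    return e_second > e_first


def two_electron_midpoint(e_first, e_second):
    """Midpoint potential (V) of the overall 2-electron couple."""
    return (e_first + e_second) / 2.0


def intermediate_stability(e_first, e_second):
    """Equilibrium constant for: oxidized + fully reduced <-> 2 intermediates."""
    return math.exp((e_first - e_second) / RT_OVER_F)


# Hypothetical crossed-over couple (volts), chosen only for illustration.
e_first, e_second = -0.30, 0.10
crossed = is_crossed_over(e_first, e_second)
midpoint = two_electron_midpoint(e_first, e_second)
stability = intermediate_stability(e_first, e_second)
```

With these assumed values the 2-electron midpoint lies at −0.10 V, whereas the electron leaving the intermediate state does so at −0.30 V, which illustrates how one of the two reducing equivalents can reach acceptors well below the average potential of the couple.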

Apart from informing on the nature of primordial energy sources, the presence of the CISM superfamily in LUCA implies a vital role of its metal cofactors in early life. As mentioned, a few members of the superfamily use W rather than Mo and a few “Mo-enzymes” have been shown to insert W under Mo-depletion conditions or at high temperatures20. However, many other members have turned out to be specific for Mo.

What then of the availability of these two transition metals? W occurs in both acid and alkaline solutions and was thus available to emerging life21, whereas Mo is relatively insoluble in reduced and neutral waters5, but does occur in mixed valence sulfide and selenide and/or oxide complexes in alkaline solutions. Mo's insolubility at neutral pH values, exacerbated by an anoxic atmosphere1, suggested a low bioavailability of this element for early life1,3. Mo-isotope analyses on samples from the Archaean era indeed show substantially lower levels than during Phanerozoic times22. Two scenarios can reconcile the results of molecular phylogeny and paleogeochemistry. (i) The ancestral CISM enzyme exclusively used W which was later replaced by Mo. (ii) CISM-catalyzed reactions in early life used Mo supplied by alkaline hydrothermal vents, proposed as cradles for life23. The exclusiveness for Mo of many CISM-members as well as findings that primary productivity involving Mo has been comparable to the present since the geological record began at 3.8 Ga24 lead us to favor the second scenario.

The results described above thus require that certainly tungsten and most likely molybdenum ought to be added to the list of metals vital already to earliest life on Earth24. Both Mo and W are heavier than Fe and thus are not produced in standard stellar fusion reactions. Their nucleosynthesis requires the much higher energies only attained in giant stars and during supernova explosions. Intriguingly, astrophysical data indicate a crucial role of a nearby supernova in the birth of our solar system25. The origin of life on our and any other wet rocky world may thus have been assisted by (and potentially depended on) supernova-generated elements.

Methods

Sequence retrieval and multiple alignment

Amino acid sequences of CISM superfamily members were retrieved via BLAST searches on the NCBI-database (http://www.ncbi.nlm.nih.gov) using biochemically or structurally characterized representatives of the subfamilies nitrate reductase (Nar), polysulfide reductase (Psr), formate dehydrogenase-N (Fdh), DMSO reductase (Dms/Dor/Tor), arsenite oxidase (Aro) and arsenate reductase (Arr) as queries. The number of hits used for each major species phylum was limited in order to avoid unreadable trees induced by the strong oversampling of specific classes of prokaryotes in the databases (e.g. proteobacteria or pathogenic actinobacteria). Gene cluster context was assayed by analyzing the genomic environment of each sequence on the respective genome web-pages of the NCBI server.
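Limiting the number of hits per phylum can be sketched as a simple capping step applied to a ranked hit list. The example below is one possible way to do this and is not a description of the procedure actually applied; the function name, the cap value and the sequence identifiers are hypothetical.

```python
from collections import defaultdict


def cap_hits_per_phylum(hits, max_per_phylum):
    """Keep at most max_per_phylum hits for each phylum.

    hits is an iterable of (sequence_id, phylum) tuples, assumed to be
    ordered from best to worst hit, so the best hits of each phylum survive.
    """
    kept = []
    counts = defaultdict(int)
    for sequence_id, phylum in hits:
        if counts[phylum] < max_per_phylum:
            kept.append((sequence_id, phylum))
            counts[phylum] += 1
    return kept


# Hypothetical hit list with oversampled Proteobacteria.
hits = [
    ("seq1", "Proteobacteria"),
    ("seq2", "Proteobacteria"),
    ("seq3", "Crenarchaeota"),
    ("seq4", "Proteobacteria"),
    ("seq5", "Firmicutes"),
]
balanced = cap_hits_per_phylum(hits, max_per_phylum=2)
```

Here the third proteobacterial hit is dropped while all hits from the less densely sampled phyla are retained.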

Multiple alignments of recognized subfamilies were automatically produced using Clustal26, MEGA27, Mafft (http://mafft.cbrc.jp/alignment/software/about.html), Muscle28 and T-Coffee29 and in some cases subsequently refined with respect to functionally conserved residues using Seaview30. The second and third line sequence alignments in Fig. 1B were performed using Clustal and T-Coffee, respectively.

The ensemble of enzymes for which 3D structures are available was aligned via DeepView31 (www.expasy.org/spdbv/; see Figure and alignment file in Supplementary Information).

Phylogenetic tree reconstruction

NJ-trees were reconstructed using ClustalX26 and MEGA-427. These two programs yielded tree topologies differing only in a few very late branching orders. The tree shown in Fig. 2A and available as phb-file in Supplementary Information has been obtained using ClustalX. Gap positions were taken into account. Multiple substitutions were allowed for using Kimura's correction algorithm26. Calculated bootstrap values correspond to the frequency of occurrence of nodes in 1,000 bootstrap replicates.
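The two numerical ingredients named above, the correction for multiple substitutions and the bootstrap, can be illustrated as follows. The sketch assumes the form of Kimura's protein distance correction commonly given for Clustal, K = −ln(1 − D − D²/5), where D is the observed fraction of differing residues; it is not the ClustalX code itself. Skipping columns in which either sequence carries a gap is a simplification of ours, and all function names are hypothetical.

```python
import math
import random


def p_distance(seq_a, seq_b, gap="-"):
    """Observed fraction of differing residues between two aligned rows."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # simplification: ignore columns with a gap in either row
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def kimura_distance(p):
    """Kimura-corrected distance allowing for multiple substitutions."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in columns) for row in alignment]


# Toy alignment (hypothetical sequences).
alignment = ["ACDEFGHK", "ACDQFGHR", "TCDEFG-K"]
corrected = kimura_distance(p_distance(alignment[0], alignment[1]))
replicate = bootstrap_columns(alignment, random.Random(0))
```

The corrected distance is always at least as large as the observed one, and the gap between the two grows with divergence, which is why neglecting the correction mainly distorts the deep branches. A bootstrap value for a node would then be the fraction of such replicates whose tree recovers that node.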