Characterization of Amino Acid Recognition in Aminoacyl-tRNA Synthetases

The genetic code and translation are key to all life. As a consequence, all kingdoms and species share the enzymes known as aminoacyl-tRNA synthetases, which link amino acids to their codons. For life to flourish, it is vital that these enzymes correctly implement the genetic code and hence correctly recognize amino acids. There are many theories on the emergence of aminoacyl-tRNA synthetases and their amino acid binding sites. While many insights have been gained from sequence analysis, the basis of accurate amino acid recognition remains elusive. Here we use a novel approach to analyze all currently available aminoacyl-tRNA synthetase structures, which cover the recognition of all proteinogenic amino acids across all kingdoms of life. For the first time, we extensively characterize and quantify the interactions between aminoacyl-tRNA synthetases and bound amino acids. Our results show how different interaction features are used to delineate between the two major enzyme classes and the individual amino acids. Furthermore, we show that these features are conserved across a wide variety of species. Quantifying the similarity between the recognition of individual amino acids allows us to pinpoint where the genetic code is vulnerable to encoding errors and where additional correction mechanisms had to evolve.


Introduction
One of the most profound open questions in biology is how the genetic code was established. While proteins are encoded by nucleic acid blueprints, decoding this information in turn requires proteins. The emergence of this self-referencing system poses a chicken-or-egg dilemma and its origin is still heavily debated 1,2. Aminoacyl-tRNA synthetases (aaRSs) implement the correct assignment of amino acids to their codons and are thus inherently connected to the emergence of genetic coding. These enzymes link tRNA molecules with their amino acid cargo and are consequently vital for protein biosynthesis. Besides the correct recognition of tRNA features 3, highly specific non-covalent interactions in the binding sites of aaRSs are required to correctly recognize the designated amino acid [4][5][6][7] and to prevent errors in biosynthesis 5,8. The minimization of such errors represents the utmost barrier for the development of biological complexity 9 and accurate specification of aaRS binding sites is proposed to be one of the major determinants for the closure of the genetic code 10. Besides binding site features, recognition fidelity is controlled by the ratio of concentrations of aaRSs and cognate tRNA molecules 11 and may involve secondary structures 12,13.
triphosphate (ATP) and an aminoacyl-adenylate intermediate is formed 36,37. In general, the binding sites of aaRSs can be divided into two moieties: the part where ATP is bound as well as the part where specific interactions with the amino acid ligand are established (Fig. 1). It is assumed that the amino acid activation with ATP constituted the principal kinetic barrier for the creation of peptides in the prebiotic context 35. Due to the fundamental importance of this first reaction step, highly conserved sequence 4 and structural motifs 38 exist, which are likely to be vital for the aminoacylation reaction.
While the activation of amino acids with ATP is the unifying aspect of all aaRSs, the recognition mechanism of individual amino acids differs substantially between each aaRS. These differences are among the key drivers to maintain a low error rate during the translational process.
Figure 1. The aaRS·tRNA complex and the architecture of its active site. The enzyme catalyzes the covalent attachment of an amino acid to the 3' end of a tRNA molecule. The binding site itself can be divided into two moieties. While the ATP moiety is responsible for constant fixation of ATP across all aaRSs 38, the specificity-conferring moiety differs between each aaRS and forms highly specific non-covalent interactions with the amino acid ligand.
gradual appearance of new amino acids and their incorporation into the genetic code.
Figure 2. The genetic code relies on the specificity of aminoacyl-tRNA synthetases to ensure the correct mapping of amino acids to their codons. The aaRS enzymes disentangle the recognition space of amino acids to reduce errors during protein synthesis. This study provides a thorough characterization of the mechanisms that drive this specificity. We identified non-covalent interactions in the binding sites of aaRSs, binding site residue composition, editing mechanisms, and binding site volume as key determinants for specific amino acid recognition.

Results
Dataset Based on all available structures in the PDB, 424 (189 Class I, 235 Class II) three-dimensional structures of aaRSs co-crystallized with their corresponding amino acid ligands were analyzed. The selected data covers aaRSs of 56 different species in total, 180 from eukaryotes, 213 from bacteria, and 31 from archaea ( Supplementary Fig. S1). In total, 70 human structures are part of the dataset. Each protein chain that contains a protein-ligand complex of a catalytic aaRS domain was considered. Data was available for each of the 20 aaRSs, plus the non-standard aaRSs pyrrolysyl-tRNA synthetase (PylRS) and phosphoseryl-tRNA synthetase (SepRS). The numbers of protein-ligand complexes available for each aaRS are given in Supplementary Fig. S2. For twelve aaRSs, protein-ligand complexes were available in both pre-activation and post-activation reaction states, i.e. co-crystallized with either amino acid or aminoacyl ligand ( Supplementary Fig. S3).

Interaction Features
The frequencies of observed non-covalent binding site interactions with respect to aaRS class and interaction type are shown in Tab. 1. In general, hydrophobic interactions are the most prevalent interactions for Class I aaRSs, accounting for 44.60% of all interactions in that class, while hydrogen bonds are most frequently observed in Class II aaRSs with a frequency of 59.23%. Five interaction types were found in aaRSs: hydrogen bonds, hydrophobic interactions, salt bridges, π-stacking, and metal complexes. No π-cation interactions were found to be involved in amino acid binding. Water-mediated hydrogen bonds were excluded from the analyses because data for water molecules are missing for the majority of the crystallographic structures.
Table 1. Overview of observed interactions between aaRSs and their amino acid ligands. The most prevalent interactions are hydrophobic interactions for Class I aaRSs and hydrogen bonds for Class II aaRSs (typeset in bold). Relative frequencies with respect to all interactions of the aaRS class are given in parentheses.

Amino Acid Recognition
The annotation of non-covalent protein-ligand interactions made it possible to characterize the interaction preferences of each aaRS at the level of individual atoms of their amino acid ligands. This analysis highlights the preferred modes of binding for each of the 22 amino acid ligands. Figure 3 shows the occurring interactions for each aaRS based on the analysis with PLIP. Each interaction is annotated with its occupancy, i.e. the relative frequency of occurrence with respect to the total number of structures for this aaRS. Binding site features are neglected at this point and all interactions are shown with respect to the amino acid ligand.
Figure 3. The recognition of individual amino acids by aaRSs mapped to their ligands. The ligands are grouped by physicochemical properties 43 and aaRS class. Different types of non-covalent protein-ligand interactions were determined with PLIP 41 and assigned to individual atoms of the ligand using subgraph isomorphism detection 44. Backbone atoms of the ligand are depicted as circles without filled interior. The relative occupancy of each interaction with respect to the total number of investigated structures (number in parentheses for each aaRS) is given by pie charts. Interactions with an occupancy below 0.1 are neglected. Interactions for which a unique mapping to an individual atom is not possible due to ambiguous isomorphism, e.g. for the side chain of valine, were assigned to multiple atoms. The aaRSs conducting error correction via editing mechanisms are typeset in bold.
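The occupancy described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the tuple layout (atom name, interaction type), and the toy data are hypothetical and do not reproduce PLIP output or the authors' pipeline. Only the definition of occupancy and the 0.1 cutoff are taken from the text.

```python
from collections import Counter


def interaction_occupancy(structures, min_occupancy=0.1):
    """Relative frequency of each (ligand atom, interaction type) pair.

    `structures` holds one entry per structure of a given aaRS; each entry
    is an iterable of (atom_name, interaction_type) tuples (assumed layout).
    Pairs with an occupancy below `min_occupancy` are dropped.
    """
    if not structures:
        return {}
    counts = Counter()
    for interactions in structures:
        # Count every interaction at most once per structure.
        counts.update(set(interactions))
    total = len(structures)
    return {
        key: count / total
        for key, count in counts.items()
        if count / total >= min_occupancy
    }


# Hypothetical toy data: three structures of one aaRS.
toy = [
    [("N", "hbond"), ("CB", "hydrophobic")],
    [("N", "hbond")],
    [("N", "hbond"), ("O", "saltbridge")],
]
occupancy = interaction_occupancy(toy)
```

In this toy case the backbone hydrogen bond is present in every structure, while the two side interactions each occur in one of three structures and would still pass the 0.1 cutoff.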
Class I In general, Class I aaRSs interact mainly via hydrogen bonds and hydrophobic interactions with the ligand. For all Class I ligands, the primary amine group of the backbone is involved in hydrogen bonding. The occupancy of this interaction is high throughout all Class I aaRSs, indicating a pivotal role of this interaction for ligand fixation. Additionally, the oxygen atom of the ligand's carboxyl group is involved in hydrogen bonding except for glutaminyl-tRNA synthetase (GlnRS), isoleucyl-tRNA synthetase (IleRS), and valyl-tRNA synthetase (ValRS). The same atom forms additional salt bridges in leucyl-tRNA synthetase (LeuRS), arginyl-tRNA synthetase (ArgRS), methionyl-tRNA synthetase (MetRS), and glutamyl-tRNA synthetase (GluRS). The side chains of the aliphatic amino acids leucine, isoleucine, and valine are exclusively bound via hydrophobic interactions. ArgRS and GluRS form salt bridges between binding site residues and the charged guanidine and carboxyl groups of the ligand, respectively. Glutamine is bound by GlnRS via conserved hydrogen bonds to the amide group and hydrophobic interactions with the beta and delta carbon atoms. The two aromatic amino acids tyrosine and tryptophan are recognized by dedicated π-stacking interactions and extensive hydrophobic contact networks. Tryptophan is preferably bound from one side of its indole group, at positions one, six, and seven. The sulfur atom of the cysteinyl-tRNA synthetase (CysRS) ligand forms a metal complex with a zinc ion in both structures. MetRSs bind their ligand with a highly conserved hydrophobic interaction with the beta carbon atom.
Class II Class II aaRSs consistently interact with the backbone atoms of the ligand via hydrogen bonds and salt bridges. The primary amine group forms hydrogen bonds with high occupancy and is involved in metal complex formation in threonyl-tRNA synthetases (ThrRSs) and seryl-tRNA synthetases (SerRSs). The carboxyl oxygen atoms of the ligands are bound by a combination of hydrogen bonding and electrostatic salt bridge interactions. The overall backbone interaction pattern is highly conserved within Class II aaRSs. Closer investigation revealed that a previously described structural motif of two arginine residues 38, responsible for ATP fixation, seems to be involved in stabilizing the amino acid carboxyl group with its N-terminal arginine residue. The charged amino acid ligands in histidyl-tRNA synthetase (HisRS) and lysyl-tRNA synthetase (LysRS) form highly conserved hydrogen bonds with the binding site residues. Other specificity-conferring interactions include π-stacking interactions and hydrophobic contacts observed for phenylalanyl-tRNA synthetase (PheRS), metal complex formation with zinc for ThrRS and SerRS, and salt bridges as well as hydrogen bonds for aspartyl-tRNA synthetase (AspRS). The amino acids alanine and proline are bound by alanyl-tRNA synthetases (AlaRSs) and prolyl-tRNA synthetases (ProRSs) via hydrophobic interactions. No specificity-conferring interactions can be described for the smallest amino acid glycine due to the absence of a side chain. Hence, glycyl-tRNA synthetase (GlyRS) can only form interactions with the backbone atoms of the ligand. Furthermore, asparaginyl-tRNA synthetases (AsnRSs) mediate highly conserved hydrogen bonds with the amide group of their asparagine ligand. The non-standard amino acid pyrrolysine is bound by PylRS via several hydrogen bonds and hydrophobic interactions with the pyrroline group. SepRSs mainly employ salt bridges to fix the phosphate group of the phosphoserine ligand.
45 for a detailed discussion of editing mechanisms) in order to ensure proper mapping of amino acids to their cognate tRNAs. The similarity of interaction preferences depicted in Fig. 3 suggests that groups of very similar amino acids require editing mechanisms for their correct handling. In particular, the three aliphatic amino acids isoleucine, leucine, and valine are bound via nonspecific and weak hydrophobic interactions, substantiating the necessity of the editing mechanisms observed for their aaRSs 46. A similar need can be observed, e.g., for AlaRS 47, which has to distinguish alanine from serine and glycine.
Binding Site Geometry and Volume We investigated binding site geometry and volume in order to quantify their potential contribution to amino acid recognition. Known editing mechanisms in aaRSs focus on the prevention or correction of tRNA mischarging within one aaRS class (intra-class), e.g. among isoleucine, leucine, and valine, which are all handled by Class I aaRSs. However, GluRSs and AspRSs have a highly similar interaction pattern of hydrogen bonds and salt bridges with the carboxyl group and weak hydrophobic interactions. Neither aaRS uses editing, and the two enzymes belong to different aaRS classes. In this case, the geometry and size of the binding site can act as an additional layer of selectivity; a mechanism also exploited by ValRS 46,48. To quantify the contribution of binding site geometry, seven structures of GluRS and six structures of AspRS were superimposed with respect to their common adenine substructure using the Fit3D 49 software. As this superimposition can only be computed for protein-ligand complexes that resemble the post-reaction state, only a subset of the structures was used. The results show that the ligands of GluRSs and AspRSs are oriented towards different sides of a plane defined by their common adenine substructure (Fig. 4A). There is a significant difference (Mann-Whitney U p<0.01) in ligand orientation, described by the torsion angle between phosphate and the amino acid substructure of the ligand (Fig. 4B). Class I GluRSs feature a torsion angle of 54.64 ± 7.12°, whereas the torsion angle of Class II AspRSs is -65.02 ± 7.40°. Furthermore, the volume of the specificity-conferring moiety of the binding site (see Fig. 1) was estimated with the POVME 50
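
The torsion angle comparison reported for GluRS and AspRS can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the choice of the four atoms defining the torsion is not specified in the text, the angle values are invented toy numbers that merely resemble the reported means, and the exact permutation form of the Mann-Whitney U test (valid for small samples without ties) may differ from the variant the authors used.

```python
import math
from itertools import combinations


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four 3D points."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    y = dot(b1, cross(n1, n2))
    x = math.sqrt(dot(b1, b1)) * dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def mann_whitney_exact(a, b):
    """Exact two-sided Mann-Whitney U test; assumes no tied values."""
    pooled = sorted(a + b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n1, n2 = len(a), len(b)

    def u_from_ranks(ranks):
        return sum(ranks) - n1 * (n1 + 1) / 2

    u_obs = u_from_ranks([rank[value] for value in a])
    mean_u = n1 * n2 / 2
    extreme = total = 0
    # Enumerate every possible assignment of ranks to the first sample.
    for combo in combinations(range(1, n1 + n2 + 1), n1):
        total += 1
        if abs(u_from_ranks(combo) - mean_u) >= abs(u_obs - mean_u):
            extreme += 1
    return u_obs, extreme / total


# Invented toy angles (degrees): seven GluRS-like, six AspRS-like values.
glu_angles = [54.6, 47.1, 61.9, 50.3, 58.8, 66.0, 43.7]
asp_angles = [-65.0, -72.4, -58.1, -61.3, -70.9, -55.6]
u_stat, p_value = mann_whitney_exact(glu_angles, asp_angles)
```

With two completely separated groups of seven and six values, only 2 of the 1716 possible rank assignments are as extreme as the observed one, so the exact p-value falls below 0.01.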

Interaction Patterns of Individual aaRSs
In addition to the investigation of interaction preferences from the ligand point of view, the binding sites of each aaRS were analyzed regarding the residues that form interactions with the amino acid ligand. The interactions were mapped to a unified sequence numbering for each aaRS, which is based on multiple sequence alignments (MSAs) (see Methods and Supplementary File S1). Original sequence numbers for each position can be inferred with mapping tables provided in Supplementary File S2. Figure 5A shows a sequence logo 51 representation of binding site interactions for AlaRS. Each colored position in the sequence logo represents interactions occurring at this position. Highly conserved interactions can be observed at renumbered On the protein side, this interaction is mediated by a conserved arginine residue that corresponds to the N-terminal residue of the previously described Arginine Tweezers motif 38. Another prominent interaction is formed by valine at renumbered position 293. This residue interacts with the beta carbon atom of the alanine ligand via hydrophobic interactions. In some structures, this hydrophobic interaction is complemented by an alanine residue at renumbered position 325. Aspartic acid at renumbered position 323 is highly conserved in AlaRSs and seems to be involved in amino acid fixation via hydrogen bonding of the primary amine group. Overall, the specificity-conferring interactions with the small side chain of alanine are hydrophobic contacts. An example of amino acid recognition in AlaRSs is given in Fig. 5B. The structure of bacterial Escherichia coli AlaRS forms the whole array of observed interactions. Sequence logos of the remaining aaRSs are given in Supplementary Fig. S4 to S24. Based on the interactions between binding site residues and the ligand, a qualitative summary of specificity-conferring mechanisms and key residues was composed (Table 2).
Moreover, we checked whether the number of observed interactions depends on ligand size. There is only a weak correlation between the average number of interacting binding site residues for each aaRS and the number of non-hydrogen atoms of the amino acid ligand (Pearson r=0.3520). This indicates that the number of formed interactions is mostly independent of the ligand size, i.e. smaller amino acids do not necessarily have a less complex recognition pattern. The distributions of interacting binding site residues for each aaRS are given in Supplementary Fig. S25.

Quantitative Comparison of Ligand Recognition
To allow for a quantitative analysis and comparison of ligand recognition between several aaRSs, interaction and binding site features were represented as binary vectors, so-called interaction fingerprints (see Methods). Based on these fingerprints, the Jaccard distance was computed for each pair of structures to represent the dissimilarity in ligand recognition. Subsequently, the Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) algorithm 52 was used for dimensionality reduction and embedding of the high-dimensional fingerprints into two dimensions for visualization. This embedding is considered to be the recognition space of aaRSs. Figure 6A shows the embedding results for all aaRSs in the dataset colored according to the aaRS classes. A Principal Component Analysis (PCA) of the same data is given in Supplementary Fig. S26. For each aaRS the average position of all data points in the embedding space was calculated and is shown as a one-letter code label. Fig. 6B shows the same data colored according to the physicochemical properties of the amino acid ligand, i.e. positive (lysine, arginine, and histidine), aromatic (phenylalanine, tyrosine, and tryptophan), negative (aspartic acid and glutamic acid), polar (asparagine, cysteine, glutamine, proline, serine, and threonine), and nonpolar (glycine, alanine, isoleucine, leucine, methionine, and valine).
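
The Jaccard distance between two binary interaction fingerprints can be sketched as follows. The fingerprint contents and the function names are hypothetical; the actual fingerprint layout is defined in the Methods and is not reproduced here.

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard distance between two equally long binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two empty fingerprints are treated as identical (a convention choice).
    return 0.0 if either == 0 else 1.0 - both / either


def pairwise_distances(fingerprints):
    """Symmetric matrix of Jaccard distances for a list of fingerprints."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical five-bit fingerprints of three structures.
fingerprints = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
]
matrix = pairwise_distances(fingerprints)
```

The first two toy fingerprints share two of four set bits, giving a distance of 0.5, while the third shares none with the first and is at distance 1.0. A matrix of this kind is what a dimensionality reduction method operating on precomputed distances would take as input.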

Class I In terms of amino acid binding, the two aaRS classes seem to employ different overall mechanisms; they separate almost perfectly in the embedding space. In particular, aromatic amino acid recognition in Class I tryptophanyl-tRNA synthetases (TrpRSs) and tyrosyl-tRNA synthetases (TyrRSs) is distinct from Class II aaRSs and forms two outgroups in the embedding space. Remarkably, two different recognition mechanisms exist for TrpRSs, indicated by two clusters at approximately (-2.0,6.0) and (1.0,8.5) in the embedding space. The cluster at (-2.0,6.0) is formed by structures from bacteria and archaea, while the cluster at (1.0,8.5) is formed by structures from eukaryotes and archaea and lies in proximity to TyrRSs. Closer investigation of two representatives from these clusters shows two distinct forms of amino acid recognition for TrpRSs. Human TrpRSs employ a tyrosine residue to bind the amine group of the indole ring, while prokaryotes employ different residues (Supplementary Fig. S27). The Class I aaRSs that are closest to Class II are GluRSs and CysRSs. A cluster of high density is formed by Class I IleRS, MetRS, and ValRS, which handle aliphatic amino acids. This indicates closely related recognition mechanisms and difficult discrimination between these amino acids.
Class II For Class II aaRSs the recognition space is less structured. Nonetheless, clusters are formed that coincide with individual Class II aaRSs, e.g. AlaRSs, which show a distinct recognition mechanism. The aaRSs handling the small and polar amino acids threonine, serine, and proline are close neighbors in the embedding space. Glycine recognition by GlyRSs seems to be diverse; GlyRSs are not grouped in the embedding space. However, the recognition of glycine, which has no side chain, is inherently limited and thus the fingerprinting approach might fail to capture subtle recognition features. AspRSs and AsnRSs are located next to each other in the embedding space. Their recognition mechanisms seem to be very similar, as the two amino acids differ only by a single atom in the carboxylate and amide group, respectively.

Mechanisms That Drive Specificity
In order to quantify the influence of different aspects of binding site evolution on amino acid recognition by aaRSs, different interaction fingerprint designs were compared against each other. Each design includes varying levels of information and combinations thereof: binding site composition (Seq), protein-ligand interactions (Int), editing mechanisms (Ed), and binding site volume (Vol). To assess the segregation power of each fingerprint variant, the mean silhouette coefficient 53 over all data points was calculated. This score measures the extent to which the recognition of one aaRS differs from that of other aaRSs and how similar it is within its own group. Perfect discrimination between all amino acids would give a value close to one, while a totally random assignment corresponds to a value of zero. Negative values indicate that the recognition of a different aaRS is rated to be more similar than the recognition of the same aaRS. Figure 7 shows the results of this comparison. When using pure sequence-based fingerprints (Seq_sim), the mean silhouette coefficient over all samples is -0.0510, which indicates many overlapping data points and nonspecific recognition. By including interaction information (Seq, Int) the value increases to 0.1361. If editing is considered (Seq, Int, Ed), a further improvement with a silhouette coefficient of 0.2731 can be observed. Adding volume information (Seq, Int, Ed, Vol) slightly increases the quality of the embedding to 0.2757. The silhouette coefficients for editing- and volume-based fingerprints were calculated as a baseline comparison. If only editing information (Ed) is considered, the mean silhouette coefficient amounts to -0.3027. For binding site volume fingerprints (Vol) the mean silhouette coefficient is -0.4682.
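
To make the meaning of the score concrete, the following sketch computes a mean silhouette coefficient from a precomputed distance matrix and group labels. It follows the standard definition, with singleton groups scored as zero by convention; the toy matrix and labels are invented and the function name is hypothetical, so this is not the authors' implementation.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette coefficient from a distance matrix and labels."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two groups are required")
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton groups
            continue
        # a: mean distance to the other members of the same group.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other group.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Invented toy distances: points 0/1 and 2/3 form two tight pairs.
toy_dist = [
    [0.0, 0.2, 0.8, 0.8],
    [0.2, 0.0, 0.8, 0.8],
    [0.8, 0.8, 0.0, 0.2],
    [0.8, 0.8, 0.2, 0.0],
]
good = mean_silhouette(toy_dist, ["A", "A", "B", "B"])
bad = mean_silhouette(toy_dist, ["A", "B", "A", "B"])
```

Labels that match the two tight pairs yield a clearly positive score, whereas labels that cut across them yield a negative one, which mirrors the interpretation of positive and negative values given above.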

Relation to Physicochemical Properties of the Ligands
In order to investigate whether the fingerprinting approach is a simple encoding of the physicochemical properties of the amino acids, the results were related to experimentally determined phase transfer free energies for the side chains of amino acids from water (∆G w>c) and vapor (∆G v>c) to cyclohexane 3,54. These energies are descriptors for the size and polarity of amino acid side chains and underlie both the rules of protein folding and the genetic code 55. The Spearman's rank correlation between the pairwise distances of aaRSs in the recognition space and in physicochemical property space is weak, with ρ=0.2564 and p<0.01 (see Supplementary Fig. S28). This indicates that the fingerprinting approach used in this study is a true high-dimensional representation of the complex binding mechanisms of amino acid recognition in aaRSs. This assumption is supported by a PCA (Supplementary Fig. S26) of the fingerprint data, where the first two principal components account for only 9.24% and 8.44% of the variance, respectively.
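
A rank correlation between two sets of pairwise distances can be sketched as below. The helper names and the toy numbers are hypothetical; the sketch computes Spearman's ρ as the Pearson correlation of average ranks and does not compute a p-value.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy distances in two spaces for the same four pairs.
recognition = [0.10, 0.40, 0.35, 0.80]
physicochemical = [1.0, 2.5, 3.0, 4.2]
rho = spearman(recognition, physicochemical)
```

In practice the two input lists would be the flattened upper triangles of the two distance matrices, taken in the same pair order, which is what `upper_triangle` provides.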

Discussion
The correct recognition of individual amino acids is a key determinant for evolutionary fitness of aaRSs and considered to be one of the major determinants for the closure of the genetic code 10. The results of this study emphasize the multitude of mechanisms that lead to the identification of the correct amino acid ligand in the binding sites of aaRSs. Based on available protein structure data, a thorough characterization of binding site features and interaction patterns allowed us to pinpoint the most important drivers for the correct mapping of the genetic code. The main findings of this analysis can be summarized as follows: (i) Class I and Class II aaRSs employ different overall strategies for amino acid recognition. (ii) Interaction patterns and binding site composition are the most important drivers to mediate specificity. However, very similar amino acids require additional selectivity through steric effects or editing mechanisms. (iii) The analysis of interaction fingerprints suggests that error-free recognition is a delicate task and a complex interplay between binding site composition, interaction patterns, editing mechanisms, and steric effects. The results point towards a gradual diversification of amino acid recognition and, hence, a gradual extension of the genetic code.

Class Duality Extends Possibilities
The aaRS class duality allowed the amino acid recognition space to broaden significantly. In general, the recognition of amino acids with low side chain complexity seems to be complemented by allosteric interactions and cannot be exclusively implemented by configuring side chains. Although the volumes of Class I and Class II binding sites differ significantly, they are probably not the major determinants for amino acid selectivity. In general, Class I aaRSs handle larger amino acids 3 and thus the binding site volume of Class I aaRSs is expected to match the volumes of their larger ligands. Nonetheless, binding site volume and geometry may act as additional layers of selectivity. An example is the pair of negatively charged amino acids glutamic acid and aspartic acid, handled by a Class I and a Class II aaRS, respectively. In this case, overall interactions are highly similar but binding geometry and binding site volume are significantly different. The two ligands are attacked from opposite sides 56, as highlighted by significantly different conformations (Fig. 4B). There is evidence that both amino acids were among the first to exist in the prebiotic context [57][58][59][60][61][62]. It is conceivable that the discrimination between glutamic and aspartic acid was based on superordinate secondary structure elements and size selectivity rather than on specific side chain interactions. This is supported by the observation that ancient proteins, based on a limited set of amino acids, were still capable of forming secondary structures [62][63][64]. One can only speculate whether a simultaneous emergence of two different aaRS classes and secondary structure formation made it possible to incorporate these early, but highly similar, amino acids into the genetic code. According to the biochemical pathway hypothesis 57, GluRS and AspRS might have been the first Class I and Class II representatives, with other aaRSs evolving from them 57,65,66.
However, the decreased usage of aspartic acid and the enrichment of glutamic acid in modern organisms, compared to the last universal common ancestor (LUCA), points in a different direction 67. According to these usage frequencies, aspartic acid was incorporated into the genetic code prior to glutamic acid. The same temporal order was derived from a consensus of various criteria for the order of amino acid appearance 68.
Glutamine and Asparagine Followed Glutamic Acid and Aspartic Acid Glutamine and asparagine are chemically closely related to glutamic and aspartic acid, respectively. It is likely that GlnRSs 6 and AsnRSs 7 co-evolved from GluRSs and AspRSs, respectively, through early gene duplication 15. Although the ligands of GluRS and GlnRS are rather similar, interaction patterns and binding site compositions differ between these two enzymes. Since no editing mechanisms are involved 69, the two amino acids have to be distinguished in the binding site itself, where glutamic acid is recognized by exploiting its negative charge 70,71. These differences are consistent with the analysis of the recognition space (Fig. 6), where GluRSs and GlnRSs are not neighbors in the embedding. In contrast, AspRS and AsnRS are direct neighbors and seem to share a similar recognition mechanism. However, the discrimination between aspartic acid and asparagine depends on a water molecule that forms a water-assisted hydrogen bonding network in the active site of AsnRS 72. The vicinity in the recognition space might be due to a limitation of the interaction data: co-crystallized water molecules were not available for the majority of the structures and were thus not considered during analysis.
Distinct Recognition of Arginine and Lysine Another interesting example is the pair of positively charged amino acids lysine and arginine. Interaction data suggest two unrelated ways to achieve ligand recognition in Class II LysRSs and Class I ArgRSs, i.e. the two enzymes are well separated in the embedding space. The poor editing capabilities of LysRS regarding arginine 73 might have required a good separation of the two recognition mechanisms. Although a relation of ArgRSs to aaRSs of hydrophobic amino acids was proposed 74, a separate subclass grouping for ArgRSs 15 seems reasonable and is in accordance with the observed data; the recognition mechanism differs substantially from that of the hydrophobic amino acids.

Glycine Recognition is not Interaction-Driven
Based on interaction data, the recognition of the smallest amino acid glycine seems to be rather nonspecific; a large spread in the embedding space can be observed for individual protein-ligand complexes of GlyRS. This is to be expected, as GlyRS is known to maintain its specificity not through interactions with glycine, which has no side chain to interact with, but rather through an active site geometry that blocks larger amino acids 10,75.
Alanine Recognition is Crucial Alanine is the second smallest amino acid with only a single heavy side chain atom. The idiosyncratic architecture of AlaRS is different from other Class II aaRSs 76. Still, the confusion with glycine and serine 47, or with non-proteinogenic amino acids 8, poses a challenge for the correct recognition of alanine, and a loss of specificity is associated with severe disease outcomes 77. The recognition mechanism in AlaRSs seems to differ substantially from other Class II aaRSs (see Fig. 6), which suggests evolutionary pressure to develop a unique recognition mechanism.

Discrimination of Hydrophobic Amino Acids Requires Editing
The hydrophobic amino acids isoleucine, leucine, valine, and methionine are considered to have entered the genetic code at the same time 20,58,74 . The highly similar interaction patterns for IleRS, ValRS, and MetRS substantiate this assumption. Due to their difficult discrimination, editing functionality is key 5,48,69,78,79 for these aaRSs.

Tryptophan Recognition Suggests Late Addition to the Genetic Code The emergence of TrpRSs and TyrRSs is considered to have happened at a later stage of evolution. The two aaRSs are likely to be of common origin 37 and constitute their own subclass, which is supported by sequence and structure studies 15,18,19,80,81. PheRS supposedly evolved from the same precursor as TrpRS and TyrRS 21. In general, TrpRSs and TyrRSs separate well from other aaRSs in the recognition space, which is likely due to the unique utilization of π-stacking interactions with binding site residues. Besides specific interactions in the binding site, allosteric effects and interdomain cooperativity 82,83 are drivers for TrpRS specificity. Furthermore, mutations in the dimerization interface of TrpRSs were shown to reduce specificity 84. Remarkably, two distinct ways of recognition are apparent for TrpRSs in bacteria and eukaryotes. These differences support the previously described separation of eukaryotic TrpRSs and TyrRSs from their prokaryotic counterparts 85 and the late addition of these amino acids to the genetic code 86. However, structures from archaea do not follow this pattern and feature both recognition variants.

Conclusion
Understanding the complex evolutionary history of aaRSs and their inherent relation to the origin of the genetic code still poses a scientific challenge. Thorough sequence 22 and structure analyses 15 of aaRSs are major stepping stones towards unraveling the formation of the genetic code. In this study, structures of aaRSs co-crystallized with their amino acid ligands were used in order to describe the mechanisms of specific amino acid recognition both qualitatively and quantitatively. Specific amino acid binding is vital for the function of aaRSs and thus more conserved than global structure and sequence 87 . Consequently, the study of non-covalent binding site interactions and geometric characteristics is essential in order to understand enzymatic function and evolution 88 . For the first time, the characteristics of amino acid binding were described for 22 different aaRSs across all kingdoms of life. The carefully distilled information about important residues and interactions in the binding sites of aaRSs can serve as a valuable resource for future studies such as engineering aaRSs in order to extend the genetic code with non-natural amino acids 89,90 . Additionally, knowledge about specificity-conferring interaction patterns might be exploited in order to develop drugs that inhibit aminoacylation in pathogenic species 91 or to understand the functional consequences of disease-causing mutations 92,93 .

Data Acquisition
The dataset from our last study 38 served as the basis for all analyses. As all structures in the dataset are annotated with ligand information, only entries containing ligands relevant for amino acid recognition were considered, i.e. ligands that bind to the specificity-conferring moiety of the binding site (see Fig. 1). A protein chain of an entry was considered if it: (i) comprises a catalytic aaRS domain, (ii) contains a co-crystallized specificity-relevant ligand in the active site, and (iii) has a ligand that contains an amino acid substructure. Filtering of the data resulted in 189 (235) structures for Class I (Class II) aaRSs that contain ligands with relevance for specificity. The number of structures with respect to the pre- or post-activation state of the catalyzed reaction is shown in Supplementary Fig. S3. Furthermore, sequences of the dataset entries were clustered using single-linkage clustering with a sequence identity cutoff of 95% according to a global Needleman-Wunsch 94 alignment with the BLOSUM62 substitution matrix, computed with BioJava 95. Representative chains for each cluster were selected, preferring wild-type structures and structures with low resolution values (i.e. high resolution). In total, 47 (54) protein chains were selected as representatives for Class I (Class II) aaRSs. The dataset covers structures of all known aaRSs from organisms across all kingdoms of life (Supplementary Fig. S1).
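The clustering step can be illustrated with a minimal sketch. It is not the BioJava-based procedure used in the study: the pairwise identities are assumed to be precomputed, the `identity` callable and the toy values are hypothetical, and treating the 95% cutoff as inclusive (>=) is an assumption.

```python
from itertools import combinations


def single_linkage_clusters(ids, identity, cutoff=0.95):
    """Single-linkage clustering: chains joined by any pair at or above the cutoff.

    `identity` is a hypothetical callable returning the pairwise sequence
    identity (0..1) of two chain ids. In the study this value comes from a
    global alignment, which is not reimplemented here.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if identity(a, b) >= cutoff:  # inclusive cutoff is an assumption
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Made-up identities for four chains (not taken from the dataset).
toy = {("A", "B"): 0.97, ("B", "C"): 0.96, ("A", "C"): 0.90,
       ("A", "D"): 0.40, ("B", "D"): 0.35, ("C", "D"): 0.42}


def lookup(a, b):
    return toy.get((a, b), toy.get((b, a)))


clusters = single_linkage_clusters(["A", "B", "C", "D"], lookup)
```

In this toy case chains A and C share only 90% identity but still end up in one cluster, because both are linked to B. This chaining behavior is characteristic of single-linkage clustering.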

Mapping of Sequence Positions
Amino acid sequences were derived from the set of representative structures of the respective aaRS. To allow a unified mapping of sequence positions, a multiple sequence alignment (MSA) was computed for each aaRS using the T-Coffee 96 Expresso pipeline. The quality of each MSA in the specificity-conferring region of the binding site was assessed regarding the correct mapping of the Backbone Brackets and Arginine Tweezers structural motifs 38, and the conservation of the respective sequence signature motifs 4,22. All MSAs preserved the considered regions and passed the quality checks. Supplementary File S1 contains all MSAs in FASTA format. The sequence positions for each aaRS were then unified according to the resulting MSA in order to investigate conserved interaction patterns. For this purpose, the custom script "MSA PDB Renumber", available under the open-source MIT license at github.com/vjhaupt, was used. Supplementary File S2 contains tables from which the original sequence positions can be inferred for each structure in the dataset.
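The idea behind the renumbering can be sketched as follows. This is an illustration only and not the "MSA PDB Renumber" script: the function name and the two aligned rows are hypothetical, and residues are assumed to be numbered consecutively, which real PDB entries (insertion codes, unresolved residues) do not always satisfy.

```python
def msa_position_map(aligned_seq, first_resnum=1, gap="-"):
    """Map original residue numbers to 1-based alignment columns.

    Assumes consecutive residue numbering starting at `first_resnum`.
    """
    mapping = {}
    resnum = first_resnum
    for column, symbol in enumerate(aligned_seq, start=1):
        if symbol != gap:
            mapping[resnum] = column  # residue occupies this alignment column
            resnum += 1
    return mapping


# Two made-up aligned rows of equal length.
row_a = "MK-LVE"
row_b = "MKTL-E"
map_a = msa_position_map(row_a)
map_b = msa_position_map(row_b)
```

Here the leucine is residue 3 in the first sequence and residue 4 in the second, yet both map to alignment column 4, so interactions with these residues become comparable across structures.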

Annotation of Non-Covalent Protein-Ligand Interactions
Non-covalent protein-ligand interactions were annotated for all entries in the dataset that contained a valid ligand using PLIP v1.3.3 41 with default parameters.

Determination of Interactions Relevant for Specificity
Only interactions formed between the amino acid substructure of the ligand and binding site residues were considered for analysis. For this purpose subgraph isomorphism detection with the RI algorithm 44 was applied. The RI implementation of the SiNGA framework v0.5.0 97 was used. Each amino acid scaffold was represented by a graph created from the amino acid's SMILES string taken from PubChem 98 . The full amino acid graph was modified using MolView v2.4 (available at molview.org) in order to remove the terminal hydroxyl group, which is cleaved during the enzymatic reaction and must thus be ignored for subgraph matching. For each dataset entry that contained a valid ligand, the corresponding amino acid graph was matched against the ligand in order to identify the atoms involved in the formation of specificity-conferring interactions. A depiction of the workflow to determine specificity-conferring interactions is given in Fig. 8.
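To make the matching step concrete, the following sketch locates an amino acid scaffold inside a larger ligand graph. It uses a naive backtracking search rather than the RI algorithm or the SiNGA implementation, compares element symbols only (bond orders are ignored), and all atom identifiers and the toy ligand are hypothetical.

```python
def find_substructure(pattern_atoms, pattern_bonds, target_atoms, target_bonds):
    """Return one mapping {pattern atom: target atom}, or None if no match exists.

    Atoms are dicts {atom_id: element}; bonds are lists of (id, id) pairs.
    """
    p_adj = {a: set() for a in pattern_atoms}
    for u, v in pattern_bonds:
        p_adj[u].add(v)
        p_adj[v].add(u)
    t_adj = {a: set() for a in target_atoms}
    for u, v in target_bonds:
        t_adj[u].add(v)
        t_adj[v].add(u)

    order = list(pattern_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        p = order[len(mapping)]
        for t in target_atoms:
            if t in mapping.values() or target_atoms[t] != pattern_atoms[p]:
                continue
            # Each already mapped pattern neighbor must also be bonded in the target.
            if all(mapping[q] in t_adj[t] for q in p_adj[p] if q in mapping):
                mapping[p] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[p]  # undo and try the next candidate
        return None

    return extend({})


# Glycine scaffold without the terminal hydroxyl group: N-CA-C(=O).
scaffold_atoms = {"n": "N", "ca": "C", "c": "C", "o": "O"}
scaffold_bonds = [("n", "ca"), ("ca", "c"), ("c", "o")]

# Toy ligand resembling the acyl phosphate part of an adenylate: P-O-C(=O)-C-N.
ligand_atoms = {1: "P", 2: "O", 3: "C", 4: "O", 5: "C", 6: "N"}
ligand_bonds = [(1, 2), (2, 3), (3, 4), (3, 5), (5, 6)]

match = find_substructure(scaffold_atoms, scaffold_bonds, ligand_atoms, ligand_bonds)
```

Because bond orders are ignored in this sketch, the scaffold oxygen can map to either oxygen attached to the carbonyl carbon. A production implementation would need to disambiguate such cases.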

Generation of Interaction Fingerprints
To allow for a quantitative comparison of recognition mechanisms, each protein-ligand complex was represented by a structure-invariant binary interaction fingerprint (see for example the paper of Salentin et al. 40 about the idea of interaction fingerprinting). Different fingerprint designs were chosen for comparison: a simple 20-dimensional fingerprint on binding site composition and a 500-dimensional fingerprint based on binding site composition and interaction information. The latter was further enriched with editing and binding site volume information.

Simple Binding Site Based Fingerprints
Binary and structure-invariant fingerprints that represent binding site compositions (used as baseline for the comparison of different fingerprint designs, Fig. 7) were constructed as follows. Each residue predicted to be in contact with any specificity-relevant atom of the ligand was considered for fingerprint generation. A 20-dimensional binary vector was used to represent the occurrence of individual residue types in the binding site. For each of the interacting residues the corresponding bit was set to active. Hence, multiple occurrences of the same residue type were not taken into account.
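A minimal sketch of such a composition fingerprint is given below. The alphabetical bit order and the example contact list are assumptions made for illustration; the text does not specify the order used in the study.

```python
# Bit order is an assumption: three-letter codes of the 20 standard residues,
# sorted alphabetically.
AMINO_ACIDS = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
               "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]


def composition_fingerprint(contact_residues):
    """One bit per residue type; repeated residue types set the same bit once."""
    present = set(contact_residues)
    return [1 if aa in present else 0 for aa in AMINO_ACIDS]


# Made-up binding site with two tyrosines in contact with the ligand.
fp = composition_fingerprint(["TYR", "ASP", "TYR", "GLN"])
```

The two tyrosines activate a single bit, which reflects that residue counts are deliberately discarded in this baseline design.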

Binding Site and Interaction-Based Fingerprints
Single three-dimensional vectors of non-covalent interactions were encoded into a binary vector by considering the type of interaction, the interacting group in the ligand and the interacting amino acid residue. One such feature could be a hydrogen bond between an oxygen atom in the ligand and tyrosine in the protein. Each of these features is hashed to a number between 1 and 500 so that the resulting fingerprint has 500 bits.
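The hashing of interaction features into a fixed-length bit vector can be sketched as follows. The text does not name the hash function, so SHA-256 is an assumption, as are the feature labels and the `|` separator. Bit indices run from 0 to 499 here instead of 1 to 500.

```python
import hashlib

N_BITS = 500


def feature_index(interaction_type, ligand_group, residue, n_bits=N_BITS):
    """Deterministically hash one interaction feature to a bit index (0-based)."""
    key = f"{interaction_type}|{ligand_group}|{residue}".encode("utf-8")
    digest = hashlib.sha256(key).digest()  # hash choice is an assumption
    return int.from_bytes(digest[:8], "big") % n_bits


def interaction_fingerprint(features, n_bits=N_BITS):
    """Set one bit per feature; identical features (and hash collisions) share a bit."""
    bits = [0] * n_bits
    for feature in features:
        bits[feature_index(*feature, n_bits=n_bits)] = 1
    return bits


# Hypothetical features: (interaction type, ligand group, residue type).
features = [("hbond", "O", "TYR"),
            ("saltbridge", "carboxylate", "ARG"),
            ("hbond", "O", "TYR")]
fp = interaction_fingerprint(features)
```

A cryptographic hash from `hashlib` is used instead of Python's built-in `hash`, because the latter is salted per process for strings and would not give reproducible fingerprints across runs.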

Encoding of Editing Mechanisms and Binding Site Volume
Information about the editing mechanisms performed by some aaRSs was taken from the paper of Perona and Gruic-Sovulj 45 and encoded by appending a 22-dimensional bit vector to the 500-dimensional fingerprint. Each active bit represents a ligand against which editing is performed, e.g. for structures of ThrRS the bit for serine is set. In addition to editing information, the binding site volume, estimated with the POVME 50 algorithm, was encoded. Twelve bins were created that represent binding site volumes ranging from 30 to 270 Å³ in steps of 20 Å³. For example, if a structure has a binding site volume of 45 Å³, the first bit was set to active. For a binding site volume of, e.g., 52 Å³, the second bit was set to active, and so on. The fingerprints were concatenated to contain the binding site and interaction features (500 bits), editing mechanisms (22 bits), and binding site volume (12 bits). The final fingerprint has a size of 534 bits.
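The binning and concatenation described above can be sketched as follows. The handling of bin boundaries (lower edge inclusive) and of volumes outside 30 to 270 Å³ (no bit set) is an assumption, and the last two entries of the ligand list are placeholders because the 22 ligands are not enumerated in this section.

```python
def volume_bits(volume, low=30.0, high=270.0, step=20.0):
    """One-hot encode a binding site volume (in cubic angstrom) into 12 bins."""
    n_bins = int((high - low) / step)
    bits = [0] * n_bins
    if low <= volume < high:  # out-of-range volumes set no bit (assumption)
        bits[int((volume - low) // step)] = 1
    return bits


def editing_bits(edited_ligands, ligand_order):
    """Set one bit for each ligand against which editing is performed."""
    edited = set(edited_ligands)
    return [1 if ligand in edited else 0 for ligand in ligand_order]


# 20 standard amino acids plus two placeholders (hypothetical names).
LIGANDS = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
           "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
           "PLACEHOLDER_21", "PLACEHOLDER_22"]

interaction_part = [0] * 500                   # stand-in for the 500-bit fingerprint
editing_part = editing_bits(["SER"], LIGANDS)  # e.g. ThrRS edits against serine
volume_part = volume_bits(52.0)                # falls into the second bin
fingerprint = interaction_part + editing_part + volume_part
```

With these inputs the concatenated vector has 500 + 22 + 12 = 534 bits, matching the final fingerprint size stated in the text.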

Embedding of Interaction Fingerprints
To allow for a quantitative comparison of the interactions between individual aaRSs, the high-dimensional interaction fingerprints were embedded using UMAP version 0.3.2 52. The parameters for all embeddings given in this manuscript were set as follows: n_neighbors = 60, min_dist = 0.1, n_components = 2. The Jaccard distance was used to describe the dissimilarity between two fingerprints a and b:

$$d(a, b) = 1 - \frac{n_{a \wedge b}}{n_a + n_b - n_{a \wedge b}} \qquad (1)$$

with $n_{a \wedge b}$ being the count of active bits common to fingerprints a and b, $n_a$ the number of active bits in fingerprint a, and $n_b$ the number of active bits in fingerprint b. This distance metric was used as input for UMAP.

Table 2. Overview of specificity-conferring recognition mechanisms for all aaRSs grouped by aaRS class and subclass 15.
Only interactions with side chain atoms of the amino acid ligand were included in this summary. HB is hydrogen bond, SB is salt bridge, HP is hydrophobic, MC is metal complex, and PS is π-stacking interaction. Correspondences between interactions and residues are indicated by superscript letters.

A key difference in ligand binding can be observed for a residue that binds the amino group of the indole ring. In human TrpRSs (PDB:1r6u chain A) a hydrogen bond with tyrosine is formed, while Geobacillus stearothermophilus (PDB:1i6l chain A) employs aspartic acid.

Figure S28. Phase transfer free energies of amino acid side chains 3,54 from water (∆G w>c) and vapor (∆G v>c) to cyclohexane compared to the recognition space analysis from this study. Each data point represents the Euclidean distance between every combination of two amino acids in the phase transfer diagram given in Carter and Wills 55 against the Euclidean distance in the recognition space proposed in this study. Spearman's rank correlation is ρ = 0.2564 with p < 0.01.

File S1. multiple_sequence_alignments.zip An archive file containing the MSA files of representative structures for each aaRS that were used for consistent renumbering. The alignments were computed with the T-Coffee Expresso 96 pipeline and are stored in FASTA format.
File S2. renumbering_tables.zip An archive file containing Excel tables to infer original sequence positions from renumbered positions for each aaRS. Rows are renumbered positions, columns are sequence positions of individual structures.