Process calculi may reveal the equivalence lying at the heart of RNA and proteins

The successful use of process calculi to specify behavioural models allows us to compare RNA and protein folding processes from a new perspective. We model the folding processes as behaviours resulting from the interactions that nucleotides and amino acids (the elementary units that compose RNAs and proteins respectively) perform on their linear sequences. This approach is intended to provide new knowledge about the studied systems without strictly relying on empirical data. By applying Milner’s CCS process algebra to highlight the distinguishing features of the two folding processes, we discovered an abstraction level at which they show behavioural equivalences. We believe that this result could be interpreted as a clue in favour of the highly-debated RNA World theory, according to which, in the early stages of cell evolution, RNA molecules played most of the functional and structural roles carried out today by proteins.


Symbols and their transliteration
The following tables explain the symbols used to describe processes and actions of the proposed models; the transliterations of process names are necessary to construct the LTS representations as well as to perform the model checking and the bisimulation games through the automated tool CAAL (Concurrency Workbench, Aalborg Edition).

Models Construction
In the models of the folding process that we have defined, the weak interactions are classified in three main categories:
• hydrogen bonds;
• electrostatic interactions (ionic and van der Waals);
• hydrophobic interactions.
The hydrogen bond could be defined as an electrostatic interaction, but due to its distinctive properties and the fundamental role it carries out in the folding process, it has been represented separately.
All the weak interactions listed above have been modelled to formally describe the whole folding process. Each folding process always starts from a linear sequence (of nucleotides in RNAs and of amino acids in proteins) and is driven by the reduction in free energy between two different folded configurations.
To clarify this concept, we can imagine the folding process as a sequence of folding steps, each contributing to the entire process with a new weak interaction between two units of the sequence (equally for RNAs and proteins). In order for a folding step to take place, the weak interaction must cause a reduction in the free energy of the system, which means that the folding step must have a negative ∆G. The ∆G variation during folding is represented as a process that can produce three possible outputs: a negative, a positive or a zero ∆G.
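The gating role of ∆G can be illustrated with a minimal Python sketch. It is not part of the CCS models: the numeric classification, the tolerance `tol`, the function names and the labels `pdg` and `zdg` (positive and zero ∆G) are assumptions introduced here for illustration; only `ndg` (negative ∆G) is a label used in the models.

```python
def delta_g_label(delta_g, tol=1e-9):
    """Classify a free-energy change into one of the three possible outputs.

    'ndg' is the label used in the models; 'pdg' and 'zdg' are hypothetical.
    """
    if delta_g < -tol:
        return "ndg"  # negative: the folding step can take place
    if delta_g > tol:
        return "pdg"  # positive: the folding step is not allowed
    return "zdg"      # zero: no reduction in free energy, step not allowed


def folding_step_allowed(delta_g):
    """A folding step takes place only if its delta G is negative."""
    return delta_g_label(delta_g) == "ndg"
```

In the models themselves the outcome is a nondeterministic choice among the three outputs rather than a computed value; the sketch only shows which outcome enables a step.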

Base pairing
In RNA, hydrogen bonds allow the pairing between two bases. According to Watson-Crick base pairing, adenine (A) always pairs with uracil (U) with two hydrogen bonds, while guanine (G) always pairs with cytosine (C) with three hydrogen bonds. At the same time, the non-conventional base pairing shows various combinations of the four RNA bases, forming two hydrogen bonds (or even only one); it is not infrequent to find in RNA also a triple base pairing (indeed, it is possible that a unique base quartet forms between G-C base pairs at the junction of two helices).
The hydrogen bond formation (in both Watson-Crick and wobble base pairs) has been modelled by generalising that process as an interaction between a purine (adenine or guanine, labelled dr since they are double-ring bases) and a pyrimidine (uracil or cytosine, single-ring bases and hence labelled sr), or between two paired bases and a third base (also in this case, a generic purine or pyrimidine). The base pairing is symmetric, thus: srdr = drsr.
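The ring-based labelling and its symmetry can be sketched in a few lines of Python. The dictionary and function names are hypothetical; only the labels sr, dr and the resulting pair names come from the model.

```python
# Purines are double-ring (dr) bases, pyrimidines are single-ring (sr) bases.
RING_CLASS = {"A": "dr", "G": "dr", "U": "sr", "C": "sr"}


def pair_label(base1, base2):
    """Return the symmetric label of a base pair: srsr, drdr or srdr."""
    # Sorting in reverse order puts 'sr' before 'dr', so that both
    # (purine, pyrimidine) and (pyrimidine, purine) are labelled 'srdr'.
    classes = sorted((RING_CLASS[base1], RING_CLASS[base2]), reverse=True)
    return "".join(classes)
```

Because the two classes are sorted before being joined, `pair_label("G", "C")` and `pair_label("C", "G")` yield the same label, which mirrors the identity srdr = drsr.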
To remove some details not necessary for the aim of the model, a further generalisation has been adopted: the possible interactions between a couple of paired bases and a third base are not represented explicitly, but are indicated collectively as a "triple base-pairing" process (P b3 ) whose output is "three paired bases" (tpb).
For the same reason the formation of the G-C base quartet is not treated in the model. Regarding the number of hydrogen bonds allowed in a base pair, in our models they must be at least two and at most three; the number of hydrogen bonds that link an unpaired base to a group of two already paired bases must be from one to three. It has been decided to limit the minimum number of hydrogen bonds in a base pair (to the number of two) because base pairs with a single hydrogen bond can be classified as a variant of the primary types and because the whole number of hydrogen bonds found in a base triplet is at least three.
Moreover, because up to now the only known base pair that involves three hydrogen bonds is the one between cytosine and guanine, only the srdr base pair is allowed in the model to form a triple hydrogen bond. Since srdr stands for any purine-pyrimidine pair, this means that in the model AU, GU and CA base pairs could also potentially form a triple hydrogen bond, which is a stretch of the current knowledge on hydrogen bonding. Since this property is important for the stability of RNA molecules, we want to better justify the proposed abstraction: to capture the constraint that limits the formation of three hydrogen bonds to the GC base pair alone, we would have to represent explicitly every base and every combination of bases instead of using the convention adopted; this would reduce the readability of our models in order to capture a property that does not affect the main purpose for which they were created.
The Base Pairing process (P b2 ) takes two unpaired bases (ub) as input and provides as output the two bases paired only if it can form at least two hydrogen bonds (hb) between them. P b2 is a sub-process of a general F s rna (RNA Folding Step) process, from which it receives its input (the F s rna process will be described later in this section); it is one of the possible sub-processes that give to each folding step its specificity. As explained in the article, each folding step, and therefore each base pairing process, is conditioned by the value of the ∆G: it can take place only if its ∆G is negative.
The Triple Base pairing process (P b3 ) takes as input a couple of bases, paired by the P b2 process, and a third unpaired base (ub) and provides as output a group of three paired bases (tpb). The number of hydrogen bonds that can be generated in this process is at least one and at most three.
Like the P b2 process, P b3 is a sub-process of F s rna and depends on the value of the ∆G (the output of the ∆ G F s process) to take place.

The following is the specification of the P b2 and the P b3 processes using Milner's CCS (in Subsection 2.4 on page 8 they will be contextualised in the complete description of the F s rna process):
B1 b2 , B2 b2 , B3 b2 (base hydrogen bond) and B1 b3 , B2 b3 , B3 b3 (three bases hydrogen bond) are states that allow counting the number of hydrogen bonds.
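The hydrogen-bond counting performed by the base pairing process can be sketched as a small labelled transition system in Python. The transition table is an assumption reconstructed from the prose alone (at least two and at most three hydrogen bonds, the third only for srdr); it is not the authors' CCS specification, and the state names `Pb2`, `B1b2`, `B2b2`, `B3b2` and `Done` are hypothetical transliterations.

```python
# Hypothetical LTS: each state maps to a list of (action, next_state) pairs.
PB2_LTS = {
    "Pb2":  [("hb", "B1b2")],                      # first hydrogen bond
    "B1b2": [("hb", "B2b2")],                      # second hydrogen bond
    "B2b2": [("srsr", "Done"), ("drdr", "Done"),   # a pair with two bonds...
             ("srdr", "Done"), ("hb", "B3b2")],    # ...or a third bond
    "B3b2": [("srdr", "Done")],                    # three bonds: srdr only
    "Done": [],
}


def complete_traces(lts, state):
    """Return every action sequence from `state` to a state without moves."""
    if not lts[state]:
        return [[]]
    return [[action] + rest
            for action, nxt in lts[state]
            for rest in complete_traces(lts, nxt)]


traces = complete_traces(PB2_LTS, "Pb2")
```

Under these assumptions every complete trace contains at least two `hb` actions before a base pair is produced, and the only trace with three `hb` actions ends in `srdr`.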
In proteins, a hydrogen bond can form between the amino group of one amino acid and the carboxyl group of another. Every amino acid has an amino group and a carboxyl group covalently linked to the alpha (central) carbon. In the rest of this document, the terms "amino groups" and "carboxyl groups" will refer specifically to such functional groups. In contrast with the base pairing of nucleotides, only a single hydrogen bond is allowed between two amino acids; however, there is no limitation on the length of a sequence of amino acids linked to one another via hydrogen bonds.
Therefore, two amino acids can hydrogen bond to each other only if they meet the following conditions:
• the interaction has a negative ∆G;
• the amino group of one of the two interacting amino acids and the carboxyl group of the other are both free (not involved in a hydrogen bond).
The Amino Acids Pairing process (P aa ) is a subprocess of the general F s p (protein folding step), as P b2 is a subprocess of F s rna . P aa takes two amino acids (aa) as input and makes a hydrogen bond between the free amino group of the first one (aa1fnh) and the free carboxyl group of the second one (aa2fco), or between the free carboxyl group of the first amino acid (aa1fco) and the free amino group of the second one (aa2fnh); it then provides a group of two paired amino acids (paa, paired amino acids) as output.
It is important to notice that:
1. although the distinction between "first" and "second" amino acid might appear unnecessary when they are both unpaired, it has to be specified to deal with the situation in which at least one of the two amino acids is already involved in a hydrogen bond through one of its functional groups;
2. when the P aa process receives two amino acids as input, we have the certainty that a hydrogen bond will form, because the negative ∆G of the interaction has already been checked in the early phases of the F s p process.

The following is the CCS specification of the P aa process:
NH aax and CO aax (where x is 1 or 2) are states that indicate the selection of the free amino group or of the free carboxyl group (respectively) of the x-th amino acid.
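The choice between the two complementary ways of pairing amino acids can be illustrated with a short Python sketch. It is a data-structure view of the conditions stated in the text, not the CCS specification; the class `AminoAcid` and the functions `pairing_options` and `pair` are hypothetical, while the labels aa1fnh, aa2fco, aa1fco, aa2fnh and paa are those of the model.

```python
from dataclasses import dataclass


@dataclass
class AminoAcid:
    free_nh: bool = True  # amino group not involved in a hydrogen bond
    free_co: bool = True  # carboxyl group not involved in a hydrogen bond


def pairing_options(aa1, aa2):
    """List the hydrogen bonds that the two amino acids can still form."""
    options = []
    if aa1.free_nh and aa2.free_co:
        options.append(("aa1fnh", "aa2fco"))
    if aa1.free_co and aa2.free_nh:
        options.append(("aa1fco", "aa2fnh"))
    return options


def pair(aa1, aa2, option):
    """Form the chosen hydrogen bond and return the output label 'paa'."""
    if option not in pairing_options(aa1, aa2):
        raise ValueError("the required functional groups are not both free")
    if option == ("aa1fnh", "aa2fco"):
        aa1.free_nh, aa2.free_co = False, False
    else:
        aa1.free_co, aa2.free_nh = False, False
    return "paa"
```

The sketch shows why "first" and "second" must be distinguished: once one functional group of an amino acid is occupied, only one of the two options may remain available.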

Electrostatic interactions
Two electrically charged particles can interact according to Coulomb's law. The model of the folding process does not investigate the interactions at the atomic level, therefore the details of this law will not be covered. What we need to know is that two elementary units of either an RNA or a protein sequence can interact in a folding step if they are both charged and if the ∆G of such interaction is negative. The main purpose of this kind of interaction is to stabilise the folded structure reached through the previous steps.
The electrostatic interaction can be of two types: ionic and van der Waals. The ionic interactions cause the formation of a weak bond between two ions of opposite charge; the van der Waals interactions occur between two oppositely polarised molecules.
The modelling of these interactions is basically the same in both RNA and protein folding: given as input a couple of bases (in the RNA model) or amino acids (in the protein model), each unpaired or already paired, the electrostatic interaction process allows the nondeterministic choice between an ionic interaction (ii) and a van der Waals interaction (vdwi), which are produced as output.
The Bases Electrostatic Interaction process (I e b ) specifies the electrostatic interactions in the RNA folding model:
The Amino Acids Electrostatic Interaction process (I e aa ) specifies the electrostatic interactions in the protein folding model:
I e b is a subprocess of F s rna ; I e aa is a subprocess of F s p .

Hydrophobic interactions
Water is a polar solvent; this means that it easily dissolves charged or polar compounds, which are called, for this reason, hydrophilic (from Greek, "water-loving"). In contrast, nonpolar molecules are hydrophobic.
In RNA, the purine and pyrimidine bases are hydrophobic and relatively insoluble in water, while the backbone of alternating ribose and phosphate groups is hydrophilic.
To minimize contact of the bases with water and stabilize the three-dimensional structure of the RNA, during the folding process the backbone is placed on the outside of the molecule, facing the surrounding water, while the bases are positioned inside, stacked with the planes of their rings parallel to each other (a process called hydrophobic stacking interaction).

In the RNA folding model, the Bases Hydrophobic Interaction process (I h b ) takes two bases as input, produces a hydrophobic interaction for both of them (hbi) and provides as output the same bases buried inside the RNA (bb) and stacked on each other (sb).
Since I h b is a subprocess of F s rna , the fact that the ∆G of the interaction is negative has already been checked in the earlier phases of the latter process.
The CCS specification of the I h b process is:
In proteins, the specific characteristics of an amino acid are determined by the properties of its R group; the polarity of that group varies widely, from non-polar and hydrophobic to highly polar and hydrophilic. Hydrophobic amino acid side chains tend to be clustered in the protein's interior, away from water, while hydrophilic side chains remain on the protein surface. Folding of a polypeptide chain thus creates an "inside" and an "outside" and generates buried and exposed amino acid side chains. The interior of a protein is generally a densely packed core of hydrophobic amino acid side chains.
The hydrophobic interactions in proteins do not exhibit the stacking phenomenon; therefore, the Amino Acids Hydrophobic Interaction process (I h aa ) takes only one amino acid as input. Then, if the amino acid side chain is hydrophilic (hlsc), it is exposed outside the protein (esc); if the side chain is hydrophobic (hbsc), it is buried inside the protein (bsc).
The inside and the outside of the protein are identified by the states I p and O p respectively. I h aa is a subprocess of F s p . The following is the CCS specification of the I h aa process:

Folding step
Now that we have described the model of each weak interaction in both RNA and protein, it is possible to contextualise these models in the folding step they belong to (F s rna or F s p ). Each step represents an iteration which allows the nondeterministic choice of one of the possible weak interaction subprocesses.
F s rna and F s p ensure that each subprocess complies with the specific restrictions on its input (according to the descriptions made above in this section) and that the interaction has a negative ∆G (and hence can be carried out).

The CCS specification of the whole F s rna process is the following:
F s rna def = ub.I1 n + ub.I2 n + srsr.I1 n + drdr.I1 n + srdr.I1 n + tpb.I1 n ;
I1 n and I2 n (nucleotide interaction) are states that allow the selection of the right subprocess on the basis of its permitted inputs.
∆ G P b2 (base pairing delta G), ∆ G P b3 (triple base pairing delta G), ∆ G I e b (bases electrostatic interaction delta G) and ∆ G I h b (bases hydrophobic interaction delta G) processes check that the ∆G of the related interaction is negative.

The CCS specification of the whole F s p (Protein folding step) process is:
I1 aa is a state that allows the selection of the subprocesses that take two amino acids as input. ∆ G P aa (amino acids pairing delta G), ∆ G I e aa (amino acids electrostatic interaction delta G) and ∆ G I h aa (amino acids hydrophobic interaction delta G) processes check that the ∆G of the related interaction is negative.

RNA folding and protein folding
In order to meet the requirement that each interaction must have a negative ∆G, both the F s rna and F s p processes are placed in parallel composition with the ∆ G F s (folding step delta G) process, defining in this way the overall folding process (F rna and F p respectively).
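How a parallel composition can enforce the negative-∆G requirement may be sketched in Python as follows. This is only an illustration of CCS synchronisation under stated assumptions: the two transition tables, the restriction on the ∆G labels, the labels `pdg` and `zdg`, and all state names are hypothetical. Output actions are written with a leading apostrophe, as in CAAL's input syntax.

```python
def co(action):
    """Return the complementary action: a <-> 'a."""
    return action[1:] if action.startswith("'") else "'" + action


def step(lts1, lts2, s1, s2, restricted):
    """Transitions of (s1 | s2) \\ restricted as (label, (s1', s2')) pairs."""
    moves = []
    # Either component moves alone, unless the action is restricted.
    for a, t1 in lts1.get(s1, []):
        if a.lstrip("'") not in restricted:
            moves.append((a, (t1, s2)))
    for b, t2 in lts2.get(s2, []):
        if b.lstrip("'") not in restricted:
            moves.append((b, (s1, t2)))
    # Complementary actions synchronise into an internal tau move.
    for a, t1 in lts1.get(s1, []):
        for b, t2 in lts2.get(s2, []):
            if b == co(a):
                moves.append(("tau", (t1, t2)))
    return moves


# Assumed fragment of a folding step: the pairing waits for a negative delta G.
FOLDING_STEP = {"DGPb2": [("ndg", "Pb2")], "Pb2": [("hb", "B1b2")]}
# Assumed delta G process: it can output any of the three values, repeatedly.
DELTA_G = {"DGFs": [("'ndg", "DGFs"), ("'pdg", "DGFs"), ("'zdg", "DGFs")]}
RESTRICTED = {"ndg", "pdg", "zdg"}

moves = step(FOLDING_STEP, DELTA_G, "DGPb2", "DGFs", RESTRICTED)
```

Under these assumptions the only move of the composed system is the synchronisation on `ndg`, so a positive or zero ∆G can never enable the pairing.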

Model checking
It is possible to verify that the biochemical properties of the folding processes are satisfied by the above-described model. We propose here four examples, expressing some properties as HML formulas and establishing whether they are satisfied by means of model checking.

1. two unpaired bases (ub) can form a hydrogen bond (hb) if the ∆G of the interaction is negative (ndg): F s rna ⊨ ⟨ub⟩⟨ub⟩⟨ndg⟩⟨hb⟩tt;
2. with a single hydrogen bond it is not possible to form a base pair (srsr, drdr, srdr):
3. it is possible to form a group of three paired bases (tpb) with only a single hydrogen bond (between an unpaired base and a group of two already paired bases, srsr in this case); obviously, the ∆G of the interaction must be negative:
4. if an amino acid has a hydrophobic side chain (hbsc), it has to be buried inside (bsc) and not exposed outside (esc) the protein:
The verification that these formulas are satisfied was made with the aid of the model checking function of the web-based tool CAAL. The results are shown in Figure 1.
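To make the idea of checking an HML formula against an LTS concrete, the following Python sketch evaluates formulas built from tt, ff, conjunction, disjunction and the two modalities. It is a toy evaluator, not CAAL's model checker. The transition table for the amino acid hydrophobic interaction is an assumption reconstructed from the prose (states `Ihaa`, `Ip`, `Op`, `End` are hypothetical names), and the encoding of property 4 as [hbsc](⟨bsc⟩tt ∧ [esc]ff) is one possible reading of it, not necessarily the formula used by the authors.

```python
def sat(lts, state, formula):
    """Check whether `state` satisfies an HML formula given as nested tuples."""
    kind = formula[0]
    if kind == "tt":
        return True
    if kind == "ff":
        return False
    if kind == "and":
        return sat(lts, state, formula[1]) and sat(lts, state, formula[2])
    if kind == "or":
        return sat(lts, state, formula[1]) or sat(lts, state, formula[2])
    if kind in ("dia", "box"):
        action, sub = formula[1], formula[2]
        successors = [t for a, t in lts.get(state, []) if a == action]
        if kind == "dia":  # <a>F: some a-successor satisfies F
            return any(sat(lts, t, sub) for t in successors)
        return all(sat(lts, t, sub) for t in successors)  # [a]F
    raise ValueError(f"unknown formula: {formula!r}")


# Assumed LTS: a hydrophilic side chain is exposed, a hydrophobic one buried.
IHAA_LTS = {
    "Ihaa": [("hlsc", "Op"), ("hbsc", "Ip")],
    "Op":   [("esc", "End")],
    "Ip":   [("bsc", "End")],
    "End":  [],
}

TT, FF = ("tt",), ("ff",)
# After hbsc, burying is possible and exposing is impossible.
PROPERTY_4 = ("box", "hbsc", ("and", ("dia", "bsc", TT), ("box", "esc", FF)))
```

A call such as `sat(IHAA_LTS, "Ihaa", PROPERTY_4)` then plays the role of a single model-checking query on this toy system.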

Higher abstraction level model
We might therefore wonder if there is an abstraction level at which the two folding processes would show a behavioural equivalence. As will be shown in this article, this level of abstraction can actually be defined. Its construction, however, requires a generalisation of the weak-interaction processes and the imposition of some limitations on the "expressiveness" of the protein folding process.
The first of the two aforementioned modifications can be achieved by:
• redefining nucleotides and amino acids as general elementary units, which can be paired or unpaired;
• abstracting from the specificity of each pairing process by no longer taking into account the number of hydrogen bonds formed between two (or three) paired units;
• generalising the hydrophobic interactions to their key feature of burying the hydrophobic molecules while exposing the hydrophilic ones (no longer considering the stacking process typical of the hydrophobic interactions of nucleotides).
These adjustments to the model do not affect the main property of each weak interaction, therefore the model is still faithful to the biological process. However, they are not sufficient to obtain a behavioural equivalence between the folding processes of RNAs and proteins.
What we still need to do is to limit the folding capability of proteins by reducing the number of amino acids that can interact through hydrogen bonds to three (the maximum number of nucleotides that can pair in RNAs).
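Why the limit of three units matters for behavioural equivalence can be illustrated with a naive strong-bisimilarity check in Python. The three toy systems below are assumptions made for illustration and are far simpler than the actual models: `R0` stands for hydrogen bonding among at most three nucleotides, `P0` for an unbounded chain of amino acids, and `Q0` for amino acids limited to three; all state names are hypothetical.

```python
def bisimilar(lts, p, q):
    """Naive greatest-fixpoint check of strong bisimilarity between p and q."""
    relation = {(s, t) for s in lts for t in lts}
    changed = True
    while changed:
        changed = False
        for s, t in list(relation):
            # Every move of s must be matched by t, and vice versa.
            forth = all(any(a == b and (s2, t2) in relation
                            for b, t2 in lts[t])
                        for a, s2 in lts[s])
            back = all(any(a == b and (s2, t2) in relation
                           for a, s2 in lts[s])
                       for b, t2 in lts[t])
            if not (forth and back):
                relation.discard((s, t))
                changed = True
    return (p, q) in relation


TOY_LTS = {
    # RNA: a pair, then a triple, then no further hydrogen bonding.
    "R0": [("hb", "R1")], "R1": [("hb", "R2")], "R2": [],
    # Unbounded protein: hydrogen bonding can always continue.
    "P0": [("hb", "P0")],
    # Protein limited to three units.
    "Q0": [("hb", "Q1")], "Q1": [("hb", "Q2")], "Q2": [],
}
```

In this toy setting `bisimilar(TOY_LTS, "R0", "Q0")` holds while `bisimilar(TOY_LTS, "R0", "P0")` does not, since the unbounded system can always perform one more `hb`.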
With these considerations in mind, we can rewrite the above model of the folding process.

Base pairing process
The P b2 process takes two unpaired units (uu) as input (from the F s rna process) and produces a paired unit (pu) as output. The label hb does not indicate a single hydrogen bond, but stands for the overall interaction based on hydrogen bonding.
B sr B sr , B dr B dr , B sr B dr are states that specify the type of base pair of the produced paired unit.

Triple base pairing process
The P b3 process takes an unpaired unit (uu) and a paired unit (pu) as input (from the F s rna process) and produces a triple unit (tpu) as output.
The state U b3 (triple base unit) indicates that a hydrogen bonding interaction (possibly made by more than one hydrogen bond) has taken place.

Amino acid pairing process
The P aa process takes two unpaired units (uu) as input (from the F s p process) and produces a paired unit (pu) as output. As for the base pairing process, the label hb does not indicate a single hydrogen bond.
The states NC and CN (where N and C stand for amino group and carboxyl group respectively) allow the preservation of the right complementarity of the hydrogen bond interaction between amino acids.

Triple amino acid pairing process
This is a new process (not present in the previous model); it is necessary to limit the capabilities of amino acids to hydrogen-bond with each other; as for the base pairing, at this level of abstraction at most three amino acids can be connected by the same hydrogen bonding interaction (not to be confused with a single hydrogen bond).

The P aa3 process takes an unpaired unit (uu) and a paired unit as input and produces a triple unit (tpu) as output.

Electrostatic interaction
The base electrostatic interaction (I e b ) and the amino acid electrostatic interaction (I e aa ) processes are unchanged compared with the previous model (see Section 2.2 on page 7).

Nucleotide hydrophobic interaction
Since the hydrophobic stacking is no longer considered in the new model, the hydrophobic interaction can affect a single nucleotide per iteration (folding step).
The process, renamed I h n , takes one unpaired unit as input and buries its hydrophobic component inside the RNA (hbc − bc), while exposing its hydrophilic component outside the RNA (hlc − ec).

Amino acid hydrophobic interaction
The I h aa process takes one unpaired unit as input and buries its hydrophobic component inside the protein (hbc − bc), while exposing its hydrophilic component outside the protein (hlc − ec). In this case the "component" is a generalisation of the side chain; this means that each unpaired unit taken as input can have a hydrophobic or a hydrophilic component (but not both).

Folding step
The F s rna and F s p processes perform the same tasks as in the previous model (see Section 2.4 on page 8).

The CCS specification of the whole modified F s rna process is:
F s rna def = uu.I1 n + pu.I1 n + uu.∆ G I h n + uu.I2 n + tpu.I1 n ;
I1 n def = uu.∆ G I e b + pu.∆ G I e b + tpu.∆ G I e b ;
I2 n def = uu.∆ G P b2 + pu.∆ G P b3 ;
I1 n and I2 n (nucleotide interaction) are states that allow the selection of the right subprocess on the basis of its permitted inputs.
∆ G P b2 (base pairing delta G), ∆ G P b3 (triple base pairing delta G), ∆ G I e b (bases electrostatic interaction delta G) and ∆ G I h n (nucleotide hydrophobic interaction delta G) processes check if the ∆G of the related interaction is negative.
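The way the three definitions of the modified F s rna process select a subprocess from its inputs can be explored with a short Python sketch. The ASCII process names (`FsRna`, `I1n`, `I2n`, `DGIeb`, `DGIhn`, `DGPb2`, `DGPb3`) are hypothetical transliterations chosen here, and the tiny parser handles only sums of single-action prefixes; it is not CAAL's parser.

```python
SPEC = """
FsRna = uu.I1n + pu.I1n + uu.DGIhn + uu.I2n + tpu.I1n;
I1n = uu.DGIeb + pu.DGIeb + tpu.DGIeb;
I2n = uu.DGPb2 + pu.DGPb3;
"""


def parse_ccs_sums(spec):
    """Parse lines of the form 'X = a.Y + b.Z;' into {X: [(a, Y), (b, Z)]}."""
    lts = {}
    for line in spec.strip().splitlines():
        name, body = line.split("=", 1)
        lts[name.strip()] = [
            tuple(part.strip() for part in summand.split(".", 1))
            for summand in body.rstrip("; ").split("+")
        ]
    return lts


def reachable_after(lts, state, actions):
    """Return the set of processes reachable from `state` via `actions`."""
    states = {state}
    for action in actions:
        states = {t for s in states
                  for a, t in lts.get(s, []) if a == action}
    return states


lts = parse_ccs_sums(SPEC)
```

For example, `reachable_after(lts, "FsRna", ["uu", "pu"])` returns the ∆G checks of the electrostatic interaction and of the triple base pairing, the two subprocesses that accept an unpaired unit together with a paired unit.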

The CCS specification of the whole modified F s p process is the following:
I1 aa and I2 aa are states that allow the selection of the right subprocess on the basis of its permitted inputs. ∆ G P aa (amino acids pairing delta G), ∆ G P aa3 (triple amino acids pairing delta G), ∆ G I e aa (amino acids electrostatic interaction delta G) and ∆ G I h aa (amino acids hydrophobic interaction delta G) processes check if the ∆G of the related interaction is negative.
The folding processes are still defined as the parallel composition of the folding step process and the folding step ∆G (see Section 2.5 on page 10).