Distribution of Purines and Pyrimidines over miRNAs of Human, Gorilla and Chimpanzee

Meaningful words in English need vowels to break up the sounds that consonants make. The Nature has encoded her messages in RNA molecules using only four alphabets A, U, C and G in which the nine member double-ring bases (adenine (A) and Guanine (G)) are purines, while the six member single-ring bases (cytosine (C) and uracil (U)) are pyrimidines. Four bases A, U, C and G of RNA sequences are divided into three kinds of classifications according to their chemical properties. One of the three classifications, the purine-pyrimidine class is important. In understanding the distribution (organization) of purines and pyrimidines over some of the non-coding regions of RNA, all miRNAs from three species of Family Hominidae (namely human, gorilla and chimpanzee) are considered. The distribution of purines and pyrimidines over miRNA shows deviation from randomness. Based on the quantitative metrics (fractal dimension, Hurst exponent, Hamming distance, distance pattern of purine-pyrimidine, purine-pyrimidine frequency distribution and Shannon entropy) five different clusters have been made. It is identified that there exists only one miRNA in human hsa-miR-6124 which is purely made of purine bases only. AMS Subject Classification: 92B05 & 92B15


Introduction
The monomer units for forming the nucleic acid polymers deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are nucleotides which are grouped into three different classes based on their chemical properties, i.e., purine group R={A,G} and pyrimidine group Y={C,T/U}; amino group M={A,C} and keto group K={G,T/U}; strong H-bond group S={C,G} and weak H-bond group W={A,T/U} [1]. There are two kinds of nitrogen-containing bases -Purines and Pyrimidines, first isolated from hydrolysates of nucleic acids, were identified using classical methods of organic chemistry. An important contribution was made by Emil Fischer who must be credited with the earliest synthesis of purines (1897) [2]. Purines consist of a nine member double-ring (containing carbon and nitrogen) fused together, where as pyrimidines have a six member single-ring comprising of carbon and nitrogen. [3,4]. The organization of purine-pyrimidine bases over RNA is crucial. Here we intend to understand the organization of the two chemical bases purine and pyrimidine over over some of the non-coding RNAs, microRNA.
MicroRNAs (abbreviated miRNAs) contain about 18-25 ribonucleotides that can play important gene regulatory roles by pairing to the messages of protein-coding genes, to specify messenger RNA (mRNA) cleavage or repression of productive translation [5,6,7]. miRNA genes are one of the more abundant classes of regulatory genes in animals, estimated to comprise between 0.5 and 1 percent of the predicted genes in worms, flies, and humans, raising the prospect that they could have many more regulatory functions than those uncovered to date [8,9]. The main function of miRNAs is to down-regulate gene expression [10]. A given miRNA may have hundreds of different mRNA targets, and a given target might be regulated by multiple miRNAs [11,12,13,14,15]. It is important to identify the miRNA targets accurately. miRNAs control gene expression by targeting mRNAs and triggering either translation repression or RNA degradation [16,17]. Their aberrant expression may be involved in various human diseases, including cancer [18,19,20,21,22,23,24]. miRNA regulatory mechanisms are complex and there is still no high-throughput and low-cost miRNA target screening technique [25,26,27]. It is an well known fact that each miRNA is potentially able to regulate around 100 or more targeted mRNAs and 30% of all human genes are regulated by miRNAs [28].
In this article an attempt has been made to decipher the pattern of organization of purines and pyrimidines distributions over the miRNAs of the three species human, gorilla and chimpanzee. We desire to understand how the purine/pyrimidine bases are organized over the sequence and how much distantly the purine/pyrimidine bases can be placed over the sequence. Which one of these two type of chemical bases purine or pyrimidine dominates other in terms of their frequency density over the sequence is one of our prime aims to comprehend. A simple binomial distribution (i.e. location independent occurance of the bases) fails to describe the observed variation of Purine and Pyrimidine. This encourages us to look for further patterns. We investigate, the self-organization of the purine and pyrimidine bases for all the miRNAs of the three species human, gorilla and chimpanzee through the fractal dimension of the indicator matrix. The auto correlation of purine-pyrimidine bases over the miRNAs through the parameter Hurst exponent are determined and found many of the miRNAs have identical auto correlation even if their purine-pyrimidine organization is different. All the miRNAs are compared about their nearness based on their purine-pyrimidine distribution, Hamming distance is employed among all the miRNAs in understanding the nearness of purine-pyrimidine organization. The purine-pyrimidine distance patterns including the frequency distribution have been found for all the miRNAs for all three species. All possible distinct pattern of frequency distribution are determined for all the miRNAs of both the species. Here we wish to bring attention to the reader that through our investigation, the miRNA h1291, made of only purine bases is identified. There is no miRNA (human, gorilla and chimpanzee) which is absolutely made of pyrimidines.

Datasets
From the MiRBase (a miRNA database: http://www.mirbase.org/) [29], from the family Hominidae, total of 2588 mature miRNAs of human, 357 mature miRNAs of gorilla and 587 mature miRNAs of chimpanzee are taken. Each miRNA of human, gorilla and chimpanzee are encoded as numbers starting from h1 to the total number of sequences h2588 for miRNAs of human and same has been made for miRNAs of gorilla g1 to g357 and for miRNAs of chimpanzee p1 to p587 (Additional file 1 ). We then transform the miRNAs sequences (A, U, C, G) into binary sequences (1's and 0's) according to the following rules: That means purine and pyrimidine nucleotide bases are encoded as 1 and 0 respectively into the transformed binary sequences of miRNAs. Therefore, presently we have three datasets of binary sequences of the three species human, gorilla and chimpanzee. All the computational codes are written in Matlab R2016a software.

Fractal Dimension of Indicator Matrices
Here we shall encode each binary sequences into its indicator matrices [30,31]. It is noted that there are several other technique for finding fractal dimension anf selforganization structure of DNA sequences [32,33]. Consider a set S = {0, 1} and an indicator function f : This indicator function can be used to obtain the binary image of the binary sequence as a two dimensional dot-plot. The binary image obtained by this indicator matrix can be used to visualize the distribution of ones and zeros within the same binary sequence and some kind of auto-correlation between the ones and zeros of the same sequence. It can be easily drawn by assigning a black dot to 1 and a white dot to 0. An example of indicator matrix is shown in Fig. 1 for the binary sequence Hsa − miR − 576 − 3pM IM AT 0004796: 1111010111111100111100. From the indicator matrix, we can have an idea of the "fractal-like" distribution of ones and zeros (purines and pyrimidines). The fractal dimension for the graphical representation of the indicator matrix plots can be computed as the average of the number p(n) of 1 in the randomly taken n × n minors of the N × N indicator matrix. Using p(n), the fractal dimension (FD) is defined as

Hurst Exponent of Binary Sequences
The Hurst Exponent (HE) deciphers the long term memory of time series, i.e. the autocorrelation of the time series [34,35,36]. A value of 0 < HE < 0.5 indicates a time series with negative autocorrelation and a value of 0.5 < HE < 1 indicates a time series with positive autocorrelation. A value of HE = 0.5 indicates a true random walk, where it is equally likely that a decrease or an increase will follow from any particular value with higher values indicating a smoother trend and less roughness. The Hurst exponent HE of a binary sequence {x n } is defined as ( n 2 ) HE = R(n) Hamming Distance of Binary Sequences The Hamming Distance (HD) between two binary strings is the number of bits in which they differ [37,38,39]. Since length of the miRNAs might differ and hence a special care has been taken into consideration. Suppose there are two miRNAs S m and S n of length m and n respectively (n > m), then where S n of length n window is sliding over S m from the left alignment to the right alignment and each time hamming distance (hd) is calculated, and finally minimum hd value is taken as hamming distance HD of two binary sequences.
For example, take two binary sequences S m = 010100 and S n = 1101, now sliding of S n over S m of length 4, from left to right alignment of these two sequences, we find the hamming distances are hd(010100,1101)=1, hd(010100,1101)=3, hd(010100,1101)=2, therefore we take HD = 1 (minimum) of these two binary sequences. Finding the minimum hamming distance of the two binary sequences says about the maximum similarity of two sequences over the distribution of purines and pyrimidines. The minimum value of HD = 0 when the pattern of length min(m, n) of two binary sequences of miRNAs are exactly identical i.e. similar distribution of purines and pyrimidines over the miRNAs of the two sequences and the maximum value of HD = min(m, n) when the pattern of length min(m, n) of two binary sequences of miRNAs are exactly opposite i.e. completely dissimilar distribution of purines and pyrimidines over miRNAs two sequences.
Distance pattern of purine and pyrimidine over miRNAs Here we are exploring the distance pattern of purines bases across the miRNAs human, gorilla and chimpanzee. How sparsely (closely) purine bases are placed over the miRNAs. So we found the distance (gap) between purine bases to the immediate next purine base over the miRNA sequences. For example, take a transformed binary sequence S m = 110100111000001, where 1 indicates the purines bases and 0 indicates pyrimidine bases in the sequence. From left to right the positions of 1's and 0's in serial is shown below. Now, form the distribution of 1's, we find the purine distances at 1 (two consecutive 1's at a distance of 1: 11), 2 (two consecutive 1's at a distance of 2: 101), 3 (two consecutive 1's at a distance of 3: 1001) and 6 (two consecutive 1's at a distance of 6: 1000001). So, the distance pattern of purines over the sequence is [1 2 3 6] in order. Similar to the distance pattern of purine, the distance pattern of pyrimidine bases (0's) across the miRNAs also can be determined. The distance pattern of pyrimidines of the above sequence is [1 2 4] in order. Further the distance pattern of pyrimidines [1 2 4] of a miRNA opens up a fact that there is at least one 1(= 2 − 1) and at least one 3(= 4 − 1) length purine blocks present in the miRNA. In the similar way, a distance pattern of purine triggers the presence of pyrimidine blocks in miRNAs.

Shannon entropy of miRNAs
The Shannon entropy (SE) is defined as the entropy of a Bernoulli process with probability p of one of two values [40,41,42]. It is defined as Here p 1 = k 2 l and p 2 = l−k 2 l ; here l is length of the binary string and k is the number of 1s in the binary string of length l.
The Shannon entropy is a measure of the uncertainty in a binary string. To put it intuitively, suppose p = 0. At this probability, the event is certain never to occur, and so there is no uncertainty at all, leading to an entropy of 0. If p = 1, the result is again certain, so the entropy is 0 here as well. When p = 1/2, the uncertainty is at a maximum; if one were to place a fair bet on the outcome in this case, there is no advantage to be gained with prior knowledge of the probabilities. In this case, the entropy is maximum at a value of 1 bit. Intermediate values fall between these cases; for instance, if p = 1/4, there is still a measure of uncertainty on the outcome, but one can still predict the outcome correctly more often than not, so the uncertainty measure, or entropy, is less than 1 full bit.

Results and Classifications of miRNAs
Deviation from randomness A simple random binomial (p,q) model [43], where each entries can either be purine (or pyrimidine) with probability p (or q = 1 − p) fails to address the distribution of purine or pyrimidine over miRNAs. If we calculate the mean of the distribution from the sample and calculate the probability p, and if we calculate the probability p from the variance then we get different values. In all three species, the expected variances are significantly smaller than what we would have expected from the binomial distribution.

Classification Based on FDs of Indicator Matrices
The fractal dimension for each binary sequence of human, gorilla and chimpanzee miRNAs as entire data is given in the Additional file 2 (sheet-1). Based on the fractal dimension, we have made classifications for the all the datasets of human, gorilla and chimpanzee. There are 10 clusters of miRNAs for the species human, gorilla and chimpanzee as shown in Table 1. The fractal dimensions including the histograms of all the miRNAs of human, gorilla and chimpanzee are plotted in the Fig. 2. Also a normal distribution fitting is also made as shown in Fig.3. The detail members (miRNAs) of the clusters for human, gorilla and chimpanzee are given in the Additional file 2 (sheet-2). It is observed that the FD of miRNAs of human lies in the interval [1.51, 1.9] and the largest cluster(center at 1.60) contains 931 miRNAs whereas the FD of miRNAs of gorilla lies in the interval [1.55, 1.864] which is contained in the interval [1.51, 1.9] and the FD of miRNAs of chimpanzee lies in the interval [1.54, 1.83] and the largest cluster (center at 1.58) contains 255 miRNAs. The largest cluster (center at 1.60) of miRNAs of gorilla contains 151 members. The centers of all the three largest clusters of miRNAs of human, gorilla and chimpanzee are approximately same which is a reflection of the fact that all these three species of the family Hominidae belong to the same family and are evolutionarily close. It is noted that there is no miRNAs of gorilla whose FD lies in between 1.76 and 1.84 whereas there are approximately 72 miRNAs of human and 7 miRNAs of chimpanzee whose FD lies in the said interval. There are clusters (for human, gorilla and chimpanzee) with largest centers among the other centers of the clusters contain 5, 2 and 3 members respectively.

Classification Based on HEs
For every sequence of miRNA of human, gorilla and chimpanzee, Hurst exponent are determined (Additional file 3 (sheet-1)) and then a classification is made based on the computed Hurst exponent for miRNAs of human, gorilla and chimpanzee, which is shown in Table 2. There are 10 clusters for all the species human, gorilla and chimpanzee miRNAs as shown in Table 2. The Hurst exponents and the histograms of all the miRNAs of human and gorilla are plotted in the Fig. 4. Also a normal distribution fitting is also made as shown in

Classification Based on HDs
The detail pairs of miRNAs based on Hamming distances of human, gorilla and chimpanzee are given in the additional file-4 (sheet-1), additional file-4 (sheet-2) and additional file-4 (sheet-3) respectively . We then form classes of pairs of the binary strings (miRNAs) of the three species human, gorilla and chimpanzee based on Hamming distances 0 to 21 as shown in Table 3. The bar plots with pie plots of these class frequencies are also given the Fig. 6. For all the three cases of miRNAs of human, gorilla and chimpanzee, none of the clusters with Hamming distances from 0 to 21 is empty. It is also seen that the largest clusters with HD 9 for miRNAs of human, gorilla and chimpanzee contain 1042752, 23076 and 56794 pairs respectively. It interprets that the arrangement of the purine and pyrimidine bases for most of miRNAs of human, gorilla and chimpanzee are differed by 9 bases only.

Classification Based on Distance Pattern of Purine and Pyrimidine
For all the miRNAs of human, gorilla and chimpanzee, the distance patterns between purine bases to the next immediate purine bases are obtained. There are 187, 45 and 64 clusters based on unique distinct patterns of purine bases distance (gap) of miRNAs of human, gorilla and chimpanzee respectively Additional file-5 (sheet-1,2,3). There is no miRNAs of gorilla and chimpanzee having purine distance pattern [1]. In all the three cases, it is noted that there are several clusters having only one member which means that the mirRNA of those clusters has unique purine distance pattern. In the similar fashion, for all the miRNAs of human, gorilla and chimpanzee, the distance between pyrimidine bases to the next immediate pyrimidine bases are found, which is tabulated in the Additional file-5 (sheet-1,2,3). The maximum length of the purine and pyrimidine distance patterns is found to be 5 for all the three set of miRNAs of human, gorilla and chimpanzee except five miRNAs only for human. We also have determined the frequency distribution of purine and pyrimidine bases of the miRNAs of human, gorilla and chimpanzee as presented in detail in the Additional file-6. The clusters based on the frequency distribution of purine is made and tabulated in Table 4. The histogram of frequency distribution of purine bases for the three species are given in Fig. 8.

Classification Based on SEs
For all the miRNAs of the three species human, gorilla and chimpanzee of the Hominidae family, the Shannon entropy is calculated of which detail can be seen in the Additional file 7. In the case of miRNAs of human, there are exactly 80 distinct SE values are obtained whereas in the case of miRNAs of gorilla and chimpanzee, there are 38 and 57 respectively distinct SEs are found. Based on the Shannon entropy, the miRNAs of the three species are classified into 10 clusters separately as shown in the Table 5. It is observed that most of the miRNAs of all the species human, gorilla and chimpanzee are having Shannon entropy centered at 0.96 which is the largest center of the clusters for all the three different clustering, which contain 1753, 300 and 446 miRNAs respectively. It is seen that there are 16 and 4 miRNAs of human and chimpanzee respectively centered at 0.5 (SE) but there is no member having same center 0.5 in the other set of miRNAs of gorilla. It is also seen that there is no miRNA of human and gorilla, whose Shannon entropy centered at 0.37 but there is only one miRNA of chimpanzee is having SE centered at 0.37. It interprets that none of the miRNAs of gorilla is having equal (approximately) purine-pyrimidine distribution over the sequences whereas there are 16 miRNAs of human which are approximately having random-like purine and pyrimidine distribution. Overall these observations draw an impression that almost none of the miRNAs of all three species is having random-like purine-pyrimidine distribution.

Discussion
The purine and pyrimidine analysis with the binomial distribution shows the purines and pyrimidines are not independently distributed over the miRNAs and there is a tendency of same properties (purine or pyrimidine) to repeat in a miRNA. We found the various classes using different methods where most of the cases the classes are normally distributed although the distribution of the purines and pyrimidines is not random like distribution.
There are 5 miRNAs of human in the cluster 10 based on fractal dimension as shown in the Table-1 having maximum FD. We have seen closely those sequences and found that three of them (h2248, h1954 and h2552) are pyrimidine rich sequences (94%, 90% and 90% respectively) and other two (h1835 and h1291) are purine rich sequence (95% and 100 %) as shown in Table-6. In the case of miRNAs of gorilla and chimpanzee, the cluster 10 contains 2 and 3 miRNAs respectively. Here we convict that whenever the amount of purine or pyrimidine is highly rich in the miRNA sequence then corresponding FD will be maximum.
There are several clusters having miRNAs for human, gorilla and chimpanzee with the same HEs. The frequency distribution of purine and pyrimidine are balanced for all such miRNAs having same HEs. For an example, we took 17 miRNAs from the cluster 7 of miRNAs of human based on Hurst exponent (h463, h526, . . . , h1824 and h2202) which all are having same HE (0.714484) as shown in the Table -7. It is found that the frequency distribution of purine and pyrimidine 60% and 40% (or 40% and 60%) respectively. It is also seen that these 17 miRNAs of human belong to same cluster 3 based on fractal dimension. It is observed that there are 6% miRNAs of human and chimpanzee and 14% miRNAs of gorilla having HE 0.5 which indicates a completely uncorrelated purine and pyrimidine spatial ordering over the miRNAs. Now we shall see the miRNAs of human having identical distance pattern of purine and pyrimidine. It is observed that there are very few numbers of miRNAs of human which are having identical distance pattern of purine and pyrimidine. For example, h61 miRNA is having identical distance pattern of purine and pyrimidine [1 2]. Also There are miRNAs h36, h51, h62, h83, h122 and h2584 having identical distance pattern of purine and pyrimidine [1 2 3]. There are miRNAs of gorilla g6, g13, g19, g20 and g21 having identical distance pattern of purine and pyrimidine [1 2 3 4]. In the set of miRNAs of chimpanzee, there are p55, p84, p106, p119 and p138 having identical purine and pyrimidine distance pattern. This identical distance pattern of purine and pyrimidine make a guarantee that there are purine and pyrimidine block of same length.
There are 142 distinct frequency distribution of purine and pyrimidine bases across miRNAs of human are found. Out of 2588 miRNAs of human, 194 miRNAs of human having equal frequency distribution of purine and pyrimidine and 1121 miRNAs of human having lesser frequency of purines than that of pyrimidine of 1273 miRNAs are there. This infers frequency of pyrimidine is richer than that of purine over the set of miRNAs of human. On the other hand, there are 43 miRNAs out of 357 miRNAs of gorilla having equal frequency distribution of purine and pyrimidine. There are 58 distinct frequency distribution of purine and pyrimidine bases across miRNAs of gorilla. Out of 357 miRNAs of gorilla, 146 miRNAs of gorilla having lesser frequency of purines than that of pyrimidine of 168 miRNAs are there. There are 92 distinct frequency distribution of purine and pyrimidine bases across miRNAs of chimpanzee are found. Out of 587 miRNAs of human, 67 miR-NAs of chimpanzee having equal frequency distribution of purine and pyrimidine and 268 miRNAs of human having lesser frequency of purines than that of pyrimidine of 252 miRNAs are there. This infers frequency of pyrimidine is richer than that of purine over the set of miRNAs of chimpanzee. In this regard we infer that the frequency distribution of pyrimidine over the miRNAs of these three observed species human, gorilla and chimpanzee is richer than that of the purine bases. Out of all the 2588 miRNAs, the miRNA h1291 is only miRNA containing all purine bases.
It is reported that MiR-200 is a family of tumour suppressor miRNAs consisting of five members (h609, h888, h1449, h1677, h1520, h2186, h515, h328 and h670), which are significantly involved in inhibition of epithelial-to-mesenchymal transition (EMT), repression of cancer stem cells (CSCs) self-renewal and differentiation, modulation of cell division and apoptosis, and reversal of chemoresistance [18] as shown in Table-8. We have chosen all these nine miRNAs of human including other miRNAs of human which are 0, 1 and 2 Hamming distance apart from those nine miRNAs.
It is found that h888 is 1 Hamming distance apart from h609 and h1520 although h609 and h1520 are 2 Hamming distance apart. The miRNA h888 is having approximately same HD, HE and frequencies of purine and pyrimidine bases with the miRNAs h609 and h1520. Hence we convict that h888 might also work as h609 and h1520 do. It is also observed that h1670 is 2 HD apart from the miRNAs h609 and h515 and the miRNA h1670 is showing very close as per quantitative measures and hence this miRNA h1670 would function as h609 and h515 do. There are two miRNAs of human h1449 and h1677 are 0 Hamming distance apart with same quantitative measures and hence we firmly propose that these two miRNAs would functions same. It is worth noting that both the miRNAs h1449 and h1677 have identical purine-pyrimidine organization. Following the similar argument, other association with the rest of miRNAs can also be made. It is seen that there does not exist any human miRNA which is 0, 1 or 2 HD apart from the miRNA h328.

Concluding Remarks
One of the integral divisions of nucleotides based on their chemical properties is purine-pyrimidine. We attempted to understand the distribution of purine and pyrimidine bases over all the miRNAs in three species human, gorilla and chimpanzee. Quantitatively, we deciphered the self-organization of the purine and pyrimidine bases for all the miRNAs through the fractal dimension of the indicator matrix. Also we took out the auto correlation of purine-pyrimidine bases through the parameter Hurst exponent. To get the nearness of the miRNAs based on their purine-pyrimidine distribution, HD is deployed. The purine-pyrimidine distance patterns including the frequency distribution have been found for all the miRNAs. For all these parameters, we did cluster the miRNAs into several clusters. Based on the quantitative investigation, some crucial observations are adumbrated in the discussion. It is worth noting that through our investigation, the miRNA h1291, made of only purine bases is identified.
Through these investigations what it fails to attain is the association among the miRNAs and target mRNAs as shown in Table 9, Table 10 and Table 11. The complex relationship among miRNAs and target mRNAs, previously pointed out in literatures [12] has been reconfirmed through our quantitative analysis over a set of miRNAs and target mRNAs. Imperfect base-pairing between the miRNA and the 3'-UTR of its target mRNA leads to blockage of translation, or at least accumulation of the mRNA's protein product, whereas perfect or near-perfect base-pairing between the miRNA and the middle of its target mRNA causes cleavage of the mRNA, thereby inactivating the same. Such diverse patterns of miRNA action may be responsible for making the correlation among miRNA and target mRNA complex, that are yet to be resolved decisively as pointed out by several researchers earlier. In this context, we plan to bring out the pattern of organization of nucleotides following the other two modes of classifications (amino-keto and strong H-bond and weak Hbond) based on the chemical properties of nucleotides in order to integrate the whole three kinds of grouping to find out the correlationship among the miRNAs-mRNAs and corresponding targeted regions of mRNAs.