One novel representation of DNA sequence based on the global and local position information

A novel representation of DNA sequences that combines the global and local position information of the original sequence is proposed to distinguish different species. First, to exploit global information, a graphical representation of the DNA sequence is formulated on the curve of the Fermat spiral. Then, to capture the local characteristics of the DNA sequence, each point on the Fermat spiral is assigned a mass based on the relationships among four neighboring nucleotides. The normalized moments of inertia of the Fermat spiral curves composed of these massive points are calculated as the numerical description of the corresponding DNA sequence, using the first exons of β-globin genes. With the Euclidean distance as the measure between numerical descriptions, the resulting similarities between species demonstrate the performance of the proposed method.

The graphical and numerical representation of DNA, RNA or protein sequences has become a popular strategy for analyzing the evolutionary relationships between species. With the availability of various gene data for different species, the comparison of organisms that carry unique genetic information involves mathematics, biology, physics, informatics and other fields. Many researchers have focused on the representation of gene sequences, so its study is significant and beneficial.
Hamori and Ruskin [32] first proposed the H-curve, a graphical representation of nucleotide sequences that is convenient for the visual analysis and comprehension of DNA sequences. Following them, further research on the representation of DNA sequences was carried out [33-44]. For example, Zhang [45] proposed a five-color map visualization of DNA sequences named ColorSquare. Jafarzadeh [1] constructed the C-curve with no loss of information. Aram [5] introduced a new graphical representation of DNA sequences called the spider representation. Moreover, Bielinska-Waz [10] represented the sequence with a set of discrete lines referred to as the B-spectrum. Unfortunately, owing to high degeneracy, loss of information and the large amount of space needed to transform a DNA sequence into a graphical representation, the performance of many methods is not as satisfactory as expected.
To solve those problems, we present a novel representation of DNA sequences based on global and local position information. In contrast to previous reports, the new method yields a more effective representation and restrains the possible effect caused by differing DNA sequence lengths. In detail, the representation involves (1) formulating the graphical representation of the DNA sequence on the curve of the Fermat spiral, which retains the global position information of the original sequence; (2) taking the local position information of the DNA sequence into consideration by attaching a mass to each point on the Fermat spiral; and (3) calculating the normalized moments of inertia of the Fermat spiral curves composed of the massive points as the description of the corresponding DNA sequence, using the first exons of β-globin genes.

Graphical representation of DNA sequence
In order to make full use of the global information of the DNA sequence, the original DNA sequence is divided into four subsequences consisting of A, C, G or T, so that four corresponding point sets can be obtained from the positions of the nucleotides in the original DNA sequence. Thus, each nucleotide in a subsequence corresponds to one point in the set. By distributing each point set along the curve of the Fermat spiral, four corresponding curves, which form the graphical representation of the DNA sequence, can be plotted. We choose the Fermat spiral instead of the circle as the distribution curve of the subsequences because the radius of the Fermat spiral increases monotonically with the polar angle, which retains the position information of the original sequence.
We regard the DNA sequence as the base sequence (BS), which is constituted by the four subsequences AS, CS, GS and TS. Concretely, the i-th nucleotide in BS is denoted as $V_i^{BS}$, $i = 1, 2, \cdots, N_{BS}$. The length of the base sequence is equal to the total length of the four subsequences:
$$N_{BS} = N_{AS} + N_{CS} + N_{GS} + N_{TS} \qquad (1)$$
where $N_{BS}$, $N_{AS}$, $N_{CS}$, $N_{GS}$ and $N_{TS}$ respectively denote the number of nucleotides in the base, A, C, G and T subsequences. To plot the base curve of the Fermat spiral corresponding to the base sequence, the coordinates of the points in the polar coordinate system are calculated from their positions in the base sequence.
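The division into four position sets can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_positions`, the toy sequence, and the choice to skip symbols other than A, C, G and T are all assumptions.

```python
def split_positions(sequence):
    """Return {nucleotide: [1-based positions]} for A, C, G and T."""
    positions = {base: [] for base in "ACGT"}
    for index, base in enumerate(sequence.upper(), start=1):
        if base in positions:  # assumption: ambiguous symbols such as N are skipped
            positions[base].append(index)
    return positions


# Toy sequence chosen for illustration only.
toy_sequence = "ATGGTGCACC"
toy_subsets = split_positions(toy_sequence)

# The subsequence lengths add up to the length of the base sequence.
total_length = sum(len(p) for p in toy_subsets.values())
```

Each list keeps the original positions, so the global position information survives the split.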
For each point, the polar angle is calculated from the position of the nucleotide in the base sequence. Here $\theta_{V_i^{BS}}$ denotes the polar angle of nucleotide $V_i^{BS}$ in the polar coordinate system; $L$ is a constant equal to the shortest DNA sequence length among the species in the experiment; and $L_{V_i^{BS}}$ denotes the position of nucleotide $V_i^{BS}$ in the base sequence, ranging from 1 to $N_{BS}$. The radial coordinate then follows from the equation of the Fermat spiral, so that each nucleotide in the base sequence is assigned a pair of polar coordinates. Correspondingly, four subsets of points, one per subsequence, can be obtained and plotted. Fig. 1 shows the graphical representation of the first exon of the human β-globin gene.
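A minimal sketch of placing positions on a Fermat spiral is given below. It rests on two assumptions that are not taken from the text: the polar angle is `2*pi*position/shortest_length`, and the spiral has the standard form rho = a*sqrt(theta) with a hypothetical scale parameter `a`. The function name `fermat_spiral_points` is likewise illustrative.

```python
import math


def fermat_spiral_points(positions, shortest_length, a=1.0):
    """Map 1-based sequence positions to (rho, theta) on rho = a * sqrt(theta).

    Assumption: theta = 2 * pi * position / shortest_length. The exact
    angle formula used in the paper is not reproduced here.
    """
    points = []
    for position in positions:
        theta = 2.0 * math.pi * position / shortest_length
        rho = a * math.sqrt(theta)  # radius grows monotonically with theta
        points.append((rho, theta))
    return points


# Positions 1 and 5 differ by one full turn when shortest_length is 4.
toy_points = fermat_spiral_points([1, 5], shortest_length=4)
```

On a circle, two positions one full turn apart would land on the same point. On the spiral they share a direction but differ in radius, which is the property the text relies on.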

Attaching each point with a mass
In order to make full use of the information carried by the DNA sequence, local characteristics are taken into consideration by attaching a mass to each point corresponding to a nucleotide in the base sequence. Each nucleotide is considered in a group of four: its one immediate 5′ neighbor, the nucleotide itself in the second position, and its two immediate 3′ neighbors. The number of times the second nucleotide occurs in the group and the compactness of its arrangement are the criteria used to determine the mass of the second nucleotide.
According to the number of times the nucleotide in the second position repeats in the group, the groups can be divided into four categories. In the following, a nucleotide identical to the second nucleotide is denoted as 1 and a nucleotide different from it is denoted as 0.
(1) one occurrence: 0100
(2) two occurrences: 1100, 0110, 0101
(3) three occurrences: 1110, 0111, 1101
(4) four occurrences: 1111
A mass determined by the category of the group is attached to the point corresponding to the nucleotide in the second position. However, to reduce the impact of DNA sequences that are too long, the masses of the part of the sequence after position $L$ are restrained.
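The 0/1 encoding of a four-nucleotide group can be sketched as below. The function names are illustrative, indices are 0-based, and treating neighbors that fall outside the sequence as 0 is an assumption; the text does not say how the ends of the sequence are handled. The sketch stops at the pattern and its category and does not assign mass values.

```python
def context_pattern(sequence, index):
    """Return the 0/1 pattern of the group around the nucleotide at 0-based `index`.

    The group holds one 5' neighbour, the nucleotide itself (second
    position) and two 3' neighbours. Assumption: neighbours outside the
    sequence count as 0.
    """
    target = sequence[index]
    bits = []
    for offset in (-1, 0, 1, 2):
        j = index + offset
        same = 0 <= j < len(sequence) and sequence[j] == target
        bits.append("1" if same else "0")
    return "".join(bits)


def occurrence_category(pattern):
    """Number of nucleotides in the group equal to the second one (1 to 4)."""
    return pattern.count("1")
```

For example, `context_pattern("AACA", 1)` yields `"1101"`, which falls in the three-occurrence category.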

Numerical Representation
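Because the moment of inertia is widely applied in numerical representations of genes [10,11,15,16], the normalized moment of inertia of each massive sub-curve of the Fermat spiral is calculated as the numerical representation of the DNA sequence in this paper. For convenience of calculation, the polar coordinates $(\rho, \theta)$ of each point are first transformed to plane coordinates:
$$x_{V_i^{BS}} = \rho_{V_i^{BS}} \cos\theta_{V_i^{BS}}, \qquad y_{V_i^{BS}} = \rho_{V_i^{BS}} \sin\theta_{V_i^{BS}}$$
The squared distance of each massive point is then computed, and from it the normalized moment of inertia of each sub-curve is obtained. The similarity between two species is measured by the Euclidean distance between their numerical representations:
$$S(\alpha, \beta) = \sqrt{\sum_{BS \in \{AS, CS, GS, TS\}} \left(\alpha_r^{BS} - \beta_r^{BS}\right)^2} \qquad (10)$$
where $\alpha_r^{BS}$ and $\beta_r^{BS}$ respectively denote the numerical representations of species $\alpha$ and $\beta$. Thus $S(\alpha, \beta)$ denotes the similarity distance between the vectors $\alpha_r^{BS}$ and $\beta_r^{BS}$ in the 4-dimensional space.
Table 4. Similarity/dissimilarity between Human and other species with different methods.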
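The numerical step can be sketched as follows. The normalization shown here is an assumption: the moment of inertia is taken about the mass-weighted centre of the points and normalized as the square root of inertia over total mass, which may differ from the paper's Eq. (9). The function names are illustrative. The Euclidean distance follows the description in the text.

```python
import math


def normalized_moment_of_inertia(points, masses):
    """points: [(rho, theta)], masses: positive weights, one per point.

    Assumption: inertia about the mass-weighted centre, normalised as
    sqrt(I / total_mass). The paper's own normalisation may differ.
    """
    xy = [(rho * math.cos(theta), rho * math.sin(theta)) for rho, theta in points]
    total = sum(masses)
    cx = sum(m * x for m, (x, _) in zip(masses, xy)) / total
    cy = sum(m * y for m, (_, y) in zip(masses, xy)) / total
    inertia = sum(m * ((x - cx) ** 2 + (y - cy) ** 2) for m, (x, y) in zip(masses, xy))
    return math.sqrt(inertia / total)


def similarity_distance(alpha, beta):
    """Euclidean distance between two 4-dimensional descriptors (A, C, G, T)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(alpha, beta)))
```

A species is then described by four values, one per sub-curve, and two species are compared with `similarity_distance`.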

Results and Discussion
We test the performance of the proposed method on a standard dataset that is popular in DNA representation research: the first exons of the β-globin gene of different species, listed in Table 1. Table 2 shows the numerical representation of the DNA sequence of each target species, computed according to Eq. (9). From these 4-dimensional vectors, Table 3 shows the similarity/dissimilarity between pairs of species according to Eq. (10). For comparison, Table 4 shows the similarity/dissimilarity between Human and other species obtained by other methods that also take the Euclidean distance as the measure. Table 4 shows that most of the listed methods [1,2,10,46,47] also conclude that Gorilla is the most similar species to Human and Chimp the next most similar. The exception is method [33], which reaches the closely related conclusion that Chimp is the most similar species to Human and Gorilla the next most similar. Besides, some of the listed methods [2,10,47] also conclude that Gallus is the most dissimilar species to Human.
To make the results easier to compare visually with those of other papers [1,2,10,33,46,47] that also use the Euclidean measure, the distances are normalized so that $S_{\text{human-gallus}} = 1$. As shown in Fig. 3, different methods yield different results, each of which may be useful under different considerations.
In conclusion, the paper presents a novel method for extracting the characteristics of a DNA sequence through graphical and numerical operations, which can effectively support the similarity/dissimilarity comparison of different species. In this method, distributing the sequence along the curve of the Fermat spiral retains the global position information, and attaching a mass to each point retains the local position information. Specifically, in our results the group Rat-Mouse-Lemur-Rabbit is more similar to the group Human-Gorilla-Chimpanzee than the group Goat-Bovine-Opossum is, which may be helpful for exploring the evolutionary relationships between species. Moreover, the similar pairs obtained by our method illustrate the performance of the proposed representation of DNA sequences.
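The normalization amounts to dividing every distance by the Human-Gallus distance, as in the sketch below. The function name is illustrative and the numbers are made up for demonstration; they are not values from the paper's tables.

```python
def normalize_distances(distances, reference):
    """Divide every distance by the reference one (assumed to be non-zero)."""
    scale = distances[reference]
    return {species: value / scale for species, value in distances.items()}


# Made-up distances to Human, for illustration only (not data from the paper).
toy_distances = {"Gorilla": 0.5, "Chimpanzee": 1.0, "Gallus": 4.0}
toy_normalized = normalize_distances(toy_distances, "Gallus")
```

After this step the reference species sits at 1 for every method, so only the relative ordering and spacing of the other species are compared.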