DNA word analysis based on the distribution of the distances between symmetric words

We address the problem of discovering pairs of symmetric genomic words (i.e., words and the corresponding reversed complements) occurring at distances that are overrepresented. For this purpose, we developed new procedures to identify symmetric word pairs with uncommon empirical distance distribution and with clusters of overrepresented short distances. We speculate that patterns of overrepresentation of short distances between symmetric word pairs may allow the occurrence of non-standard DNA conformations, such as hairpin/cruciform structures. We focused on the human genome, and analysed both the complete genome as well as a version with known repetitive sequences masked out. We reported several well-defined features in the distributions of distances, which can be classified into three different profiles, showing enrichment in distinct distance ranges. We analysed in greater detail certain pairs of symmetric words of length seven, found by our procedure, characterised by the surprising fact that they occur at single distances more frequently than expected.


Materials and Methods
Materials. We used the complete DNA sequences of the human genome, downloaded from the website of the National Center for Biotechnology Information (NCBI). We processed the available assembled chromosomes (GRCh38.p2) as separate sequences. All ambiguous or unsequenced nucleotides, i.e., all non-ACGT symbols, were considered sequence delimiters.
We also used pre-masked sequences 10 available from the UCSC Genome Browser (http://genome.ucsc.edu) downloads page. These files contain the same GRCh38 assembly sequences, but with repeats reported by RepeatMasker 11 and Tandem Repeats Finder 12 masked by Ns.
To address the problem of possible assembly artefacts, we also used the whole-genome shotgun assembly (WGSA, to which we refer as "Celera") of the human genome generated at Celera in December 2001 13 , and the May 2007 HuRef genome of J. Craig Venter, sequenced with capillary-based whole-genome shotgun technologies using the Applied Biosystems 3730xl DNA analyser, and de novo assembled with the Celera Assembler 14 , to which we refer as "HuRef ".
Distance between symmetric word pairs. Consider the alphabet  = A C G T { , , , } and let w be a symbolic sequence (word) defined in  k , where k is the length of w. In this work, the pair composed by one word, w, and the corresponding reversed complement word, w′, is called a symmetric word pair. For example, (AC, GT) is a symmetric word pair.
We are interested in finding the distance between a given w and w′, with no w or w′ between them. As an example, consider w = AC and the sequence ACTACTCCGTACTATAGTCGT . In this example, there are three occurrences of the word AC (underlined), but only the 2nd and the 3rd occurrences are considered for the calculation of distances to their nearest reversed complements (overlined), since between the 1st and the 2nd occurrences of w there are no occurrences of w′. Distances are measured between the start positions of the words, so a distance d between reversed complements of length k implies that the words are separated by (d − k) intervening nucleotides. In this example, d = 5 for the first (AC, GT) pair and d = 6 for the second.
Distances d < k may only occur if a suffix of w matches a prefix of w′. On the other hand, d = k is impossible for words such as CGCG. To avoid this dependence on the specific composition of w, distances d ≤ k are not considered for analysis.
The distribution of the distances of nearest reversed complements (DNRC) is denoted as f w,w′ . Note that f w,w′ may be different from f w′,w .
For a fixed word length, k, we are also interested in the overall DNRC distribution across all the symmetric word pairs. We define the global DNRC distribution, f k , as a weighted sum of the DNRC distributions of all symmetric word pairs with words of length k, where n w,w′ is the number of observations of nearest pairs of symmetric words (w, w′) of length k, and n is the total number of such distances. Only the analysed distances (d > k) are counted in n and n w,w′ . For generating the symmetric words, we used a simple algorithm that, for each position in the DNA sequence, i, and associated word of size k, w, searches for the first occurrence of w′. If w is found before w′, the algorithm skips to the next position i. For practical reasons, a maximum searching distance is specified by the user, allowing the program to maintain in memory a table with all possible words w and the corresponding number of occurrences at each distance.
In order to study the behaviour of the empirical global DNRC distributions of the human genome, f k , we carried out comparisons with the DNRC distributions obtained from nucleotide sequences generated by a k-order Markov process (random background). The expected global DNRC distribution under k-order Markov dependence, f k e , can be deduced using the transition probabilities and a state diagram that represents the progress made towards identifying w or w′ as each symbol is read from the sequence. The algorithm used to find this exact distribution 15 is a special case of Fu's procedure based on finite Markov chain embedding 16 .
Parameter assumptions. The stem and loop lengths of hairpin/cruciform structures seem to vary over a wide range. According to different authors, the stem length varies between 6 and 100 nucleotides, while loop lengths may range from 0 to 2000 nucleotides 6,17,18 .
Since this study intends to characterise the short distances between symmetric words, but avoiding the direct word dependencies, a range of distances from (k + 1) to 1000 was considered for computing all the DNRC distributions. Taking into account computational limitations and the possible stem length of cruciform structures, the histograms of the DNRC were computed for all symmetric word pairs of lengths up to seven, for all human chromosomes. For each of these sequences, a global DNRC distribution, comprising all symmetric word pairs of the same length, was also determined.
Chromosome homogeneity. To assess the homogeneity of the global DNRC distribution, for a fixed k, among all chromosomes of the genome, we used the phi coefficient, where O w,j is the observed frequency count of distances from w to w′ in chromosome j, and E w,j is the expected frequency count under homogeneity, with  ∈ w k and ∈ … j X Y {1, , 22, , }.
The assumption of homogeneity of the distance distributions of the chromosomes allows us to discuss the statistical properties of the complete genome based on a sequence with all chromosomes concatenated.
Residual analysis. From the perspective of molecular evolution, DNA sequences may reflect both the results of random mutation and of selective evolution. In order to highlight the contribution of selective evolution, one should subtract the random background from the simple counting result 19,20 . To this purpose, the global DNRC distributions expected under the k-order Markov dependence, f k e , were obtained and the goodness-of-fit was evaluated by the ϕ measure (ϕ = 0 reveals a perfect fit between the distributions). To explore the differences between the empirical and the expected distributions, a residual analysis was carried out through the calculation of standardised residuals for a given distance d, are given by k k e where n is the total number of observed distances between symmetric pairs of length k and σ =  is the standard deviation of a binomial distribution. These standardised residuals are used to highlight the contribution of the selective evolution on the relative position of the symmetric word pairs. We recall that, under k-order Markov dependence assumption, each standardised residual has an asymptotic standard normal distribution 21 .
The focus of this study is mainly in the short distances between symmetric word pairs, thus we fixed a maximal distance to 1000. The global Type I error was fixed to α = 5% and, for each distance comparison test, it was correct to 0.05 (1000 − k). So, absolute residuals greater than four are considered to be significant residuals.
Short distances between reversed complements may be related with the occurrence of cruciform structures, with maximum loop length of twenty nucleotides 6 . To identify a thresholding distance which may discriminate the overrepresented short distances from the underrepresented, we assumed that short distances up to the threshold are overrepresented and the others are underrepresented (this assumption makes sense under the hypothesis of enrichment of words able to form cruciform structures).
We determined the thresholding distance, d, as the distance that maximises the sum of the number of significant positive residues less than d and the number of significant negative residues greater than d. We defined a discriminator function as a sum of indicator functions (for example, 4} and r is defined in Equation 4. The value d that maximises the discriminator function, R, was considered as the thresholding distance.

Results and Discussion
The global DNRC distribution was determined for each of the 24 human chromosomes and each word length = … k ( 1, ,7). These distributions have heavy tails, strongly affecting the chi-square statistic. To avoid this problem, for each k, a cutoff distance was defined as the 99th percentile of the DNRCs observed in the complete genome (all chromosomes). Distances larger than this cutoff were lumped together into a residual class, in each distribution. Naturally, the DNRCs and hence the cutoff distances were found to increase with word length in the human genome, as would be expected even in a sequence of randomly generated nucleotides.
We measured the degree of homogeneity (ϕ effect sizes measure) between the human chromosomes, for the global DNRC distributions. According to the obtained ϕ values (ϕ < 0.04), we conclude that the homogeneity effect is weak. Thus, we consider that there is homogeneity between the global DNRC distributions of the several chromosomes. This chromosome homogeneity in the global DNRC distributions points to a general feature of the complete human genome, which may be due to genomic architecture constrains.
Global DNRC distributions for the complete genome. The discrepancies between the global DNRC distribution in the human genome and in the k-order Markov process were measured by ϕ effect size measure. Although the misfit effect is not strong, it is nevertheless non-negligible. The ϕ values are always greater than 0.05 and the p-values smaller than 0.05. Figure 1 shows the global DNRC distributions of the human genome and the global DNRC distributions of the k-order Markov random sequence, for k = 6 and k = 7. The misfit between the human distance distributions and the corresponding k-order Markov process is clear. Analysing the residuals between the empirical distribution and the distribution of this random background, we observe a tendency of overrepresentation of short distances in the human genome, for all analysed values of k. Figure 2 presents the results of the residual discriminator function (the R profile), for the global DNRC distribution of the complete human genome (between the observed and the corresponding k-order Markov process), for k = 6 (left) and k = 7 (right). The discriminator functions increase for <  Although some words may be visually identified as having enrichment of short distances, it is not always possible to perform meaningful statistical analysis, due to the small number of occurrences of the DNRC. Moreover, the runs of significant positive residuals (r > 4) are related with the ranges of overrepresented short distances.   The following procedure was developed to identify the DNRC distributions containing islands of favoured short distances, for a given k: • We exclude the symmetric word pairs with occurrence frequency lower that 0.0001; • We exclude the pairs (w, w′) such that f w,w′ (d) = 0 for more than 5% of the distances d in [k + 1, 1000]; • The misfit between f w,w′ and f k is evaluated by the phi coefficient. The symmetric word pairs with ϕ > 0.80 are considered to have a very strong effect size 22  The four successive filters of the procedure above reduce the initial set of words to 17, for k = 6, and to 48, for k = 7. Note that other thresholds could have been used in the procedure, which would result in the selection of different subsets of words (see the Supplementary Material).
In order to classify the shape of the DNRC distribution of each symmetric word pair, a residual discriminator function R was obtained for each word pair, based on adjusted Pearson residuals, r a , computed from the contingency table of all words of length k and distances between (k + 1) and 1000, instead of the standardised residuals (eq. 4). The adjusted Pearson's residuals are given by where n is the total number of DNRC counts for a given word length k, as defined in (1). The symmetric word pairs were classified in three different types, according to the R graphical profile: T 1 -A profile showing a marked initial increase, reaching its maximum, and stabilising or decreasing after it; see, for example, Fig. 4 (left); T 2 -A profile showing an initial decrease, comprising smooth or strong inverted peaks; see, for example, Fig. 4  (right).
T 3 -Other profiles, not matching previous criteria. The pairs of type T 1 are characterised by high residual discriminator values (max(R) > 50), and their DNRC distributions show an enrichment for short distances (d < 100). See, for example, Fig. 3, left. Pairs of type T 2 also have high residual discriminator values, but their DNRC distributions show an overrepresentation for distances d > 100, with localised bell-shaped peaks. See, for example, Fig. 3, right. All pairs of type T 3 have irregular low-R profiles (max(R) ≤ 50). Table 1 presents the subset of symmetric word pairs obtained by our procedure, for k = 7. The table also contains the maximum DNRC frequency and the corresponding distance, the max(R) values, the distribution type, and the distance peak location class. It was observed that type T 1 is the largest group and is formed by TA-rich words. Most DNRC distributions of this type reach their maxima for d < 100 (C1). Curiously, it was reported that, in E. coli, cruciform formation is enhanced by TA-rich sequences and may correlate with transcriptionally-active promoters 23,24 . Also, the cruciform-binding protein PARP-1 (Poly(ADP-ribose) polymerase-1), which is involved in DNA recombination and repair, was shown to interact with promoter-localised cruciforms 25 , and promoters are frequently enriched with TA elements 26 . Thus, the overrepresentation of short distances of TA-rich symmetric word pairs, detected by the procedure that we propose, may point to the occurrence of hairpin/cruciform structures.
The proposed procedure also identifies the T 2 group. DNRC distributions in this group have localised bell-shaped peaks for d > 100, forming marked islands of favoured distances. The occurrence of peaks in the short distance region of the DNRC distribution could signal the formation of hairpin/cruciform structures. However, DNRC distribution peaks for d > 100 could be associated to other structural or functional DNA functions.
Single over-favoured distance. Apart   single distance very highlighted, due to its high frequency. In order to perform an automatic selection of this kind of words, we defined the procedure: • We start with the complete set of symmetric words of a fixed length k; • We exclude the symmetric word pairs with occurrence frequency lower that 0.0001; • We exclude the pairs (w, w′) such that f w,w′ (d) = 0 for more than 5% of the distances d in [k + 1, 1000]; • The remaining words are sorted by the maximum frequency, max(f w,w′ ), of the distances under analysis = + … d k 1, , 1000);  Table 1. Words of length seven with DNRC overrepresentation of short distances, identified by our procedure, with indication of DNRC distribution maximum and its argument value, discriminator R function maximum, group type (T1 and T2) and class of peak distance. Distance peak classes: C 1 (d < 100), C 2 (d ≈ 100), C 3 (100 < d < 200) and C 4 (d > 200).
The first words obtained by these criteria identify words with a single over-favoured distance. To the purpose of classifying the words with relation to single favoured distance, we defined three subsets: peak in distances d ≤ 30, peak in distances 30 < d ≤ 200 and peak in distances d > 200. Table 2 shows the first five words obtained by the previous procedure, for each peak interval type. Taking into account the expected decrease of the distribution, the peak in distances d > 200 is an obvious unexpected behaviour. It is noteworthy that for some words a single distance accounts for about 70% of occurrences in a total of (1000 − k) distances. Figure 5 presents the DNRC distribution of CATTAGG (first word in Table 2). This symmetric word pair shows a single over-favoured distance d = 14 ( ≈ . ). In a y-axis zoom (right), a local island of favoured distances is observed. However, in general, these frequencies do not surpass the global distance distribution behaviour. Figure 6 shows another example of a symmetric word pair with a single over-favoured distance at d = 133.
In the absence of obvious biological motivation for the occurrence of these single over-favoured distances, we conducted further analyses for some word pairs that have these features. To address the possibility that the reported behaviour may result from a sequencing procedure artefact, we studied the pair (CCACAAT, ATTGTGG) in detail. Using three independently sequenced and assembled genomes (Celera, HuRef, GRCh38.p2), we computed the distance distributions and found a similar peak in each (see Table 3, displaying the frequencies around distance 133), ruling out the hypothesis that the observed distance peaks are sequencing or assembly artefacts.
We further analysed the sequences comprised between CCACAAT and ATTGTGG. Taking into account the sequence direction, the distance 133 was only enriched for CCACAAT to ATTGTGG (15599 occurrences) but not for ATTGTGG to CCACAAT (11 occurrences, which is in the expected range). The sequence logo (not shown) for the 15599 sequences of the GRCh38.p2 human genome, obtained using WebLogo 3.4 27 , shows a significant degree of conservation, suggesting that these sequences may be part of repetitive DNA segments. Using the genomic coordinates for the CCACAAT words which are at distance 133 from ATTGTGG words, we searched the RepeatMasker annotations available from the UCSC  active transposable elements that also mobilise non-autonomous elements, such as Alu sequences, thus shaping the genome landscape and variation, with implications in evolution and disease 28,29 .
Masked Sequences. To reduce bias from known repetitive sequences in the original genome assembly, we also analysed a pre-masked version of the genome (as reported by RepeatMasker and Tandem Repeats Finder). Masked sequences exclude major known classes of repeats 30 , such as long and short interspersed nuclear elements (LINEs and SINEs), long terminal repeat elements (LTRs), Satellite repeats or Simple repeats (micro-satellites). As expected, masking eliminates distance peaks in several DNRC distributions. For instance, the DNRC distribution of w = CCACAAT (Fig. 6) loses the strong peak observed for the complete genome, because the enrichment of distance 133 is due to LINEs repeats. However, the peaks are preserved in several other distributions.
To select words with single over-favoured distances, in these masked sequences, we applied the procedure described in the previous section. As before, results were classified in three subsets.
The highest-ranking words in the d ≤ 30 group are TA-rich words. Also, we observed that the shape of the DNRC distributions for these words remain unchanged by the masking of repeats. These distributions preserve their characteristic islands of enriched short distances. The distributions of highest-ranking words in the other two groups do not show islands of favouring distances. They display just one or a few strong peaks in the repeat-masked genome.
Globally, the distance peak of the DNRC distributions in the repeat-masked genome correspond to a local maximum in the original genome (80% of the distributions). Table 4 shows the first five words obtained for each subset, for k = 7. The words reported in Table 4 do not show up in the top-5 list of the complete genome sequence (Table 2). Nevertheless, the peaks detected in their distributions are also local maxima, or even global, in the corresponding non-masked distribution. As an example, Fig. 7 shows the DNRC distribution of TATGAAT in the complete genome (left) and in the repeat-masked genome (right). Distance 63 is the distribution mode (peak) in the repeat-masked genome, and also a local maximum in the distribution extracted from the complete genome.   The peaks of DNRC distributions of words in Table 4 were analysed in order to assess the existence of biological features. The peak distances in the d ≤ 30 subset arise from the overall contribution of several chromosomes. For the words in the other subsets, there is clearly a chromosome that is the main contributor to the single distance peak (see Table 4). Annotations within genomic coordinates for the words listed in Table 4 were retrieved from UCSC GENCODE v24 (https://genome.ucsc.edu/cgi-bin/hgTables) and the resulting gene lists were analysed with the functional annotation tool in DAVID 31,32 . Overall, word pairs with peaks at distances d > 30 are enriched in genes with several and well-defined protein domains, namely, DNA-binding Zinc-finger proteins and members from the neuroblastoma breakpoint family (NBPF). These are duplicated genes with extreme copy number expansion that are associated with brain development and pathology, and are located in a human-specific pericentric inversion in chromosome 1 33,34 . Word pairs with distance peaks at d ≤ 30 are scattered throughout the genome, and show enrichment in genes associated with the membrane, which also display a conserved protein topology. As in the T 1 group of the complete genome, the words of this subset are TA-rich which may be associated with cruciforme structure occurrence.

Conclusions
We developed new procedures to describe some characteristics of genomic words. In particular, the relative position and distance between reverse complemented word pairs was addressed, using the notion of distance to the nearest reversed complement (DNRC). Under this framework, we studied the DNRC distribution of each word in comparison with the global DNRC distribution and verified the homogeneity of the global DNRC distribution across human chromosomes, for word sizes 1 ≤ k ≤ 7.
Using these novel procedures for genomic word detection, we were able to find words with unexpected features in the DNRC distribution, which could not be detected by word frequency procedures alone. The detection of pairs of symmetric words that occur very often at a fixed distance (e.g., the pair (CCACAAT, ATTGTGG) at distance 133) suggests structural characteristics of the DNA. Some of these are already known but some others may be new.   We explored the global DNRC distributions of words of lengths k = 6 and k = 7 in the human genome, comparing them with the expected distributions obtained under k-order Markov dependence. A lack of fit was globally detected. The global DNRC distributions show a strong overrepresention of distances up to 350, a feature that may be associated with the occurrence of cruciform structures.
The DNRC distributions of some word pairs display significantly overrepresented distances. In the complete genome, those distributions fall into one of several distinct patterns: distributions with islands of favoured distances d < 100 (typically TA-enriched words); distributions with islands of favoured distances between 50 and 350; distributions with a single overrepresented distance. In the masked genome version, distributions with islands of favoured distances for d ≤ 30 (typically TA-enriched words) and distributions with single over-favoured distance for d > 30, were observed. Some of these peaks are present in both complete and masked genomes, thus they are not related to the major known classes of repeats.
DNA structures such as stem-loops and cruciforms are formed at sites that contain reversed complementary words. For this reason, their study naturally leads to the study of the symmetry properties of the sequences, and in particular to the study of the distribution of distances between nearest reversed complements. We performed an exhaustive study of these distance distributions and identified words that are strong candidates to the formation of cruciform structures in human DNA. We are convinced that the new procedures defined and proposed in this work are relevant for a better understanding of the structure of DNA.