Introduction

As the second most abundant form of human genetic variation, indels (insertions and deletions) are also a significant source of variation that accounts for the majority of differences between species1,2. Indels also contribute to the pathogenesis of diseases3 and to changes in gene expression and protein function4.

According to the Human Gene Mutation Database (HGMD)5, indels are associated with at least 22% of severe human diseases, including cystic fibrosis, fragile X syndrome, Huntington disease and many types of cancer5,6. Indels in coding regions, even in-frame ones, can lead to abnormal protein folding and protein degradation7. A well-known example is cystic fibrosis, a genetic disease frequently caused by a 3-bp deletion within the coding region of CFTR8,9. Indels in noncoding regions can likewise cause human diseases through the expansion or shrinkage of repeats. A well-known example is fragile X syndrome, caused by the expansion of a short trinucleotide repeat in the promoter region of the FMR1 gene10. This insertion changes the promoter methylation status and thus the expression pattern of FMR1.

With recent advances in next-generation sequencing technology, many indel detection methods have been proposed11,12,13,14. These studies have yielded encouraging results and contributed to our understanding of the origin of indels. The advances have also produced a large amount of indel data, making it possible to analyze the genome-wide distribution of indels and their effects on humans15. However, questions remain regarding how and where indels occur.

DNA structural properties play important roles in many biological processes, including protein-DNA interactions16, transcription initiation17, replication18 and meiotic recombination19, in which the binding of proteins to DNA is influenced both by the nucleotide sequence and by the shape of the DNA double helix20. DNA cleavage intensity is an effective index for predicting the shape of the DNA backbone and the width of the minor groove of genomic DNA at single-nucleotide resolution21,22. Since it was proposed by Tullius and Greenbaum23, it has been widely used to characterize the structural features of DNA elements such as functional noncoding regions24, nucleosomes25 and replication origins18. However, no detailed systematic analysis of the contribution of DNA structural properties to the generation of indels has been performed.

Because DNA cleavage intensity is related to DNA structure and to exposure/accessibility to DNA-binding enzymes, and indels are thought to be generated by DNA amplification errors, we hypothesized that the formation of indels may correlate with DNA cleavage intensity. In the present study, we therefore conducted a computational analysis of indel distribution with respect to DNA cleavage intensity. We found that the DNA cleavage intensity at the start and end points of indels was significantly lower than that in the surrounding regions. This pattern holds not only in human germline and somatic cells but also in the chimpanzee and mouse genomes, suggesting a model of indel formation related to DNA cleavage intensity. Our findings offer new clues for understanding the mechanisms of indel formation and suggest a new direction for improving indel detection algorithms.

Results

Cleavage intensity profile surrounding the small indels

Altogether, we collected 2,656,597 small human and mouse indels (see Methods). Their detailed numbers on individual chromosomes are listed in Table 1 and their length distributions are shown in Figure 1a. Average lengths of human germline indels, human somatic indels and mouse indels are 2 bps, 3 bps and 4 bps, respectively.

Table 1 The number of indels on human, mouse and chimpanzee chromosomes
Figure 1

Length distribution of indels.

(a) The length distribution of human germline (black), human somatic (red) and mouse (blue) small indels. Their average lengths are 2 bps, 3 bps and 4 bps, respectively. (b) The length distribution of large indels in the human (blue) and chimpanzee (red) genome. The average length of large indels is 840 bps for the human genome and 440 bps for the chimpanzee genome.

To investigate the structural properties of the regions surrounding these indels, we used ORChID220 to calculate the DNA cleavage intensity of the 200 bp sequences surrounding the indels, that is, from −100 bp to +100 bp relative to the indel start sites (position 0). The average cleavage intensity profile surrounding all the indels in the human genome is shown in Figure 2a and the profiles for individual chromosomes in Figure 2b (for clarity, the average cleavage intensity of each chromosome with its 95% confidence interval is shown in Supplementary Figure S1). The pattern is remarkably consistent across all chromosomes: cleavage intensity in the vicinity of indel start sites is significantly lower than at other positions (Student's t-test, p < 2.2 × 10^−22). A similar deep valley of very low cleavage intensity near indel start sites is also observed for the 16,742 human somatic indels (Figures 3a–b, Supplementary Figure S2) and the 1,439,788 mouse indels (Figures 4a–b, Supplementary Figure S3).
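The per-position averaging behind such a profile can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the profiles are synthetic, the window is assumed to span 201 positions (−100 to +100 inclusive), the confidence interval uses a normal approximation, and the name `average_profile` is hypothetical.

```python
import math
import random
import statistics


def average_profile(profiles):
    """Return (means, ci_half_widths) per position for equal-length profiles."""
    n = len(profiles)
    means, half_widths = [], []
    for column in zip(*profiles):  # one column per relative position
        mean = statistics.fmean(column)
        sem = statistics.stdev(column) / math.sqrt(n)
        means.append(mean)
        half_widths.append(1.96 * sem)  # normal approximation to the 95% CI
    return means, half_widths


# Synthetic data only: 500 noisy profiles with a simulated dip at position 0.
random.seed(0)
WIDTH = 201   # assumed window: -100 .. +100 inclusive
CENTER = 100  # list index of relative position 0
profiles = []
for _ in range(500):
    row = [random.gauss(1.0, 0.2) for _ in range(WIDTH)]
    row[CENTER] -= 0.5  # simulated valley at the indel start site
    profiles.append(row)

means, half_widths = average_profile(profiles)
valley_index = min(range(WIDTH), key=means.__getitem__)
print("lowest mean at relative position", valley_index - CENTER)
```

Plotting `means` with error bars of `half_widths` against the offsets −100 to +100 would give a figure of the same kind as Figure 2a.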

Figure 2

The average cleavage intensity profile of regions surrounding germline small indels in the human genome.

(a) Entire genome. The average cleavage intensity at each position from −100 bp to +100 bp relative to the indel start site is indicated by red rectangles. The blue bars represent the 95% confidence interval. (b) Individual chromosomes. The average cleavage intensity profiles for the regions from −100 bp to +100 bp relative to the indel start site on each human chromosome.

Figure 3

The average cleavage intensity profile of regions surrounding somatic small indels in the human genome.

(a) Entire genome. The average cleavage intensity at each position from −100 bp to +100 bp relative to the indel start site is indicated by red rectangles. The blue bars represent the 95% confidence interval. (b) Individual chromosomes. The average cleavage intensity profiles for the regions from −100 bp to +100 bp relative to the indel start site on each human chromosome.

Figure 4

The average cleavage intensity profile of regions surrounding small indels in the mouse genome.

(a) Entire genome. The average cleavage intensity at each position from −100 bp to +100 bp relative to the indel start site is indicated by red rectangles. The blue bars represent the 95% confidence interval. (b) Individual chromosomes. The average cleavage intensity profiles for the regions from −100 bp to +100 bp relative to the indel start site on each mouse chromosome.

As indels include both insertion and deletion mutations, the observed pattern of cleavage intensity in and around indels could be the average effect of the two types. An interesting question is whether the two types of indels have the same distribution patterns with respect to cleavage intensity as the pooled indels. To answer this question, we used the ancestral information provided by the 1000 Genomes Project26 to infer the directionality of indels and were able to annotate 185,234 insertions and 432,935 deletions among the human germline indels. The cleavage intensity profiles for the 200 positions from −100 bp to +100 bp relative to the start sites of these insertions and deletions are shown in Supplementary Figures S4–S7. Overall, cleavage intensities around the start sites of both insertions and deletions are also significantly lower than at the surrounding positions (Student's t-tests, p < 1.6 × 10^−22) and follow the same pattern as that of all small indels. However, compared with insertions, the contrast in cleavage intensity between the indel vicinity and the surrounding regions is less pronounced for deletions (Figures S4 and S6).
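The idea of polarising an indel with an ancestral allele can be sketched as below. This is an assumed, simplified rule rather than the annotation procedure of the 1000 Genomes Project: the function name `classify_indel` and the allele representation (reference, alternate and ancestral allele strings) are hypothetical, and any case in which the ancestral allele matches neither allele is left unclassified.

```python
def classify_indel(ref, alt, ancestral):
    """Return 'insertion', 'deletion' or None when direction cannot be inferred."""
    ref, alt, ancestral = ref.upper(), alt.upper(), ancestral.upper()
    if len(ref) == len(alt):
        return None  # not a length-changing variant
    if ancestral == ref:
        derived = alt
    elif ancestral == alt:
        derived = ref
    else:
        return None  # ancestral state missing or matching neither allele
    # A derived allele longer than the ancestral one means sequence was gained.
    return "insertion" if len(derived) > len(ancestral) else "deletion"


print(classify_indel("A", "ATG", "A"))    # derived allele is longer
print(classify_indel("A", "ATG", "ATG"))  # reference carries the derived, shorter allele
```

The second call illustrates why the ancestral state matters: a variant that looks like an insertion relative to the reference is a deletion if the longer allele is ancestral.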

Cleavage intensity profile surrounding large indels

Altogether, we obtained 10,599 and 5,822 large insertion indels in the human and chimpanzee genomes (see Methods), respectively. Detailed numbers for all the chromosomes are listed in Table 1 and length distributions are shown in Figure 1b. Average lengths of the large indels in the human and chimpanzee genomes are 840 bps and 440 bps, respectively.

We next analyzed the structural properties of the regions surrounding these large indels in both the human and chimpanzee genomes by calculating DNA cleavage intensity. The average cleavage intensity profiles for the positions from −100 bp to +100 bp relative to the start and end sites of the large indels in both genomes are shown in Figure 5. Similar to the pattern shown by small indels, cleavage intensities near the start and end sites of large indels were also significantly lower than at other positions (t-test, p < 1.7 × 10^−22). However, large indels show their own distinct pattern of cleavage intensity compared with small indels. Two valleys located at about +3 bp and +18 bp downstream of the start site were observed. Moreover, two valleys located at about −14 bp and −1 bp upstream of the end site of the large indels were also observed in both the human (Figure 5a) and chimpanzee genomes (Figure 5b).
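One simple way to locate such valleys in an averaged profile is to look for local minima that fall well below the overall mean. The sketch below is an illustration under assumed choices: the cutoff of two standard deviations, the function name `find_valleys` and the toy profile are all hypothetical and are not taken from the study.

```python
import statistics


def find_valleys(profile, offsets, n_sd=2.0):
    """Return offsets of local minima more than n_sd SDs below the profile mean."""
    mean = statistics.fmean(profile)
    sd = statistics.pstdev(profile)
    cutoff = mean - n_sd * sd
    valleys = []
    for i in range(1, len(profile) - 1):
        is_local_min = profile[i] < profile[i - 1] and profile[i] <= profile[i + 1]
        if is_local_min and profile[i] < cutoff:
            valleys.append(offsets[i])
    return valleys


# Toy profile: flat at 1.0 with two artificial dips placed at +3 and +18.
offsets = list(range(-100, 101))
profile = [1.0] * len(offsets)
profile[offsets.index(3)] = 0.6
profile[offsets.index(18)] = 0.7
print(find_valleys(profile, offsets))
```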

Figure 5

Cleavage intensity for regions surrounding the start and end site of large indels.

(a) The average cleavage intensity profile for the positions from −100 bp to +100 bp relative to the start (top left panel) and end (top right panel) site of the large indels in the human genome. (b) The average cleavage intensity profile for the positions from −100 bp to +100 bp relative to the start (bottom left panel) and end (bottom right panel) site of the large indels in the chimpanzee genome.

Cleavage intensity profile surrounding SNPs

As a control analysis, we randomly sampled 17,000 human SNPs from the UCSC genome database (hg19/snp138) and analyzed the cleavage intensity of the surrounding sequences (from −100 to +100 bps). The average cleavage intensity profile of SNPs is shown in Figure 6. In contrast to indels, the cleavage intensity at SNP sites is significantly higher than in the surrounding regions. Furthermore, we randomly picked 10,000 genomic positions and calculated the cleavage intensity of their surrounding sequences (from −100 to +100 bps). Figure 7 shows that the average cleavage intensity around random genomic positions fluctuates randomly with no strong pattern and is therefore dramatically different from that of indel regions (Figures 2–5) and SNP regions (Figure 6). Taken together, these results rule out the possibility that the lower cleavage intensity observed near the start or end sites of indels is due to sequence bias.
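Drawing random control positions can be sketched as follows. The sampling scheme here is an assumption (chromosomes weighted by length, positions kept at least 100 bp from either end so that a full window fits); the chromosome names and lengths are toy values and `sample_random_positions` is a hypothetical helper, not part of the original analysis.

```python
import random


def sample_random_positions(chrom_lengths, n, flank=100, seed=None):
    """Sample n (chrom, 0-based position) pairs, weighting chromosomes by length."""
    rng = random.Random(seed)
    names = list(chrom_lengths)
    # Usable length excludes the flanks at both ends of each chromosome.
    weights = [chrom_lengths[name] - 2 * flank for name in names]
    positions = []
    for _ in range(n):
        chrom = rng.choices(names, weights=weights, k=1)[0]
        pos = rng.randint(flank, chrom_lengths[chrom] - flank - 1)
        positions.append((chrom, pos))
    return positions


toy_lengths = {"chrA": 5000, "chrB": 2000}  # toy values, not real assemblies
controls = sample_random_positions(toy_lengths, n=10, seed=1)
print(controls[:3])
```

Each sampled position can then be treated like an indel start site (position 0) when computing the surrounding profile.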

Figure 6

Cleavage intensity for regions surrounding SNPs in the human genome.

The average cleavage intensity profile for the positions from −100 bp to +100 bp relative to SNPs.

Figure 7

Cleavage intensity for random genomic positions in the human genome.

The average cleavage intensity profile for the positions from −100 bp to +100 bp relative to random genomic positions.

Discussion

In this work, we examined the cleavage intensity profiles around 2,656,597 small indels and 16,421 large indels in the human, chimpanzee and mouse genomes. Small indels range from 1 to 50 bps and large indels from 80 to 12,000 bps. The indels obtained from the human 1000 Genomes Project26 and the mouse indels are expected to be enriched in germline indels, whereas the human somatic indels should be mostly somatic, as the majority of them were identified through various cancer projects26.

For small indels, the cleavage intensity profile shows a deep valley downstream of indel start sites (Figures 2–4 and Supplementary Figures S1–S3), and the cleavage intensity in the valley is significantly lower than at other positions. The pattern holds for both insertions and deletions. Interestingly, insertions and deletions show two major differences. First, the contrast in cleavage intensity between the indel vicinity and the surrounding regions is less pronounced for deletions than for insertions (Figures S4 and S6). Second, the average cleavage intensities of insertions (Figures S4–S5) are slightly higher than those of deletions (Figures S6–S7). To examine what may cause these differences, we ran RepeatMasker (http://www.repeatmasker.org) on the 200 bp regions (100 bp upstream and 100 bp downstream) around the indel sites to identify repetitive sequences and classified the insertions and deletions by the types of repeats they contain. If different types of repetitive sequences cause the different patterns seen in insertions and deletions, we expect a nonrandom distribution of these repeat types. Indeed, the results of the two-proportion z-test27 reported in Table 2 show that, compared with deletions, insertions are enriched in SINEs but depleted in LINEs, LTR retrotransposons, simple repeats and DNA elements. Therefore, for the indels that could be annotated as insertions or deletions, the difference in their cleavage intensity seems to be caused by the different repeat sequences.

Table 2 Results of the two-proportion z-test of repetitive elements in annotated insertions and deletions
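For readers unfamiliar with the test named in Table 2, a generic two-proportion z-test with a pooled standard error, z = (p1 − p2) / sqrt(p(1 − p)(1/n1 + 1/n2)), can be sketched as below. This is the textbook form of the test and may differ in detail from the procedure of ref. 27. The group sizes are the insertion and deletion counts reported above; the repeat counts are invented purely for illustration.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: the two proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


n_insertions, n_deletions = 185_234, 432_935  # group sizes from the text
hypothetical_sine_ins = 40_000                # invented count, illustration only
hypothetical_sine_del = 60_000                # invented count, illustration only
z, p = two_proportion_z_test(hypothetical_sine_ins, n_insertions,
                             hypothetical_sine_del, n_deletions)
print(f"z = {z:.2f}, p = {p:.3g}")
```

A positive z indicates a higher proportion in the first group (here, insertions); a negative z indicates depletion relative to the second group.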

Compared with that of small indels, the cleavage intensity profile of large indels shows a more complicated pattern: there are two valleys just downstream of indel start sites and two valleys just upstream of indel end sites (Figure 5). The patterns hold across chromosomes and species, regardless of whether the indels are somatic or germline. Our results therefore suggest that indel distributions are strongly associated with DNA cleavage intensity and that indels tend to occur in regions of low cleavage intensity.

The distinct structural difference, reflected by cleavage intensity, between regions in close proximity to indels and those further away provides new insight into indel generation mechanisms. It has been demonstrated that small indels are generated by strand slippage during DNA replication28,29. All known DNA polymerases can generate indels30 through DNA strand slippage during DNA synthesis. Although DNA polymerases can monitor and correct mutations using the proofreading mechanism, the efficiency of proofreading for indel mismatches varies with sequence context and structure28. It has been reported that many DNA polymerases monitor correct base-pairing through hydrogen bonds with the minor groove and van der Waals contacts with bases30. However, DNA sequences with abnormal geometry can result in steric clashes in and around the active site that preclude efficient catalysis30. Therefore, the observed rigidity at the start site of small indels may facilitate the template displacement involved in strand slippage initiation, as demonstrated by a recent theoretical model29, and may also prevent polymerases from binding to this region and thereby lower their proofreading efficiency.

Besides strand slippage, other mechanisms that generate small indels require single-stranded or double-stranded breaks and repair mechanisms such as break-induced replication, nonhomologous end joining and microhomology-mediated end joining31. All these processes require the action of different nucleases, primase and DNA synthesis, as well as the involvement of different nonreplicative, low-fidelity repair polymerases with very different error rates for incorporating a wrong base31,32,33. Therefore, the cleavage intensity differences between regions in close proximity to indels and those further away may favor the creation of single-stranded or double-stranded breaks and may also hinder the binding of nucleases, primase or polymerases to DNA, which is influenced by the shape of the DNA double helix.

It is also interesting to consider why the cleavage intensity is significantly lower at both the start and end points of large indels (Figure 5). One mechanism of large indel generation is the proliferation and illegitimate recombination of transposable elements34,35, which is clearly different from the mechanisms that generate small indels. The large indels considered in the present work are all associated with retrotransposons, which move around the genome by a "copy and paste" process36 (Polavarapu, et al. 2011). As shown in Figure 8, DNA at the target site is cut in an offset manner (like the "sticky ends" produced by some restriction enzymes) and, after the transposon is ligated to the host DNA, the gaps are filled in according to the Watson-Crick base pairing rule. In this process, identical direct repeats (DRs) are generated at each end of the retrotransposon. The distance (about 13 bps) between the two valleys of each pair observed at both ends of large indels (Figure 5) is in accordance with the average DR length of 13 bps37. Therefore, the observed rigidity at both ends of large indels may help the endonuclease to recognize and cut the target DNA.
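How a staggered cut followed by gap filling yields identical direct repeats at both ends of the inserted element can be illustrated with a toy string model. This is a deliberately simplified, single-strand sketch under stated assumptions: the sequences are made up, the 13 bp stagger is chosen only to match the average DR length quoted above, and `insert_with_target_site_duplication` is a hypothetical name.

```python
def insert_with_target_site_duplication(target, element, cut_top, stagger):
    """Insert element into target so the staggered region flanks it on both sides.

    cut_top is the 0-based index of the cut on the top strand; the cut on the
    bottom strand is assumed to lie `stagger` bases further downstream.
    """
    direct_repeat = target[cut_top:cut_top + stagger]
    upstream = target[:cut_top + stagger]  # ends with one copy of the repeat
    downstream = target[cut_top:]          # starts with the gap-filled second copy
    return upstream + element + downstream, direct_repeat


target = "AAAA" + "GATTACAGATTAC" + "TTTT"  # made-up target site
element = "CCCCCCCCCC"                      # made-up stand-in for a retrotransposon
product, dr = insert_with_target_site_duplication(target, element, cut_top=4, stagger=13)
print(dr)
print(product)
```

In the product, the 13 bp repeat appears immediately before and immediately after the inserted element, which is the arrangement drawn in Figure 8.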

Figure 8

Mechanism of large indel generation through retrotransposition.

The two red angles indicate the enzyme cut sites. DR denotes the identical direct repeats, shown as red regions. The brown region in the bottom panel is the large indel generated by the insertion of a retrotransposon.

Previous studies have shown that SNPs are preferentially distributed in nucleosome positioning regions, whereas indels seem to show different distribution patterns, but it has been unclear which DNA structural properties affect indel distribution38. Our study provides insight into this problem, revealing a strong pattern in which indels tend to be located in chromosomal regions with low cleavage intensity, whereas SNPs tend to be located in regions with high cleavage intensity (Figure 6). Considering that genomic regions with high cleavage intensity are prone to form nucleosomes25, the distinct cleavage intensity patterns of indels and SNPs may also be attributable to their different distributions relative to nucleosomes. We further conjecture that the DNA structural features reflected by cleavage intensity promote indel mutations in two ways, regardless of the indel generation mechanism (i.e., strand slippage, unequal crossing over, retrotransposition, etc.). First, because of the low cleavage intensity in and near the regions where indels appear, errors resulting in indels are difficult to fix, as hydroxyl radical activity is low in the region and enzymes cannot easily find and repair the errors. Second, also because of the low cleavage intensity, the DNA in and near indels is rigid, fragile and easy to break. Most of the possible mechanisms of indel generation involve DNA breaks, either single-stranded or double-stranded (e.g., the sticky double-stranded breaks during retrotransposition), and low cleavage intensity is necessary for, and facilitates, the break. The two valleys near both the start and end of large indels generated by retrotransposons (Figure 5) strongly support this conjecture.

Our findings suggest that cleavage intensity can assist the prediction and identification of indels. It is well known that indels pose great computational challenges to both short-read mapping and indel calling algorithms11, and indel calling can produce many false positives39. Given what we observed, cleavage intensity is a DNA structural feature worth considering when predicting or confirming the presence of indels, and indel calling tools could incorporate it as a feature for the training and classification of indels. In fact, cleavage intensity has already been incorporated into the prediction of a variety of biological properties, such as transcription factor binding sites40, eukaryotic core promoters17 and DNA replication origins18.
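As one possible way to turn this observation into a numeric feature for a classifier, the depth of the valley at a candidate site could be summarised as the difference between the mean intensity of the flanks and that of a small core around the site. The feature definition, the core half-width of 5 positions and the name `valley_depth` are assumptions made for illustration; they are not proposed or evaluated in this study.

```python
import statistics


def valley_depth(profile, center, core=5):
    """Mean flank intensity minus mean intensity within +/- core of center."""
    core_values = profile[center - core:center + core + 1]
    flank_values = profile[:center - core] + profile[center + core + 1:]
    return statistics.fmean(flank_values) - statistics.fmean(core_values)


# Toy profiles: one flat, one with an artificial dip around the candidate site.
flat = [1.0] * 201
dipped = list(flat)
dipped[95:106] = [0.5] * 11
print(valley_depth(flat, center=100), valley_depth(dipped, center=100))
```

A larger positive value corresponds to a more pronounced valley at the candidate site, so the score could be supplied alongside read-based evidence when training an indel classifier.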

Methods

Human and mouse small indel data

The Ensembl variation database stores different types of variants from different species, including single nucleotide polymorphisms (SNPs), small indels (i.e., indels shorter than 50 bps) and structural variants. However, indel information is limited to the human and mouse genomes. From the Ensembl database, we extracted small indels of the mouse genome and small somatic indels of the human genome. From the 1000 Genomes Project website, http://www.1000genomes.org/, we also obtained germline small indels of the human genome. To obtain a high-quality dataset, indels were selected according to the following two criteria: (1) indels with multiple annotations were discarded; (2) the selected indels are at least 100 bps apart from all others. Finally, we obtained 1,200,067 germline and 16,742 somatic indels in the human genome and 1,439,788 small indels in the mouse genome.
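The two selection criteria can be sketched as a small filter. The interpretation here is an assumption: "multiple annotations" is taken to mean more than one record at the same chromosome and start position, and the spacing check is applied to the records that survive the first criterion. The record layout `(chrom, start, indel_id)` and the name `filter_indels` are hypothetical.

```python
from collections import Counter


def filter_indels(indels, min_gap=100):
    """Keep indels with a unique (chrom, start) that are >= min_gap bp from neighbours."""
    # Criterion 1 (assumed meaning): drop sites annotated more than once.
    counts = Counter((chrom, start) for chrom, start, _ in indels)
    unique = sorted(rec for rec in indels if counts[(rec[0], rec[1])] == 1)

    # Criterion 2: drop every indel closer than min_gap to a neighbour
    # on the same chromosome.
    kept = []
    for idx, (chrom, start, indel_id) in enumerate(unique):
        far_from_prev = (
            idx == 0
            or unique[idx - 1][0] != chrom
            or start - unique[idx - 1][1] >= min_gap
        )
        far_from_next = (
            idx == len(unique) - 1
            or unique[idx + 1][0] != chrom
            or unique[idx + 1][1] - start >= min_gap
        )
        if far_from_prev and far_from_next:
            kept.append((chrom, start, indel_id))
    return kept


toy_indels = [
    ("chr1", 100, "a"), ("chr1", 100, "b"),  # same site twice: both dropped
    ("chr1", 500, "c"), ("chr1", 550, "d"),  # 50 bp apart: both dropped
    ("chr1", 1000, "e"), ("chr2", 1020, "f"),
]
print(filter_indels(toy_indels))
```

The spacing requirement keeps the ±100 bp window of one indel from containing the start site of another.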

Based on the reference genome sequences of humans (hg19) and mice (mm10) obtained from the UCSC genome database (http://genome.ucsc.edu/), 200 bp sequences, 100 bps upstream and 100 bps downstream of the start position of each indel, were extracted from the two reference genomes.
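Extracting the flanking sequence around an indel start position can be sketched as a simple slice. The coordinate convention is an assumption (0-based start, 100 bases before it and 100 bases from it onwards, giving 200 bp in total), sites too close to a chromosome end are skipped, and `extract_window` is a hypothetical helper operating on an in-memory sequence string rather than on the UCSC files.

```python
def extract_window(chrom_seq, start, flank=100):
    """Return the 2*flank bases around a 0-based start, or None near the ends."""
    left = start - flank
    right = start + flank
    if left < 0 or right > len(chrom_seq):
        return None  # window would run off the chromosome
    return chrom_seq[left:right]


toy_chrom = "ACGT" * 100  # 400 bp toy sequence, not a real chromosome
window = extract_window(toy_chrom, start=150)
print(len(window))
```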

The frequency of insertions and deletions and the frequency of frameshifting and non-frameshifting indels in human germline, human somatic and mouse small indels are shown in Supplementary Figures 8 and 9, respectively.

Human and Chimpanzee large indel data

The large indel (80 to 12,000 bps in length) data for the human and chimpanzee genomes were obtained from Polavarapu, et al.36. Most of these indels were generated by insertions associated with retrotransposons. Based on their data, we obtained 10,599 and 5,822 large insertion indels in the human and chimpanzee genomes, respectively. As these large indels were identified on different genome assemblies, for consistency the same versions used in the original study, human hg17 and chimpanzee PanTro2, were obtained from the UCSC genome database (http://genome.ucsc.edu/) for the downstream large indel analyses. Similarly, 200 bp sequences, 100 bps upstream and 100 bps downstream of the start position of each indel, were extracted from the two reference genomes.

The frequencies of frameshifting and non-frameshifting indels among the human and chimpanzee large indels are shown in Supplementary Figure 10.

Calculation of cleavage intensity

Cleavage intensity indicates the likelihood of DNA cleavage by hydroxyl radicals and provides a map of local variation in the shape of the DNA backbone. The lower the cleavage intensity, the more rigid the DNA. Cleavage intensity can be calculated from parameters for a set of tetranucleotides in a given DNA sequence. The parameters of the 4^4 (= 256) tetranucleotides were derived from experiments in which DNA sequences were exposed to hydroxyl radicals21. Recently, Bishop et al.20 developed the ORChID2 algorithm (http://dna.bu.edu/orchid/) to calculate DNA cleavage intensity according to the following equation21,

$$C_i = \frac{1}{4}\sum_{j=1}^{4} T_{i-j+1}(j)$$

where C_i is the cleavage intensity at position i, T_{i−j+1} is the hydroxyl radical cleavage intensity of the tetramer starting at position i−j+1 and j denotes the j-th nucleotide in that tetramer. The ends of the DNA are calculated similarly, except that cleavage data are retrieved from only one, two, or three tetramers, rather than four. Accordingly, we can compute the cleavage intensity for each nucleotide in a DNA sequence by using ORChID2. In this way, a DNA sequence is converted into a numerical sequence in which each nucleotide is represented by its DNA cleavage intensity.
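The sliding-tetramer calculation described above can be sketched as follows. This is not the ORChID2 implementation. It assumes that each tetramer carries one value per nucleotide and that the intensity at a position is the mean of the values contributed by every tetramer overlapping it (four in the interior, fewer at the ends). The parameter table is filled with random placeholder numbers, not the experimentally derived parameters, and `cleavage_profile` is a hypothetical name.

```python
import itertools
import random


def cleavage_profile(sequence, tetramer_table):
    """Average, per position, the values of all tetramers overlapping that position."""
    sequence = sequence.upper()
    n = len(sequence)
    if n < 4:
        raise ValueError("sequence must contain at least one tetramer")
    totals = [0.0] * n
    counts = [0] * n
    for start in range(n - 3):
        # A KeyError here means the tetramer contains a base other than A, C, G, T.
        values = tetramer_table[sequence[start:start + 4]]
        for j in range(4):  # j-th nucleotide of this tetramer
            totals[start + j] += values[j]
            counts[start + j] += 1
    return [total / count for total, count in zip(totals, counts)]


# Placeholder parameters only: random numbers standing in for the 256 tetramers.
random.seed(0)
placeholder_table = {
    "".join(bases): tuple(random.random() for _ in range(4))
    for bases in itertools.product("ACGT", repeat=4)
}
profile = cleavage_profile("ACGTTGCAACGT", placeholder_table)
print(len(profile), [round(value, 3) for value in profile[:4]])
```

With real parameters in place of `placeholder_table`, the returned list would be the numerical sequence that replaces the DNA sequence in the downstream profile analyses.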