Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and misleads downstream analysis. It is thus desirable to have a unified system for identifying and representing equivalent indels. Moreover, a unified system is also desirable to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which also can be used to compare different indel calling results. UPS-indel identifies 15% redundant indels in dbSNP, 29% in COSMIC coding, and 13% in COSMIC noncoding datasets across all human chromosomes, higher than previously reported. Comparing the performance of UPS-indel with existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP; 2,118 more in COSMIC coding, and 553 more in COSMIC noncoding indel dataset in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to state-of-the-art approaches for indel call set comparison demonstrates its clear superiority in finding common indels among call sets. UPS-indel is theoretically proven to find all equivalent indels, and thus exhaustive.
Indel stands for insertion or deletion of bases in a DNA sequence. As the second most common form of genetic variation, indels play an important role in genome and protein evolution. Due to artificial factors such as sequencing errors, ambiguous alignment of the reads, inconsistent ways of representing the same variant by different tools, the same mutation may be recognized as distinct variations occurring at different locations1,2,3. For example, consider a reference sequence AGGAAAGAAAGAAAGAAAGAG ranging from position 100285630 to 100285650 and two indels stored in dbSNP, rs147659011 (GAAA/+) and rs60376183 (AAGA/+), annotated to this region with positions 100285632 and 100285650, respectively. Although these indel mutations may indeed occur at different positions, they are biologically equivalent because they result in the same altered sequence AGGAAAGAAAGAAAGAAAGAAAGAG. Supplementary Table 1 shows another example of redundant insertion and deletion in dbSNP. Since many databases such as dbSNP, Database of Genomic Variants (DGV), and Ensembl combine indels resulting from large-scale studies, similar cases often exist in those databases, leading to a nonnegligible problem of data redundancy. In fact, about 10%4 of the human indels stored in dbSNP and 18%1 in Ensembl are redundant. Resolving the indel redundancy in major databases is important for subsequent genetics research. Nevertheless, this problem has not been given the attention it deserves.
Numerous approaches have been developed for systematic comparison of indels to determine equivalence and hence solve the redundancy problem. The “strict matching” approach matches two indels if they share the same position, reference, and alternate alleles in two different entries in the VCF file. However, as demonstrated in ref.3, this approach fails to find equivalent indels that are not identical. The “distance based approach” treats two indels as equivalent if both have the same length and occur within a certain distance such as ±5 bp5 or ±25 bp6. However, this approach introduces false positives when neighboring indels are not equivalent1 and misses equivalent indels that are farther apart than the distance cutoff. Clearly, selection of an optimal distance cutoff is a tradeoff of the two types of errors: smaller distance cutoffs result in a decreased false positive rate but an increased false negative rate.
To address the limitations of the two aforementioned approaches, the more widely used “normalization” approach attempts to solve the indel redundancy problem by left (or right) normalization, i.e., consistently shifting the start position of an indel to the left (or right) as long as the resulting sequence is the same as the one generated by the original mutation7. Tools using this type of variant normalization include vt normalize2, BCFtools8, and GATK LeftAlignAndTrimVariants9. These tools usually take a VCF file as input, output another VCF file with canonical VCF entries for the indels after normalization, and then perform “strict matching” to find equivalent indels with exactly the same canonical representation. The normalization approach generally performs well in identifying equivalent indels, but as shown in the result section, fails to normalize complex variants.
The positions of indels may get changed after left/right normalization, potentially misleading downstream analysis. For example, the deletion rs536379477 resides in the exon of the transcript ENST00000590192.1, but the equivalent deletion rs41436444 is in the intron of the same transcript. Therefore reporting these two indels with the same normalized position might lead to missing significant insight into genetic diseases or phenotypes of interest. Since the exact positions of most indel variations are not known, it is thus best to represent the indel of interest with a range of positions, within which equivalent indels can occur, rather than as a single normalized position. A similar idea was proposed by Krawitz et al.10.
This paper proposes UPS-indel, a universal positioning system for indels, whereby every indel variant is represented by a range of positions within which all equivalent indels can occur. This representation is added to the VCF file resulting in a UVCF file containing not only the original indel calling results, but also the complete representation of all equivalent indels. The advantage of adding this column of information to the existing VCF file is (1) the original VCF file structure is unchanged so the UVCF file is still compatible with many downstream programs, (2) the UPS-indel notation facilitates the comparison of indels from different VCF files, (3) for equivalent indels that overlap both coding and noncoding regions, having the range column in the indel calling output would allow a downstream indel annotation system to consider the range rather than a single position, possibly annotating both a coding and noncoding variant. In summary, this work extends the previous work of Krawitz et al.10 and Assmus et al.1 by a new coordinate system Universal Positioning System (UPS), a rigorous mathematical proof that all (deletion and insertion) equivalent indels are found, the handling of complex variants, and a simple modification of an input VCF file to produce an output UVCF file containing the indel equivalence information. Results show that UPS-indel identifies more redundant indels than the existing approaches, also enables a comparison between indel calling results produced by different indel callers, and performs better than other state-of-the-art approaches for finding indels in common among call sets.
Materials and Methods
This section defines some terms frequently used in this paper.
A sequence that is produced by introducing a specific indel to the reference sequence at a specific position. This is also known as the mutant sequence.
Let R be the reference sequence and p be either an insertion or a deletion of a given length that occurs at a given position in the reference sequence. The alternate sequence for insertion is denoted by R′ I = R + p and for deletion by R′ D = R − p.
Two indels are considered equivalent if and only if they produce the same alternate sequence. Note that equivalent indels must be of the same type (insertion and deletion) and same length.
Equivalent indels that are reported as distinct entries in a VCF file are defined as redundant indels.
Region of Equivalence
This is defined as the range of positions in the reference sequence where equivalent indels occur.
A permutation (y 0 , y 1, y 2, …, y n−1) = f(x 0 , x 1, x 2, …, x n−1) where y i = x (i+k)mod n for 0 ≤ i ≤ n − 1, k can be positive (left cyclic) or negative (right cyclic). For example: for a string “ATCG”, the left cyclic permutations are TCGA, CGAT, and GATC; the right cyclic permutations of this string are GATC, CGAT, and TCGA.
Table 1 shows an example of equivalent indels. Observe that all equivalent indels are cyclic permutations of each other (e.g., a cyclic permutation of CT is TC and cyclic permutations of TGT are GTT and TTG) and equivalence continues until there is a mismatch (see Supplementary Table 2). This observation leads to the following theorem.
Theorem 1: All equivalent indels in the region of equivalence are cyclic permutations of each other.
Proof: Consider two equivalent indels d 1 and d 2 and the equivalence region R they define.
For insertion within R, the alternate sequences are
d 1 S = Sd 2
for some nonempty S. For deletion within R, the alternate (possibly empty) sequence is S starting with d 1 S = Sd 2 .
Case 1. For |S| < |d 1 |, d 1 = SX for nonempty X and d 1 S = SXS = Sd 2 implies d 2 = XS is a cyclic permutation of d 1 = SX.
Case 2. For |S| = |d 1 |, d 1 = d 2 = S.
Case 3. For |S| > |d 1 |, S = d 1 X for nonempty X with |X| < |S|, and d 1 d 1 X = d 1 S = Sd 2 = d 1 Xd 2 implies d 1 X = Xd 2 . Repeating this argument for d 1 X = Xd 2 eventually reduces X to one of the previous two cases.
Another case for deletion is when R is periodic with period |d 1 |, having the form
R = d 1 d 1 …….. d 1 (d 1 )1 where (d 1 )1 is the first symbol of d 1 . Then every consecutive subsequence d 2 of R with |d 1 | = |d 2 | is an equivalent deletion, and d 2 is a cyclic permutation of d 1 . (Q.E.D).
Corollary. For |S| > |d 1 |, S must have the form d 1 d 1 …….*…… d 2 d 2 with an equal number of d 1 s and d 2s.
Figure 1 illustrates Theorem 1 with two examples. Based on the theorem, an algorithm called UPS-indel (see Table 2) exhaustively increases the range of equivalence as far as possible in both left and right directions from a given indel position. Finally for each indel in the VCF file, the algorithm reports its range of equivalence, which is called the Universal Positioning System coordinate (UPS-coordinate). Once indels are represented by their UPS-coordinates, identifying redundant indels becomes a trivial task of string comparison (e.g., Fig. 2(A), comparison across the 8th column). Note that since UPS-indel implements Theorem 1, which characterizes indels within an equivalence region, UPS-indel is exhaustive, finding all equivalent indels.
Note that “left” and “right” cyclic permutations are equivalent – there is no difference. In line 2 of the UPS-indel algorithm (Table 2), while extracting the “pattern” from the entries of the RFE and ALT columns of the input VCF file, UPS-indel performs horizontal decompositions of the complex variants and assigns the indel part as the value of pattern. For example, suppose in the REF column of a VCF entry there is an allele “ATAA” and in the ALT column there is an allele “AG”. In this case, UPS-indel performs horizontal decompositions of the complex variants to produce two separate entries (AT → AG and AA →*meaning that there is a deletion of AA).
UPS-indel is written in C++ and can run on Linux, Windows, or Mac operating systems that have a C++ compiler. The command line version of UPS-indel is available at https://github.com/shabbir005/ups-indel. UPS-indel uses SeqAn, an open source C++ library containing efficient algorithms and data structures to analyze large genome sequences11. The input to UPS-indel is a reference chromosome sequence, a VCF file containing a list of indels, an output file name, and a flag to enable/disable horizontal decomposition, for example,
./ups_indel example/chr1.fa example/chr1.vcf example/chr1.uvcf –hd=true.
This command line produces an output file named chr1.uvcf, containing the UPS-coordinates of all the indels in chr1.vcf when horizontal decomposition is enabled. Figure 2(A) shows an example UVCF file.
The UVCF file keeps the same content/format as the VCF file, with an additional column that contains the indel’s UPS-coordinate information. The interpretation of the UPS-coordinate follows:
Symbols + and − denote insertion and deletion, respectively, followed by the base pairs inserted/deleted from the reference, and the UPS-coordinate (in square brackets).
The UPS-coordinate contains a range of positions in the square brackets representing the region of equivalence for the indel. For example, the UPS-coordinate + CTTC [62298 − 62302] means there is an insertion of CTTC at position 62298, and the same alternate sequence can be produced by inserting TTCC at position 62299, or TCCT at position 62300, and so on.
Once indels are represented by the coordinates produced by UPS-indel, one can easily identify redundant indels within one indel call set or multiple indel call sets. For example, the following command line
./ups_generate_redundant_indel_list example/chr1.uvcf example/redundant_indel_list.txt
produces a list of indel groups containing dbSNP IDs of redundant indels (Fig. 2(B)).
UPS-indel groups all redundant indels together. For example, consider a group [rs34748242, rs59148039] with the UVCF entry shown in Table 3. These two indels belong to the same indel type (insertion), have same base pairs inserted (TG), and share the same UPS-coordinate and hence they are considered as equivalent.
UPS-indel can compare multiple indel call sets. This utility is particularly useful for generating a high-confidence indel call set by taking the intersection of the results of different indel callers12, or merging the indel calling results from different tools for a consensus variant caller13, or comparing indel call sets generated by different indel callers to determine their relative recall, precision, and accuracy, and to understand the source of their dissimilarities. To use this utility of UPS-indel, after converting two VCF files to UVCF files, one can use the following command to get the comparison result (Fig. 2(C)), which contains useful statistics for downstream analysis:./ups_compare_uvcf_files example/sample1.uvcf example/sample2.uvcf example/comparison_result.txt
A utility for UPS-indel can produce a filtered UVCF file after removing redundant indels. The following command, for example, can get the filtered UVCF file named out_filtered.uvcf containing nonredundant indels:
java GenerateFilteredUVCFFileAfterRemovingRedundantIndel example/out.uvcf example/redundant_indel_list.txt
UPS-indel is compared with other existing tools that also find equivalent indels through variant normalization. These tools include vt normalize (version 0.5)2, BCFtools (version 1.3)8, and GATK LeftAlignAndTrimVariants (version 3.5)9. Like UPS-indel, all of these tools take a VCF file and the reference genome as input and produce the normalized position of the indels in the VCF file. Another tool Vindel4 also finds equivalent indels using a heuristic approach, but was not included in the comparison as it uses a flat file as input instead of a VCF file.
A VCF file of dbSNP (version 142, GRCh37p13) and the GRCh37 reference genome were used as the inputs to these tools. The VCF file contains both SNPs and indels, and VCFtools14 (Version 0.1.14) is used to extract indels from the VCF file. The comparison was extended to the COSMIC dataset as well.
There are other tools that could also be considered for comparison. Both VarMatch3 and RTGTools15 use a branch and bound algorithm to search for equivalent indels. They are not suitable for processing population-scale indel call sets such as dbSNP and COSMIC because densely packed indels in such datasets make the search space too large to be processed by a branch and bound algorithm3. READDI16 considers repeat-induced ambiguities as well as tool-induced inaccuracies while searching for equivalent deletions using the longest common extension algorithm. This tool is limited to finding deletions only, and hence not included in the comparison for the dbSNP and COSMIC datasets. Nevertheless, in this study a smaller dataset is used to compare UPS-indel with VarMatch (Version available on April 5, 2017), RTGTools (Version 3.7.1), and READDI (Version available on April 5, 2017).
Results and Discussion
Finding equivalent indels in the dbSNP dataset
The input VCF file contains about 8.9 million indels from the human genome. For this input, UPS-indel produces the UVCF file and the other three tools, vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants, generate the normalized VCF file. These three tools perform left normalization of indels and output a left normalized representation. Therefore, for these three tools, two indels are equivalent if and only if they satisfy the following conditions:
Both indels are of the same type (insertion or deletion).
Both indels share the same pattern after normalization: [value of the REF column in the normalized VCF file – value of the ALT column in the normalized VCF file – value of the POS column in the normalized VCF file]. Note that one might think that considering the position should suffice, because after normalization, equivalent indels should have the same position in the VCF file. However, the example in Table 4 shows that indels rs371246544 and rs71724031 have the same normalized position but are not equivalent.
where the numerator is the total number of redundant indels reported since only one indel from each redundant indel group should be reported in the output and the remaining should be considered as redundant.
Figure 4 shows the comparison of the redundant indel ratios reported by UPS-indel, vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants for indels in the dbSNP dataset. For the entire human genome, UPS-indel identified ~15% redundant indels (see Supplementary Table 3 and Supplementary Figure 1 for chromosome-wise comparison), as compared to 11.82% by vt normalize, 11.82% by BCFtools, and 11.81% by GATK LeftAlignAndTrimVariants. At the chromosome level, UPS-indel identified about 3% more redundant indels than the other three tools.
Examining the sets of redundant indels detected by UPS-indel and the other tools shows that vt normalize and BCFtools produce exactly the same results for all chromosomes. Moreover, all the redundant indels detected by vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants are also detected by UPS-indel, as shown in Fig. 5. Further, for all chromosomes, UPS-indel identified a total of 456,352 more redundant indels than the other tools. As proved in the methods, UPS-indel identifies all the redundant indels, the comparison result shows that the other three tools are not exhaustive in finding all the redundant indels.
Why are several indels found as redundant by UPS-indel but not by other tools? An investigation shows that these equivalent indels are missed by the other tools because, due to the computation time limit, they cannot exhaustively search every cyclic permutation at every feasible position as is done by UPS-indel. For example, long multiallelic indels are not considered by default for normalization. Had the tools considered these indels separately, they would have been able to find an equivalent indel located at a different position. For this situation, UPS-indel splits the VCF entry into multiple entries by default and considers each of the indels separately while finding redundant indels. Table 5 provides such an example.
For the indel shown in Table 5 (panel A), no normalization was done by vt normalize, BCFtools, or GATK LeftAlignAndTrimVariants. UPS-indel splits the entry into three indels and finds the UPS-coordinate for each of them separately (Table 5, panel B). Splitting the VCF entry and considering the indels separately, UPS-indel managed to find another indel equivalent to one of the indels (Table 5, panel C). Therefore UPS-indel reports indels with id rs374587598 and rs60022176 as redundant.
In addition to the scenario mentioned above, GATK LeftAlignAndTrimVariants does not normalize any of the multiallelic indels regardless of the size which is also mentioned in ref.2. Table 7 shows an example of this occurrence explaining why GATKLeftAlignAndTrimVariants finds fewer number of redundant indels than vt normalize and BCFtools.
One might think that decomposing multiallelic indels into several biallelic indels produces the same results as UPS-indel for the normalization tools. To check this, the “decompose” utility of vt was used to perform a vertical decomposition of multiallelic indels into biallelic indels. Applying vt normalize to the decomposed indels could not find equivalent indels for complex variants, whereas UPS-indel is able to find the equivalent indels. Table 8 shows an example of this occurrence. Since vt normalize and BCFtools produce exactly the same results, these complex variants are missed by BCFtools as well.
In the example shown in Table 8, VCF entries for the indels with ids rs369694942 and rs373243812 remain the same in the input and the output for vt normalize (Panel A), i.e., no normalization is done. Here the second indel (rs373243812) is a complex variant containing both a SNP (A → G) and a deletion of length three (AAA), and is ignored by vt normalize. However, UPS-indel performs a horizontal decomposition of the complex variant to produce two separate entries (AA → AG and AAA →*) and finds the equivalent indel with id rs369694942 having a deletion of length three (AAA) in the UPS-coordinate 2273132 to 2273147 (Panel B).
As defined in ref.17, complex variants come in two forms: (1) MNP and (2) clumped indel. In MNP, (1) the lengths of the reference and alternate sequences are greater than one and (2) the nucleotides involved in the two sequences differ. In clumped indel, on the other hand, there is a clumping of nearby variants and the sequences need not involve different base pairs. Table 9 shows an example of MNP and clumped indel found in dbSNP (version 142).
UPS-indel decomposes complex variants based on the best predicted outcome using the Needleman-Wunsch algorithm. This approach was used by Li et al.18 to decompose complex variants into individual events.
Finding equivalent indels in the COSMIC dataset
UPS-indel was used to find redundant indels in the COSMIC (Catalogue Of Somatic Mutations In Cancer) dataset, the world’s most comprehensive resource for exploring the impact of somatic mutations in human cancer19. With data collected for more than 2,500 human cancers, this archive describes millions of coding mutations, noncoding mutations, and other gene expression variants across the human genome.
For all chromosomes in the COSMIC dataset, UPS-indel identified 28.17% and 13.11% redundant indels in the COSMIC coding and noncoding indel datasets, respectively, which are higher than the redundant indel ratios reported by the other tools. Figure 6 shows the comparison of the redundant indel ratios reported by UPS-indel, vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants for both the COSMIC coding and noncoding datasets. Comparisons for chromosome-wise redundant indel ratios among the tools are given in Supplementary Materials (See Table 4 and Fig. 2 for COSMIC coding and Table 5 and Fig. 3 for noncoding indels).
Similarly, examining the sets of redundant indels identified by the tools, Fig. 7 shows that for both the COSMIC coding and noncoding indels, UPS-indel identified all the redundant indels detected by the other tools. In addition to that, for the whole genome, 2,118 (Fig. 6A) and 553 (Fig. 6B) unique redundant indels for COSMIC coding and noncoding indels, respectively, are detected by UPS-indel but missed by other tools.
As for dbSNP, the reason why some COSMIC coding and noncoding indels were considered as redundant by UPS-indel but missed by other tools is that, in the normalized VCF for these other tools, redundant indels must contain the same pattern: [value of the REF column in the normalized VCF file – value of the ALT column in the normalized VCF file – value of the POS column in the normalized VCF file]. The reason for this pattern match restriction was given earlier. In Table 10, all tools except UPS-indel missed the indel with id COSM5068028 in the redundant indel group consisting of indels with id COSM3732389 and id COSM5348791, because of not having the same pattern. Therefore it might be assumed that only normalized position should be considered to group them together. However, then the indel with id COSM3685916 would be placed in the same group, although it is a deletion type indel whereas the others are insertion type indels, and also the resultant sequences are different. UPS-indel groups the indels correctly by placing indels with ids COSM5068028, COSM3732389, and COSM5348791 in the same redundant indel group as they have the same base pair inserted, have the same region of equivalence, and also are of the same indel type.
GATKLeftAlignAndTrimVariants found fewer redundant indels than other tools because GATKLeftAlignAndTrimVariants does not consider very large indels for normalization. For example, the indels with ids COSM5196837 and COSM5066846, which are deletions of length 371 bps and 222 bps, respectively, are not considered by GATKLeftAlignAndTrimVariants for normalization. The reason is that GATK LeftAlignAndTrimVariants uses 200 bps as the default size of the sliding window on the reference (the parameter–reference_window_stop) while left aligning the alleles which is smaller than the length of the missed deletions.
These tools are also compared in terms of average running time taken to process the whole genome VCF file of dbSNP (version 142, size 16.7GB) for normalization (by vt normalize, BCFtools, and GATKLeftAlignAndTrimVariants) or for generating the UPS-coordinate (by UPS-indel). All tools were run on a desktop computer having an Intel Core i7-2600 CPU with eight cores (at 3.40 GHz) and 16GB of RAM. Among these tools vt normalize is the fastest, taking 12 minutes 27 seconds, followed by BCFtools (12 minutes 35 seconds), UPS-indel (17 minutes 10 seconds), and GATK LeftAlignAndTrimVariants (21 minutes 55 seconds). Since UPS-indel searches for equivalent indels exhaustively and is theoretically rigorous, the computation time is a little higher than that for other heuristic normalization tools such as vt normalize and BCFtools.
Evaluating UPS-indel’s performance in comparing different indel call sets
In genomic research related to indel calling, an important step in downstream analysis is to compare multiple indel call sets for (1) generating a highly accurate benchmark indel call set by taking the intersection of multiple call sets as done by Zook et al.12 for the sample NA12878, (2) merging the call sets of different indel callers in a consensus caller as done by Trubetskoy et al.13 for exome data, and (3) evaluating the accuracy of a newly proposed indel calling tool by comparing its indel call set with the benchmark call set. Comparing different indel call sets is also a common step in studies comparing the performance of different indel callers as done in refs5,20,21. Different indel callers having different representations of the same indel complicates the comparison of different indel call sets. In addition to strict matching of indels, as mentioned earlier, a naïve but previously commonly used approach to compare multiple indel calling results is based on a simple distance criterion, that is, indels are considered to be equivalent if they are within a distance threshold (e.g., ± 5 bp or ± 25 bp). For example, the original 1000 Genomes project used ± 25 bp to compare multiple indel calling results6. To illustrate the advantage of using a UVCF file instead of a distance criterion or normalized VCF for comparing multiple VCF files, the alignment file for chromosome 11 of a single sample (HG00851) was picked up from the 1000 Genomes project and five indel callers: Dindel22 (Version 1.0.1), GATK Unified Genotyper9 (Version 3.4), GATK Haplotype Caller (Version 3.4), Platypus23 (Version 0.7.9.1), and Pindel24 (Version 0.2.5) were used to produce VCF files for indels. The resultant VCF files were compared to determine the number of common indels from these five tools using three different approaches, namely a distance based approach, comparing the VCF files normalized by vt normalize and GATK LeftAlignAndTrimVariants, and comparing the UVCF files produced by UPS-indel. For the distance based approach, two indels are considered equivalent if (1) they belong to the same indel type (either both are insertion type or both are deletion type), (2) have the same base pairs inserted/deleted, and (3) are in close proximity (within ± 5 bps from each other). For the normalized VCF files and UVCF files, the same approach was used as discussed earlier for finding redundant indels.
First the VCF files produced by the five indel calling tools were compared to find overlap among them to determine the number of common indels using the distance based approach. In the second step, the VCF files of the five indel calling tools were normalized using vt normalize and GATK LeftAlignAndTrimVariants separately. For this sample, both normalization tools produced the same normalized VCF files. The normalized VCF files of five indel calling tools were compared to determine the common number of indels. Finally, UPS-indel was used to produce the UVCF files for the five indel calling tools and these UVCF files were compared to determine the common number of indels.
The result shows that the distance based approach found 584 indels in common from the five indel calling tools while 5,514 and 5,575 common indels were found by the normalized VCF and UPS-indel UVCF approaches, respectively. This demonstrates the better suitability of UPS-indel, compared to distance based or existing normalization based approaches, for comparing multiple VCF files. Further investigation revealed that indels that are missed by the normalization tools are complex variants that are skipped by normalization tools but processed by UPS-indel. Note that this small number (61) of common indels identified by UPS-indel, but missed by the normalization tools, is based on a single chromosome of a single sample only, and much better performance of UPS-indel would be expected for the whole genome, as observed for the dbSNP and COSMIC datasets.
As mentioned earlier, the tools VarMatch3, RTG Tools15, and READDI16 are also used for comparing indel call sets. However, VarMatch and RTG Tools, which use a branch and bound algorithm, are not suitable for population-scale indel call sets like dbSNP and COSMIC due to densely packed indels in those call sets3. READDI processes deletions only. These tools are compared with UPS-indel (using the deletion call set of Platypus containing 14,438 deletions for chromosome 11 of the above mentioned single sample from the 1000 Genomes project as the baseline) on the deletion call sets of Dindel, GATK Unified Genotyper, GATK Haplotype Caller, and Pindel as the query call set to check overlap with the baseline. Table 11 shows the comparison for finding the number of true positives.
Table 11 shows that in all cases except for Pindel, UPS-indel finds more common indels than the state-of-the-art tools when comparing multiple indel call sets. These tools are heuristic and therefore ignore indels that violate a particular heuristic criterion. For example, READDI searches for equivalent indels in an indel’s neighboring region defined by the neighborhood size, and RTG Tools uses a cutoff strategy when the search space is too large. UPS-indel, on the other hand, exhaustively searches for and finds all equivalent indels, thus finds more common indels than the aforementioned tools. Why there are true positives identified by RTG Tools but missed by UPS-indel is because the current version of UPS-indel doesn’t consider haplotypes formed by the combination of neighboring indels. Consider the following example in Table 12.
In the Table 12 example, D3 is equivalent to D1D2. UPS-indel doesn’t consider the combination of D1 and D2 but rather considers them separately. Therefore, UPS-indel is not able to discover the aforementioned equivalence of the resultant haplotypes. This explains why UPS-indel missed 57 out of 9,893 deletions (where the deletions in a combination were separated by at most 45 bps) in the Pindel call set that are identified as equivalent by RTG Tools as shown in Table 11. Nevertheless there are 103 indels that are found as true positive by UPS-indel but missed by RTG Tools.
This paper describes UPS-indel, a user friendly tool that creates a universal positioning system called UPS-coordinates for all indels listed in a VCF file, and exhaustively finds all equivalent indels. The UPS-coordinate is a range of positions where all indels equivalent to a specific indel can occur. Since equivalent indels produce the same mutant sequence and thus have the same biological effect, reporting them as separate indels causes data redundancy and may artificially inflate the statistics of indel variations. Under the proposed universal positioning system, all equivalent indels have the same UPS-coordinate which avoids possible annotation ambiguity. Therefore, by checking the UPS-coordinate, one can easily filter out redundant indels from variant databases. UPS-indel is robust enough to handle complex variants and is able to detect more redundant indels than the currently existing approaches. UPS-indel could be widely used for easy and accurate systematic comparison of indels generated by different indel calling programs or deposited in databases. By eliminating the indel redundancy issue, this work offers the community the proposed universal positioning system to represent indels (so as to avoid ambiguity), which can greatly improve various downstream genomic analyses related to indels.
Availability of data and materials
The latest version of dbSNP VCF file can be found here: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/. VCF file for the COSMIC coding mutation is available at http://grch37-cancer.sanger.ac.uk/cosmic/files?data=/files/grch37/cosmic/v78/CosmicCodingMuts.vcf.gz and noncoding mutation dataset is available at http://grch37-cancer.sanger.ac.uk/cosmic/files?data=/files/grch37/cosmic/v78/CosmicNonCodingVariants.vcf.gz. All of these VCF files contain SNPs, Indels, and other types of genetic variants. To extract only indels, we used VCFtools which is available at http://vcftools.sourceforge.net/. The command line version of UPS-indel is available at https://github.com/shabbir005/ups-indel with the instruction of how to install and use UPS-indel.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Assmus, J., Kleffe, J., Schmitt, A. O. & Brockmann, G. A. Equivalent indels–ambiguous functional classes and redundancy in databases. PLoS One 8, e62803 (2013).
Tan, A., Abecasis, G. R. & Kang, H. M. Unified representation of genetic variants. Bioinformatics 31, 2202–2204 (2015).
Sun, C. & Medvedev, P. VarMatch: robust matching of small variant datasets using flexible scoring schemes. Bioinformatics, btw797 (2016).
Li, Z., Wu, X., He, B. & Zhang, L. Vindel: a simple pipeline for checking indel redundancy. BMC Bioinformatics 15, 359 (2014).
Hasan, M. S., Wu, X. & Zhang, L. Performance evaluation of indel calling tools using real short-read data. Human Genomics 9, 1–14 (2015).
Consortium, T. G. P. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
Fang, H. et al. Indel variant analysis of short-read sequencing data with Scalpel. bioRxiv, 028050 (2015).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491–498 (2011).
Krawitz, P. et al. Microindel detection in short-read sequence data. Bioinformatics 26, 722–729 (2010).
Döring, A., Weese, D., Rausch, T. & Reinert, K. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 9, 11 (2008).
Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology 32, 246–251 (2014).
Trubetskoy, V. et al. Consensus Genotyper for Exome Sequencing (CGES): improving the quality of exome variant genotypes. Bioinformatics, btu591 (2014).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Cleary, J. G. et al. Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines. bioRxiv, 023754 (2015).
Wittler, R., Marschall, T., Schönhuth, A. & Mäkinen, V. Repeat-and error-aware comparison of deletions. Bioinformatics 31, 2947–2954 (2015).
Variant Classification, http://genome.sph.umich.edu/wiki/Variant_classification (2014).
Li, H. Towards better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, btu356 (2014).
Forbes, S. A. et al. COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Research 43, D805–D811 (2015).
Neuman, J. A., Isakov, O. & Shomron, N. Analysis of insertion–deletion from deep-sequencing data: software evaluation for optimal detection. Briefings in Bioinformatics 14, 46–55 (2013).
Highnam, G. et al. An analytical framework for optimizing variant discovery from personal genomes. Nature Communications 6 (2015).
Albers, C. A. et al. Dindel: accurate indel calls from short-read data. Genome research 21, 961–973 (2011).
Rimmer, A. et al. Integrating mapping-, assembly-and haplotype-based approaches for calling variants in clinical sequencing applications. Nature genetics 46, 912–918 (2014).
Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).
Chen, H. & Boutros, P. C. VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinformatics 12, 1 (2011).
The authors thank anonymous reviewers, R. Lim from Memorial Sloan Kettering Cancer Centre, and S. Tithi and V. Vijayan from Virginia Tech for their constructive comments and suggestions. This material is based on research sponsored by Air Force Research Laboratory under agreement number FA8650-09-2-3938. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government. Publication of this article was funded by the Virginia Tech’s Open Access Subvention Fund.