UPS-indel: a Universal Positioning System for Indels

Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and misleads downstream analysis. It is thus desirable to have a unified system for identifying and representing equivalent indels. Moreover, a unified system is also desirable to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which also can be used to compare different indel calling results. UPS-indel identifies 15% redundant indels in dbSNP, 29% in COSMIC coding, and 13% in COSMIC noncoding datasets across all human chromosomes, higher than previously reported. Comparing the performance of UPS-indel with existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP; 2,118 more in COSMIC coding, and 553 more in COSMIC noncoding indel dataset in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to state-of-the-art approaches for indel call set comparison demonstrates its clear superiority in finding common indels among call sets. UPS-indel is theoretically proven to find all equivalent indels, and thus exhaustive.


Background
Indel stands for insertion or deletion of bases in a DNA sequence. As the second most common form of genetic variation, indels play an important role in genome and protein evolution. Due to artificial factors such as sequencing errors, ambiguous alignment of the reads, inconsistent ways of representing the same variant by different tools, the same mutation may be recognized as distinct variations occurring at different locations [1][2][3]. For example, consider a reference sequence AGGAAAGAAAGAAAGAAAGAG ranging from position 100285630 to 100285650 and two indels stored in dbSNP, rs147659011 (GAAA/+) and rs60376183 (AAGA/+), annotated to this region with positions 100285632 and 100285650, respectively. Although these indel mutations may indeed occur at different positions, they are biologically equivalent because they result in the same altered sequence AGGAAAGAAAGAAAGAAAGAAAGAG. Since many databases such as dbSNP, Database of Genomic Variants (DGV), and Ensembl combine indels resulting from large-scale studies, similar cases often exist in those databases, leading to a nonnegligible problem of data redundancy. In fact, about 10% [4] of the human indels stored in dbSNP and 18% [1] in Ensembl are redundant. Resolving the indel redundancy in major databases is important for subsequent genetics research. Nevertheless, this problem has not been given the attention it deserves.
Numerous approaches have been developed for systematic comparison of indels to determine equivalence and hence solve the redundancy problem. The "strict matching" approach matches two indels if they share the same position, reference, and alternate alleles in two different entries in the VCF file. However, as demonstrated in [3], this approach fails to find equivalent indels that are not identical. The "distance based approach" treats two indels as equivalent if both have the same length 3 and occur within a certain distance such as ± 5 bp [5] or ± 25 bp [6]. However, this approach introduces false positives when neighboring indels are not equivalent [1] and misses equivalent indels that are farther apart than the distance cutoff. Clearly, selection of an optimal distance cutoff is a tradeoff of the two types of errors: smaller distance cutoffs result in a decreased false positive rate but an increased false negative rate.
To address the limitations of the two aforementioned approaches, the more widely used "normalization" approach attempts to solve the indel redundancy problem by left (or right) normalization, i.e., consistently shifting the start position of an indel to the left (or right) as long as the resulting sequence is the same as the one generated by the original mutation [7]. Tools using this type of variant normalization include vt normalize [2], BCFtools [8], and GATK LeftAlignAndTrimVariants [9]. These tools usually take a VCF file as input, output another VCF file with canonical VCF entries for the indels after normalization, and then perform "strict matching" to find equivalent indels with exactly the same canonical representation. The normalization approach generally performs well in identifying equivalent indels, but as shown here, fails to normalize complex variants.
The positions of indels may get changed after left/right normalization, potentially misleading downstream analysis. For example, the deletion rs536379477 resides in the exon of the transcript ENST00000590192.1, but the equivalent deletion rs41436444 is in the intron of the same transcript.
Therefore reporting these two indels with the same normalized position might lead to missing significant insight into genetic diseases or phenotypes of interest. Since the exact positions of most indel variations are not known, it is thus best to represent the indel of interest with a range of positions, within which equivalent indels can occur, rather than as a single normalized position. A similar idea was proposed by Krawitz et al. [10]. This paper proposes UPS-indel, a universal positioning system for indels, whereby every indel variant is represented by a range of positions within which all equivalent indels can occur. This representation is added to the VCF file resulting in a UVCF file containing not only the original indel calling results, but also the complete representation of all equivalent indels. The advantage of adding this column of information to the existing VCF file is (1) the original VCF file structure is unchanged so the UVCF file is still compatible with many downstream programs, (2) the UPS-indel notation facilitates the comparison of indels from different VCF files, (3) for equivalent indels that overlap both coding and noncoding regions, having the range column in the indel calling output would allow a downstream indel annotation system to consider the range rather than a single position, possibly annotating both a coding and noncoding variant. In summary, this work extends the previous work of Krawitz et al. [10] and Assmus et al. [1] by a new coordinate system Universal Positioning System (UPS), a rigorous mathematical proof that all (deletion and insertion) equivalent indels are found, the handling of complex variants, and a simple modification of an input VCF file to produce an output UVCF file containing the indel equivalence information. Results show that UPS-indel identifies more redundant indels than the existing approaches, also enables a comparison between indel calling results produced by different indel callers, and performs better than other state-of-the-art approaches for finding indels in common among call sets.

Materials and Methods
This section defines some terms frequently used in this paper.
Alternate Sequence: A sequence that is produced by introducing a specific indel to the reference sequence at a specific position. This is also known as the mutant sequence.
Let R be the reference sequence and p be either an insertion or a deletion of a given length that occurs at a given position in the reference sequence. The alternate sequence for insertion is denoted by R' I = R + p and for deletion by R' D = R -p. Equivalent Indels: Two indels are considered equivalent if and only if they produce the same alternate sequence. Note that equivalent indels must be of the same type (insertion and deletion) and same length.
Redundant Indels: Equivalent indels that are reported as distinct entries in a VCF file are defined as redundant indels.

Region of Equivalence:
This is defined as the range of positions in the reference sequence where equivalent indels occur.

Cyclic Permutation: A permutation
can be positive (left cyclic) or negative (right cyclic). Table 1. An example of equivalent indels.

Equivalent insertions Equivalent deletions
Reference, R GTCTA Reference, R ACTGTTGTG Table 1 shows an example of equivalent indels. Observe that all equivalent indels are cyclic permutations of each other (e.g., a cyclic permutation of CT is TC and cyclic permutations of TGT are GTT and TTG) and equivalence continues until there is a mismatch (see Supplementary Table 2).
This observation leads to the following theorem. Based on the theorem, an algorithm called UPS-indel (see Table 2) exhaustively increases the range of equivalence as far as possible in both left and right directions from a given indel position.
Finally for each indel in the VCF file, the algorithm reports its range of equivalence, which is called the   [11]. The input of UPS-indel is a reference chromosome sequence, a VCF file containing a list of indels, and an output file name, for example, This command line produces an output file named chr1.uvcf, containing the UPS-coordinates of all the indels in chr1.vcf. Figure 2(A) shows an example UVCF file.
The UVCF file keeps the same content/format as the VCF file, with an additional column that contains the indel's UPS-coordinate information. The interpretation of the UPS-coordinate follows: • Symbols + and -denote insertion and deletion, respectively, followed by the base pairs inserted/deleted from the reference, and the UPS-coordinate (in square brackets).
• UPS-indel groups all redundant indels together. For example, consider a group [rs34748242, rs59148039] with the UVCF entry shown in Table 3. These two indels belong to the same indel type (insertion), have same base pairs inserted (TG), and share the same UPS-coordinate and hence they are considered as equivalent. UPS-indel can compare multiple indel call sets. This utility is particularly useful for generating a highconfidence indel call set by taking the intersection of the results of different indel callers [12], or merging the indel calling results from different tools for a consensus variant caller [13], or comparing indel call sets generated by different indel callers to determine their relative recall, precision, and accuracy, and to understand the source of their dissimilarities. To use this utility of UPS-indel, after converting two VCF files to UVCF files, one can use the following command to get the comparison result ( Figure 2(C)), which contains useful statistics for downstream analysis: ./ups_compare_uvcf_files example/sample1.uvcf example/sample2.uvcf example/comparison_result.txt All of the above mentioned utilities of UPS-indel are also available at http://bench.cs.vt.edu/upsindel/ (Figure 3). UPS-indel is compared with other existing tools that also find equivalent indels through variant normalization. These tools include vt normalize (version 0.5) [2], BCFtools (version 1.3) [8], and GATK LeftAlignAndTrimVariants (version 3.5) [9]. Like UPS-indel, all of these tools take a VCF file and the reference genome as input and produce the normalized position of the indels in the VCF file.
Another tool Vindel [4] also finds equivalent indels using a heuristic approach, but was not included in the comparison as it uses a flat file as input instead of a VCF file.
A VCF file of dbSNP (version 142, GRCh37p13) and the GRCh37 reference genome were used as the inputs to these tools. The VCF file contains both SNPs and indels, and VCFTools [14] is used to extract indels from the VCF file. The comparison was extended to the COSMIC dataset as well.
There are other tools that could also be considered for comparison. Both VarMatch [3] and RTGTools [15] use a branch and bound algorithm to search for equivalent indels. They are not suitable for processing population scale indel call sets such as dbSNP and COSMIC because densely packed indels in such datasets make the search space too large to be processed by a branch and bound algorithm. READDI [16] considers repeat-induced ambiguities as well as tool-induced inaccuracies while searching for equivalent deletions using the longest common extension algorithm.
This tool is limited to finding deletions only, and hence not included in the comparison for the dbSNP and COSMIC datasets. Nevertheless, in this study a smaller dataset is used to compare UPS-indel with VarMatch (Version available on April 5, 2017), RTGTools (Version 3.7.1), and READDI (Version available on April 5, 2017).

Finding equivalent indels in the dbSNP dataset
The input VCF file contains about 8.9 million indels from the human genome. For this input, UPSindel produces the UVCF file and the other three tools, vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants, generate the normalized VCF file. These three tools perform left normalization of indels and output a left normalized representation. Therefore, for these three tools, two indels are equivalent if and only if they satisfy the following conditions: (1) Both indels are of the same type (insertion or deletion).  Table 4 shows that indels rs371246544 and rs71724031 have the same normalized position but are not equivalent. The comparison is based on the criterion: the redundant indel ratio = where the numerator is the total number of redundant indels reported since only one indel from each redundant indel group should be reported in the output and the remaining should be considered as redundant.     For the indel shown in Table 5 (panel A), no normalization was done by vt normalize, BCFtools, or GATK LeftAlignAndTrimVariants. UPS-indel splits the entry into three indels and finds the UPScoordinate for each of them separately (Table 5, panel B). Splitting the VCF entry and considering the indels separately, UPS-indel managed to find another indel equivalent to one of the indels (Table 5, panel C). Therefore UPS-indel reports indels with id rs374587598 and rs60022176 as redundant.
The example in Table 5 is for insertion; an example for deletion is illustrated in Table 6. In addition to the scenario mentioned above, GATK LeftAlignAndTrimVariants does not normalize any of the multiallelic indels regardless of the size which is also mentioned in [2]. Table 7 shows an example of this occurrence explaining why GATKLeftAlignAndTrim finds fewer number of redundant indels than vt normalize and BCFtools. One might think that decomposing multiallelic indels into several biallelic indels produces the same results as UPS-indel for the normalization tools. To check this, the "decompose" utility of vt was used to perform a vertical decomposition of multiallelic indels into biallelic indels. Applying vt normalize to the decomposed indels could not find equivalent indels for complex variants, whereas UPS-indel is able to find the equivalent indels. Table 8 shows an example of this occurrence. Since vt normalize and BCFtools produce exactly the same results, these complex variants are missed by BCFtools as well. In the example shown in Table 8

UPS-indel was used to find redundant indels in the COSMIC (Catalogue Of Somatic Mutations In
Cancer) dataset, the world's most comprehensive resource for exploring the impact of somatic mutations in human cancer [18]. With data collected for more than 2,500 human cancers, this archive describes millions of coding mutations, noncoding mutations, and other gene expression variants across the human genome.
For all chromosomes in the COSMIC dataset, UPS-indel identified 28.17% and 13.11% redundant indels in the COSMIC coding and noncoding indel datasets, respectively, which are higher than the redundant indel ratios reported by the other tools. Figure 6 shows the comparison of the redundant indel ratios reported by UPS-indel, vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants for both the COSMIC coding and noncoding datasets. Comparisons for chromosome-wise redundant indel ratios among the tools are given in Supplementary Materials (See Table 4 and Figure 2 for COSMIC coding and Table 5 and Figure 3 for noncoding indels).
Similarly, examining the sets of redundant indels identified by the tools, Figure 7 shows that for both the COSMIC coding and noncoding indels, UPS-indel identified all the redundant indels detected by the other tools. In addition to that, for the whole genome, 2,118 ( Figure 6A) and 553 ( Figure 6B) unique redundant indels for COSMIC coding and noncoding indels, respectively, are detected by UPS-indel but missed by other tools.
As for dbSNP, the reason why some COSMIC coding and noncoding indels were considered as redundant by UPS-indel but missed by other tools is that, in the normalized VCF for these other tools, redundant indels must contain the same pattern: [value of the REF column in the normalized VCF file -value of the ALT column in the normalized VCF file -value of the POS column in the normalized VCF file]. The reason for this pattern match restriction was given earlier. In Table 9, all tools except UPS-indel missed the indel with id COSM5068028 in the redundant indel group consisting of indels with id COSM3732389 and id COSM5348791, because of not having the same pattern. Therefore it might be assumed that only normalized position should be considered to group them together.
However, then the indel with id COSM3685916 would be placed in the same group, although it is a deletion type indel whereas the others are insertion type indels, and also the resultant sequences are different. UPS-indel groups the indels correctly by placing indels with ids COSM5068028, COSM3732389, and COSM5348791 in the same redundant indel group as they have the same base pair inserted, have the same region of equivalence, and also are of the same indel type.

GATKLeftAlignAndTrimVariants found fewer redundant indels than other tools because
GATKLeftAlignAndTrimVariants does not consider very large indels for normalization. For example, the indels with ids COSM5196837 and COSM5066846, which are deletions of length 371bps and 222 bps, respectively, are not considered by GATKLeftAlignAndTrimVariants for normalization. The reason is that GATK LeftAlignAndTrimVariants uses 200 bps as the default size of the sliding window on the reference (the parameter --reference_window_stop) while left aligning the alleles which is smaller than the length of the missed deletions. These tools are also compared based on the average running time taken to process the input VCF file for normalization (by vt normalize, BCFtools, and GATKLeftAlignAndTrimVariants) or for generating the UPS-Coordinate (by UPS-indel). All tools were run on a desktop computer having an Intel Core i7-2600 CPU with eight cores (at 3.40 GHz) and 16GB of RAM. Table 10 shows the average running time for chromosome 1 of the dbSNP VCF file. Among these tools BCFtools is the fastest taking 6 seconds followed by vt normalize (6.18 seconds), GATK LeftAlignAndTrimVariants (17.22 seconds), and UPS-indel (35.22 seconds), which is the slowest. Since UPS-indel searches for equivalent indels exhaustively and is theoretically rigorous, the computation time is not surprisingly higher than that for other heuristic normalization tools.

Evaluating UPS-indel's performance in comparing different indel call sets
In genomic research related to indel calling, an important step in downstream analysis is to compare multiple indel call sets for (1) generating a highly accurate benchmark indel call set by taking the intersection of multiple call sets as done by Zook et al. [12] for the sample NA12878, (2) merging the call sets of different indel callers in a consensus caller as done by Trubetskoy et al. [13] for exome data, and (3) evaluating the accuracy of a newly proposed indel calling tool by comparing its indel call set with the benchmark call set. Comparing different indel call sets is also a common step in studies comparing the performance of different indel callers as done in [5], [19], and [20]. Different indel callers having different representations of the same indel complicates the comparison of different indel call sets. In addition to strict matching of indels, as mentioned earlier, a naïve but previously commonly used approach to compare multiple indel calling results is based on a simple distance criterion, that is, indels are considered to be equivalent if they are within a distance threshold (e.g., ±5bp or ±25bp).
For example, the original 1000 Genomes project used ±25bp to compare multiple indel calling results [6] . To illustrate the advantage of using a UVCF file instead of a distance criterion or normalized VCF for comparing multiple VCF files, the alignment file for chromosome 11 of a single sample (HG00851) was picked up from the 1000 Genomes project and five indel callers: Dindel [21], GATK Unified Genotyper [9], GATK Haplotype Caller, Platypus [22], and Pindel [23] were used to produce VCF files for indels. The resultant VCF files were compared to determine the number of common indels from these five tools using three different approaches, namely a distance based approach, comparing the VCF files normalized by vt normalize and GATK LeftAlignAndTrimVariants, and comparing the UVCF files produced by UPS-indel. For the distance based approach, two indels are considered equivalent if (1) they belong to the same indel type (either both are insertion type or both are deletion type) , (2) have the same base pairs inserted/deleted, and (3) are in close proximity (within ±5 bps from each other). For the normalized VCF files and UVCF files, the same approach was used as discussed earlier for finding redundant indels.
First the VCF files produced by the five indel calling tools were compared to find overlap among them to determine the number of common indels using the distance based approach. In the second step, the VCF files of the five indel calling tools were normalized using vt normalize and GATK LeftAlignAndTrimVariants separately. For this sample, both normalization tools produced the same normalized VCF files. The normalized VCF files of five indel calling tools were compared to determine the common number of indels. Finally, UPS-indel was used to produce the UVCF files for the five indel calling tools and these UVCF files were compared to determine the common number of indels.
The result shows that the distance based approach found 584 indels in common from the five indel calling tools while 5,514 and 5,575 common indels were found by the normalized VCF and UPS-indel UVCF approaches, respectively. This demonstrates the better suitability of UPS-indel, compared to distance based or existing normalization based approaches, for comparing multiple VCF files. Note that this small number (61) of common indels identified by UPS-indel, but missed by the normalization tools, is based on a single chromosome of a single sample only, and much better performance of UPS-indel would be expected for the whole genome, as observed for the dbSNP and COSMIC datasets.
As mentioned earlier, the tools VarMatch [3], RTG Tools [15], and READDI [16] are also used for comparing indel call sets. However, VarMatch and RTG Tools, which use a branch and bound algorithm, are not suitable for population-scale indel call sets like dbSNP and COSMIC due to densely packed indels in those call sets. READDI processes deletions only. These tools are compared with UPS-indel (using the deletion call set of Platypus containing 14,438 deletions for chromosome 11 of the above mentioned single sample from the 1000 Genomes project as the baseline) on the deletion call sets of Dindel, GATK Unified Genotyper, GATK Haplotype Caller, and Pindel as the query call set to check overlap with the baseline. Table 10 shows the comparison of overlaps.  Table 10 shows that UPS-indel finds more common indels than the state-of-the-art tools when comparing multiple indel call sets. These tools are heuristic and therefore ignore indels that violate a particular heuristic criterion. For example, READDI searches for equivalent indels in an indel's neighboring region defined by the neighborhood size, and RTG Tools uses a cutoff strategy when the search space is too large. UPS-indel, on the other hand, exhaustively searches for and finds all equivalent indels, thus finds more common indels than the aforementioned tools.

Conclusion
This paper describes UPS-indel, a user friendly tool that creates a universal positioning system called UPS-coordinates for all indels listed in a VCF file, and exhaustively finds all equivalent indels.
The UPS-coordinate is a range of positions where all indels equivalent to a specific indel can occur.
Since equivalent indels produce the same mutant sequence and thus have the same biological effect, reporting them as separate indels causes data redundancy and may artificially inflate the statistics of indel variations. Under the proposed universal positioning system, all equivalent indels have the same UPS-coordinate which avoids possible annotation ambiguity. Therefore, by checking the UPScoordinate, one can easily filter out redundant indels from variant databases. UPS-indel is robust enough to handle complex variants and is able to detect more redundant indels than the currently existing approaches. UPS-indel could be widely used for easy and accurate systematic comparison of indels generated by different indel calling programs or deposited in databases. By eliminating the indel redundancy issue, this work offers the community the proposed universal positioning system to represent indels (so as to avoid ambiguity), which can greatly improve various downstream genomic analyses related to indels.