eVIDENCE: a practical variant filtering for low-frequency variants detection in cell-free DNA

Plasma cell-free DNA (cfDNA) testing plays an increasingly important role in precision medicine for cancer. However, circulating cell-free tumor DNA (ctDNA) is highly diluted by cfDNA from non-cancer cells, complicating ctDNA detection and analysis. To identify low-frequency variants, we developed a program, eVIDENCE, which is a workflow for filtering candidate variants detected by using the ThruPLEX tag-seq (Takara Bio), a commercially-available molecular barcoding kit. We analyzed 27 cfDNA samples from hepatocellular carcinoma patients. Sequencing libraries were constructed and hybridized to our custom panel targeting about 80 genes. An initial variant calling identified 36,500 single nucleotide variants (SNVs) and 9,300 insertions and deletions (indels) across the 27 samples, but the number was much greater than expected when compared with previous cancer genome studies. eVIDENCE was applied to the candidate variants and finally 70 SNVs and 7 indels remained. Of the 77 variants, 49 (63.6%) showed VAF of < 1% (0.20–0.98%). Twenty-five variants were selected in an unbiased manner and all were successfully validated, suggesting that eVIDENCE can identify variants with VAF of ≥ 0.2%. Additionally, this study is the first to detect hepatitis B virus integration sites and genomic rearrangements in the TERT region from cfDNA of HCC patients. We consider that our method can be applied in the examination of cfDNA from other types of malignancies using specific custom gene panels and will contribute to comprehensive ctDNA analysis.

sequence adjacent to the target is highly consistent with the reference genome, the region can be labeled as "M" with/ without "I (insertion to the reference)" or "D (deletion from the reference)" operation. This behavior can introduce sequence mismatches in the stem regions whose origins are not biological molecules.
We removed UMT and stem sequences and matched base qualities from the segment sequence and base quality fields of BAM files containing only reads covering the positions of the candidate variants as described below.
1) When a read is a forward read, the first 6 bases in the query sequence field represent UMT.
UMT sequence was kept as left UMT for later use.
2) The following stem sequence is determined uniquely by the beginning base. For example, when it starts with "A", the sequence is "AGTAGCTCA" (Supplementary Fig. S1). If the hamming distance between the expected stem sequence and the query sequence was ≥ 2, the read was discarded.
3) When a stem sequence is "AGTAGCTCA", the CIGAR value should start with "15S", due to the first 6 bases and the following 9 bases being from UMT and the stem, respectively.
However, there were several inconsistent values which we divided into the following cases: A) The CIGAR values start with "xS" and x < 15 When the first CIGAR operation was S and its length was less than 15, the length was 3 changed to 15 and 15-x was subtracted from the length of the following operation. We also added 15-x to the leftmost mapping position.
B) The CIGAR values start with "xS" and x > 15 When the first CIGAR operation was S and its length was more than 15, the length was changed to 15 and x-15 was added to the length of the following operation. We also subtracted x-15 to the leftmost mapping position. However, if there were extra "GTAGCTCA", a common sequence of four kinds of stems, in the following sequence, the region of S operation was extended to the position of the rightmost common sequence. The length of the following operation and the mapping position was changed accordingly. This duplication of stem sequences can occur during library synthesis.
C) The CIGAR values start with "xM" and x <15 Firstly, we added the lengths of M and I operations sequentially until the sum gave 15 or more. Here we defined sum of the lengths of M and I operations as sum_MI, and sum of the lengths of M and D operations as sum_MD.
When sum_MI was 15, there were two cases in which the next operation was D or M.
For example, in the former case, the CIGAR value could start with "xMyDzIwD" (sum_MI = x+z = 15, sum_MD = x+y). We changed this value to "15S" and added sum_MD+w to the mapping position. In the latter case, the CIGAR value could be "xMyDzIwM" (sum_MI = x+z = 15, sum_MD = x+y). We changed this value to "15SwM" and added sum_MD to the mapping position.
When sum_MI was more than 15, there were also two cases in which the last CIGAR operation when sum_MI exceeded 15 was M or I. For example, in the former case, the CIGAR value could be "xMyDzIwM" (sum_MI = x+z+w > 15, sum_MD = x+y+w). We changed this value to "15S(sum_MI-15)M" and added (15+sum_MD-sum_MI) to the mapping position. In the latter case, the CIGAR value could be started with "xMyDzI" (sum_MI = x+z > 15, sum_MD = x+y). We changed this value to "15S(sum_MI-15)I" 4 and added sum_MD to the mapping position.
D) The CIGAR values start with "xM" and x = 15 When the first CIGAR operation was M and its length was 15, the following operation must be D or I. Namely, the CIGAR value starts with "15MyD" or "15MyI". In the former case, we changed the M operation to S and deleted D operation. We then added 15+y to the leftmost mapping position. In the latter case, we changed the M operation to S and added 15 to the mapping position.
E) The CIGAR values start with "xM" and x > 15 When the first CIGAR operation was M and its length was more than 15, the value was changed to "15S(x-15)M". We then added 15 to the leftmost mapping position. F) A) ~ E) was performed when a stem sequence started with "T", "G" and "C" (the length of S operation should be 17, 14 and 16, respectively). 4) When a read is a reverse read, the last 6 bases in the query sequence field represent UMT and the adjacent 8-11 bases indicate a stem sequence. The same procedure mentioned above was done for reverse reads, but the leftmost mapping positions were not changed. UMT sequence was also kept as right UMT. 5) Bases and base qualities in the newly modified soft clipping regions were removed from the segment sequence and base quality fields of BAM files.
6) The leftmost and rightmost mapping positions were compared between paired reads. The rightmost mapping position was calculated by adding (sum_MD-1) to the leftmost mapping position (sum_MD: sum of the lengths of M and D operations in the revised CIGAR value).
If the rightmost position of a forward read was larger than that of a reverse read, the forward read was considered to cover the stem sequence on the 3' side. Similarly, if the leftmost position of a reverse read was smaller than that of a forward read, the reverse read was considered to cover the stem sequence on the 5' side. In these cases, bases originated from stems and matched base qualities were removed.

5
Finally, for paired reads, left and right UMT sequences extracted from forward and reverse reads were added to the read name. Using the newly created read names, segment sequences and base qualities, a new FASTQ file was produced. This file was converted into BAM format by BWA and SAMtools 2 for further filtering of candidate variants.

Filtering of candidate variants
For variants filtering, reads covering the position of each candidate variant with mapping quality ≥ 20 were selected. If three or more variants were found, or the distance between two variants was less than 10 bp in any reads, the reads were all discarded. We then extracted base calls with quality ≥ 20 and UMTs at each candidate position for single nucleotide variant (SNV) filtering. Base calls which share the same UMT were grouped into a "UMT family". In the same way, CIGAR values, MD:Z tags and UMTs were extracted for indel filtering, and CIGAR and MD:Z were coordinated into UMT families. UMT families in which the number of family members of ≤ 2 were discarded. When the number of UMT families, which represents the coverage of each candidate position, was less than 100, the candidate variant was discarded.
Next, the consensus base call or CIGAR value was determined by majority within each family. If there were two or more family members which did not support the consensus in any UMT families, we discarded the candidate. For indel filtering, the MD:Z tag was used for confirming whether the UMT family supported the candidate indel or not.
Furthermore, we checked the UMTs of families which supported candidate variants. If left or right UMT matched exactly among families supporting a variant, the candidate was discarded.
In addition, if UMT of one family was consistent with combination of left UMT and right UMT of other two families, the family was discarded as this family might be generated by the recombination of the other two families. Finally, if the number of families supporting a variant was less than three, the candidate variant was discarded.
For further filtering of candidate SNVs, candidates represented in dbSNP 6 (http://www.ncbi.nlm.nih.gov/SNP/) or the integrative Japanese Genome Variation Database (http://ijgvd.megabank.tohoku.ac.jp/) were excluded. Additionally, we examined the distribution of variant allele frequency (VAF) of candidate SNVs registered in public databases with those of other candidates (Supplementary Fig. S6). This revealed that the latter had a bimodal distribution whereby one peak existed in a region of < 11%, and the other existed in a region of > 22%, which was mostly consistent with one peak of the former one near 40%. This suggested that there was clustering of heterozygous single nucleotide polymorphisms. For this reason, we regarded SNVs with VAF of > 20% as germline variants and discarded them. After these filtering, the remaining variants were functionally annotated with ANNOVAR 3 .

Validation of the algorithm for consensus base calling
To validate the algorithm for filtering candidate variants described above, we generated an artificial library by mixing three libraries with different proportions (0.5% of RK442, 1.0% of RK443 and 98.5% of RK445). There were a total of 150 known single nucleotide polymorphisms (SNPs) that were present in either or both RK442 and RK443, but not in before final filtering. The VAF distribution of candidate SNVs that were unregistered in public databases (left) had a bimodal distribution whereby one peak existed in a region of < 11% and the other existed in a region of > 22%. The VAF distribution of candidates that are registered in public databases (right) had two peaks around 40% and at 100%, indicating heterozygous and homozygous single nucleotide polymorphisms (SNPs), respectively.