The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species

Identification of genetic variations is a central part of population and quantitative genomics studies based on high-throughput sequencing data. Even though popular variant callers such as Bcftools mpileup and GATK HaplotypeCaller were developed nearly 10 years ago, their performance is still largely unknown for non-human species. Here, we showed by benchmark analyses with a simulated insect population that Bcftools mpileup performs better than GATK HaplotypeCaller in terms of recovery rate and accuracy regardless of mapping software. The vast majority of false positives were observed from repeats, especially for GATK HaplotypeCaller. Variant scores calculated by GATK did not clearly distinguish true positives from false positives in the vast majority of cases, implying that hard-filtering with GATK could be challenging. These results suggest that Bcftools mpileup may be the first choice for non-human studies and that variants within repeats might have to be excluded for downstream analyses.

www.nature.com/scientificreports/ Filtering potential false-positive SNVs is an essential step in variant calling. In Bcftools mpileup, filtering can be performed readily from variant calling score, which is a phred-scaled probability of false variant calling. GATK HaplotypeCaller provides two ways of filtering. The first option is hard filtering, which discards SNVs if variant scores are lower or higher than certain thresholds, which are typically determined by humans. These variant scores include QD (the variant quality score normalized by read-depth), FS (the phred-scaled probability of strand bias), SQR (strand bias calculated by symmetric odds ratio test), MQ (the root mean square of mapping quality of reads mapped to a nucleotide), MQRankSum (the normalized difference in mapping qualities between reference and alternative alleles), and ReadPosRankSum (the normalized difference in position between reference and alternative alleles within the reads) 5 . The second option is VQSR (Variant Quality Score Recalibration). In this method, the threshold of variant scores is determined by machine learning from truth and training resources generated from a reference SNV data set. Such truth and training resources do not exist in the vast majority of non-human species.
In this study, we compared the performance of variant calling from P with varying d in an insect population between GATK HaplotypeCaller and Bcftools mpileup using benchmark analyses. Since a reference SNV data set is not available for most non-human species, we performed benchmark analysis from a simulated insect population, which mimics Drosophila melanogaster.   Fig. 2A). The recovery rates were negatively correlated with d both for Bcftools mpileup (ρ = − 0.999, p = 5.115 × 10 -5 ; Spearman's correlation test) and GATK HaplotypeCaller (ρ = − 1.000, p = 1.128 × 10 -5 ), while the range of the recovery rate is less than 1%. Bcftools mpileup had lower proportions of false positives (0.00373-0.0321%) than GATK HaplotypeCaller (1.759-1.959%) by 54-521 times. Bcftools-mpileup had a positive correlation between the proportion of false positives and d (ρ = 1.000, p = 0.0004293), while this trend was not observed from GATK HaplotypeCaller (ρ = − 0.858, p = 0.06269). As a consequence, the ratio of the proportion of false positives from Bcftools mpileup to GATK HaplotypeCaller was positively correlated with d (ρ = 1.000, p = 1.53 × 10 -5 ).
When d equals zero, the majority of false positives were observed from repeats (78.91% and 85.24% of total false positives for Bcftools mpileup and GATK HaplotypeCaller, respectively) in line with a previous study 6 . GATK HaplotypeCaller generated a higher proportion of false positives at repeats among total false positives than Bcftools mpileup by 54% (p = 0.03585; two-tailed fisher's exact test). GATK HaplotypeCaller generated a particularly high number of false positives at LTR transposons and LINE transposable elements ( Table 2). In addition, GATK HaplotypeCaller generated disproportionally abundant false positives at simple repeats, low complexities, and satellites.
Then, we tested the effect of filtering. When SNVs with variant calling scores lower than 20 were discarded from the SNV data of Bcftools mpileup, the recovery rate ranged between 88.67 and 89.65% (Fig. 2B), which is lower than the unfiltered data set just by 0.59-0.86%. The recovery rates were negatively correlated with d (ρ = − 0.988, p = 0.001551). The proportion of false positives ranged between 0.0023 to 0.0286%, which is lower than the unfiltered data set by 35-42%, and it was positively correlated with d (ρ = 1.000, p = 3.854 × 10 -5 ).
For GATK-HaplotypeCaller, no variant score discriminated clearly between true positives and false positives (Fig. 3). The exception was MQ scores with low d. For example, when d = 0, 80.39% of false positives had MQ scores lower than 40, while only 4.929% of true positives had MQ scores lower than 40 (Fig. 4A). On the other hand, when d = 0.02, 93.74% of false positives and 78.56% of true positives from P had MQ scores < 40. This result implies that MQ < 40 filterings will cause a severe reduction in the recovery rate by discarding 78.56% of true positives. Interestingly, a bimodal distribution of QD scores was observed except for a case with d = 0 (Fig. 4B,  left). Each mode in this bimodal distribution appears to represent SNVs from P and D (Fig. 4B, right).  www.nature.com/scientificreports/ We performed hard-filtering according to the suggestions in the Technical Document of GATK 5 , which includes MQ < 40. GATK-HaplotypeCaller had much lower recovery rates than Bcftools mpileup (18.09-81.36%). The recovery rate rapidly decreased as d increased (ρ = − 0.984, p = 0.002369) (Fig. 2B). Furthermore, the proportion of false positives was 0.14-0.1910%, which was much higher than that of Bcftools mpileup (0.0023-0.0286%).
We also used VQSR filtering, which is based on machine learning with a training dataset to choose optimal criteria on variant scores, to perform filtering from the SNV data of GATK-HaplotypeCaller. For this purpose, we generated different sets of Illumina reads from the same 20 individuals with diploid genomes. The recovery rate was comparable (89.28-89.84%) to that of Bcftools mpileup, and it was negatively correlated with d (ρ = − 0.986, p = 0.002014). However, the proportion of false positives (0.9906-1.486%) was much higher than that of Bcftools mpileup, even higher than GATK hard filtering.

Discussion
The identification of genetic variations is a critical step in studies based on high-throughput sequencing data. Both false positive and false negative SNVs might mislead population and quantitative genomics analyses. Thus, a variant caller needs to be chosen carefully although the performance of variant callers is largely unknown in non-human species, especially in invertebrates. In this study, we compared the performance of the two most popular variant callers, Bcftools mpileup and GATK HaplotypeCaller, in a simulated insect population.

Performance of BCFtools and GATK HaplotypeCaller.
In the dataset based on the mapping by bow-tie2, when filtering was applied to the SNV data from Bcftools mpileup by discarding SNVs with variant calling scores less than 20, the proportion of false positives was reduced by 35-42% while the recovery rates were largely unchanged. When hard filtering was applied to the SNV data generated by GATK HaplotypeCaller, both the proportion of false positives and the recovery rates were substantially reduced. When VQSR filtering was used, the recovery rate was comparable with the filtering of Bcftools mpileup, while VQSR filtering had much higher proportions of false positives than Bcftools mpileup.
The effect of d is slight in all cases with a single exception of GATK-hard filtering, which had severely decreased the recovery rates when d was high. For example, when d was increased from 0 to 0.02, the recovery rate was reduced from 0.8948 to 0.1809. f statistics show that the best result can be obtained from the filtered SNV dataset from Bcftools mpileup. We also found that repeats are the main source of false positives, especially for GATK HaplotypeCaller. Taken together, Bctfools mpileup performs better than GATK HaplotypeCaller in terms of recovery rate and the correctness at both raw and filtered SNV datasets.
When bwa-mem was used for mapping, the same trend was observed with the exception that GATK had lower proportions of false positives (0.018-1.271%) than Bcftools mpileup (0.0101-2.304%) after filtering. It should be noted, however, that bowtie2 generated much lower proportions of false positives (0.0023-0.021%) than bwa-mem when Bcftools mpileup was used for variant calling.  www.nature.com/scientificreports/ When the coverage was low (5 ×) and bowtie2 was used for mapping, GATK VQSR and Bcftools mpileup had higher recovery rates, lower proportions of false positives, and higher f scores than GATK hard filtering after filtering, implying that GATK hard filtering had the lowest performance.
Taken together, these results show that Bcftools mpileup outperforms GATK HaplotypeCaller after hard filtering, regardless of the software used for mapping or the coverage of reads, especially when d was high. GATK VQSR appears to outperform Bcftools mpileup only when coverage is low according to the f score.
Potential causes of the differential performance. The different performance between Bcftools mpileup and GATK HaplotypeCaller can be due to the way of handling alignments generated by a mapper. Bcftools mpileup uses alignments of a mapper as it is. Bcftools mpileup uses mapping scores to evaluate variant calling in a way that the variant calling score of an SNV is not allowed to be higher than the mapping score. The mapping score is a Phred-scaled probability of unique mapping, by definition. If a read is mapped to a repeat, the mapping score is likely to be low because the read could be also mapped to other similar repeats. Thus, Bcftools mpileup is able to handle potential false positives at repeats by discarding ambiguous mappings using the mapping score. GATK HaplotypeCaller adjusts an alignment by taking account of reads that are mapped to overlapping sequences. This correction is based on the assumption that a read was mapped to the right region in a reference genome with a possibility of local misalignment. If a read is mapped to an incorrect region in a reference genome, then a false-positive cannot be fixed by the adjustment of an alignment. Since mis-mappings are presumably more abundant at repeats, we believe that the higher number of false positives from GATK Hap-lotypeCaller is, in a large part, due to the mismappings at repeats. It should be noted that false positives can also be enriched at multi-copy genes, duplicated chromosome segments, and even DNA motifs for protein binding for the same reason.
The relatively low performance of GATK hard filtering is potentially due to the fact that most scores do not discriminate between false positives and true positives (Fig. 3). The exception is the MQ score with low d. When d = 0, filtering by MQ < 40 effectively removed false positives while the recovery rate was largely unchanged. However, the difference in MQ scores between true positives and false positives was decreased as d increased. As a consequence, hard filtering according to MQ scores appears to reduce the recovery rate significantly.
The distribution of QD scores can be informative to determine whether SNV filtering according to MQ scores. The distribution of QD scores was different between P and D (Fig. 4) and a bimodal distribution was observed when d > 0. The distribution of QD scores appears to be determined by the spectrum of alternative allele frequencies. SNVs from D are expected to have higher alternative allele frequencies than that from P. Therefore, if a significant proportion of SNVs was generated from D, a bimodal distribution of QD can be generated. In other words, a bimodal distribution of QD scores can be a signature that a significant proportion of SNVs was generated by D. In this case, filtering according to QD score is likely to reduce the recovery rate from P severely, and VQSR is the only reliabme method of filtering. If a unimodal distribution of QD score was observed, on the other hand, filtering according to QD score will effectively reduce false positives while the recovery rate is largely unchanged. On the official GATK homepage, a bimodal distribution of QD scores is explained as, 'This is because hom-var samples contribute twice as many reads supporting the variant than do het variants' 5 . Our results show that this explanation is may mislead because the distribution of QD scores reflects the frequency of alternative alleles in a population, instead of hetero or homozygosity of each sample.
Our result may appear to contradict previous papers showing that GATK HaplotypeCaller performs better for human data than Samtools/Bcftools mpileup. A majority of these studies assessed the performance by the comparison of an SNV dataset from a variant caller and another dataset ascertained by previous experiments. As mentioned by Heng Li, this comparison 'would only give us an estimate of the relative accuracy--if two pipelines are affected by the same artifact that a third pipeline does not have, then the third pipeline will appear worse even though it is in fact better. In addition, comparative studies usually measure the accuracy with summary statistics such as the fraction of calls present in dbSNP or the transition-to-transversion ratio. They do not tell us the wrong sites' 6 . Indeed, obtaining correct SNVs positions across the whole genome is almost impossible from empirical data, thus evaluation of variant callers can be consequently biased. A list of whole true positive SNVs can be obtained only from a simulated population for unbiased comparison. However, it should be noted that any simulated population can be systematically different from a real population. Thus, we might be yet to reach a conclusion on which variant caller performs better between GATK HaplotypeCaller and Bcftools mpileup even in human studies.

The choice of a variant caller.
Here, we propose the following for variant calling in non-human species.
First, Bcftools mpileup should be considered for variant calling over GATK HaplotypeCaller in non-human studies because Bcftools mpileup may generate a lower number of false positives and because SNV filtering according to variant calling scores may reduce effectively false positives while recovery rate remains largely unchanged. Second, SNVs within repeats should be excluded for downstream analysis, especially for GATK Haplotype-Caller. Third, when GATK HaplotypeCaller is used, we strongly advise using reference genome sequences that are not highly diverged from an analyzed population. Fourth, if the QD score from GATK shows a unimodal distribution, hard-filtering according to MQ scores can effectively reduce false positives while the recovery rate is largely unchanged; Fifth, if the QD score from GATK shows a bimodal distribution, VQSR filtering should be used. If a training dataset for VQSR is not available, Bcftools mpileup can be used as an alternative. Sixth, if a researcher wants to minimize the proportion of false-positive SNVs, we recommend using bowtie2 for mapping, rather than bwa-mem.

Methods
Simulation for data generation. A forward simulation was performed to generate a population of Drosophila melanogaster using Slim2 7 . The population size was 10 6,8 . The length of simulated DNA is 110,338,342 bp, which is the total length of autosomes in BDGP Release 6 (https:// hgdow nload. soe. ucsc. edu/ golde nPath/ dm6/ bigZi ps/). The recombination rate and the mutation rate were 2.8 × 10 -9 and 2.0 × 10 -8 , respectively 9,10 . For computation feasibility, the population size, mutation rate, and recombination were rescaled to 10 3 , 2.8 × 10 -6 , and 2.0 × 10 -5 , respectively. This re-scaling essentially generates the same result of simulation since the populationscaled mutation rate and recombination rate remain unchanged. The simulation was performed during 3,000 generations (equivalent to 3 × 10 6 generations without rescaling) until we collected the genotypes of random 40 diploid individuals. The first alleles of these 40 diploid individuals were mapped to the autosomal reference genome of BDGP Release 6 with a randomly chosen alternative nucleotide sequence, followed by the generation of 20 diploid genomes by combining two haploid genomes. Simulated paired-end Illumina reads were generated from each diploid genome using neat-genreads v3.0 11 with 20 × coverage, which corresponds to 40 × coverage for a haploid reference genome. The read length was 150 bp. The average insert length was 300 bp and the standard deviation was 30 bp. The reference genomes were generated from the autosomal sequences of BDGP Release 6 by the introduction of new mutations with varying phylogenetic distances, d (0.5%, 1%, 1.5%, and 2%), using simuG v1.0.0 12 . The autosomal sequences of BDGP Release 6 were also used for the reference genome sequence. The genomic coordinates of repeat sequences within BDGP Release 6 were obtained from the table browser of UCSC Genome browser (https:// genome. ucsc. edu/ cgi-bin/ hgTab les).
Mapping. The reads were mapped against these genome references using bowtie2 v2.3.5.1 13 with the default setting of end-to-end mapping or bwa-mem2 v2.2. The resulting bam files were sorted according to the coordinates using samtools-1.9 1 . A read-group was assigned to each bam file using AddOrReplaceReadGroups function from the Picard tool v2.9.4 14 . The resulting bam files were indexed using samtools-1.9.
Variant calling and filtering. We performed haplotype calling for each bam file using the HaplotypeCaller function at GATK v4.2.0.0 2 . The resulting gvcf files were merged into a single gvcf file. Variant calling was performed from this gvcf file using GenotypeGVCFs function. Then, SNVs were collected from the resulting vcf file. Hard Filtering was performed with the following criterion: QD < 2, FS > 60, MQ < 40, MQRankSum < − 12.5, ReadPosRankSum < − 8. For VQSR filtering, we generated two sets of 40 × simulated reads from the 20 diploid genomes. Then, variant calling was performed using HaplotypeCaller and GenotypeGVCFs at GATK. The resulting two vcf files with the information of true positives and false positives were used to recalibrate variant scores, and VQSR filtering was performed from the recalibrated variant scores.
Variant calling was also performed using Bcftools v1.9 with 'bcftools mpileup -f REFERENCE LIST_OF_BAM | bcftools call -mv -Oz -o VCFFILE' command, where REFERENCE, LIST_OF_BAM, and VCFFILE are the name of the reference genome sequence, the list of bam files, and the resulting vcf file name, respectively. Then, only SNVs were collected. Filtering was performed by discarding SNVs in which the variant calling score at QUAL field is lower than 20.
Evaluation of performance. True positives and false negatives of the SNVs introduced by slim2 v2.0 were counted using custom java scripts. The recovery rate was calculated by TP/(TP + FN), where TP and FN are the numbers of true positives and false positives, respectively.
If an SNV was not found from the list of mutations generated by slim or SimuG, this SNV was identified as a false positive. The proportion of false positives was calculated by the number of false positives divided by the total number of SNVs in a vcf file. f scores were calculated by (2 × Recovery rate × Proportion of false positive)/ (Recovery rate + Proportion of false positive).

Data availability
All scripts used in this study are available at https:// github. com/ kiwoo ng-nam/ Varia nt-caller-compa rison. The simulated reads are available on request.