Comparison of calling pipelines for whole genome sequencing: an empirical study demonstrating the importance of mapping and alignment

Rapid advances in high-throughput DNA sequencing technologies have enabled the conduct of whole genome sequencing (WGS) studies, and several bioinformatics pipelines have become available. The aim of this study was the comparison of 6 WGS data pre-processing pipelines, involving two mapping and alignment approaches (GATK utilizing BWA-MEM2 2.2.1, and DRAGEN 3.8.4) and three variant calling pipelines (GATK 4.2.4.1, DRAGEN 3.8.4 and DeepVariant 1.1.0). We sequenced one genome in a bottle (GIAB) sample 70 times in different runs, and one GIAB trio in triplicate. The truth set of the GIABs was used for comparison, and performance was assessed by computation time, F1 score, precision, and recall. In the mapping and alignment step, the DRAGEN pipeline was faster than the GATK with BWA-MEM2 pipeline. DRAGEN showed systematically higher F1 score, precision, and recall values than GATK for single nucleotide variations (SNVs) and Indels in simple-to-map, complex-to-map, coding and non-coding regions. In the variant calling step, DRAGEN was fastest. In terms of accuracy, DRAGEN and DeepVariant performed similarly and both superior to GATK, with slight advantages for DRAGEN for Indels and for DeepVariant for SNVs. The DRAGEN pipeline showed the lowest Mendelian inheritance error fraction for the GIAB trios. Mapping and alignment played a key role in variant calling of WGS, with the DRAGEN outperforming GATK.

Whole genome sequencing (WGS) of a human genome for less than 1000 USD has become reality 1 , and cost might drop to even 100 USD with the Illumina NovaSeq system 2 . Large-scale WGS studies have already been initiated or even completed because of the greater cost efficiency, and the number of such studies is expected to increase, with several to be run in smaller labs [3][4][5] . The availability of fast, simple to use and accurate genotype calling pipelines is therefore of utmost importance.
GATK is the most frequently used pipeline 6 , but other pipelines have outperformed it in terms of the F 1 statistic, precision, and recall. Specifically, DeepVariant 7 won the first precisionFDA Truth Challenge for shortread sequencing in 2016 and had the highest accuracy of single nucleotide polymorphisms (SNPs). The winner of the second precisionFDA Truth Challenge for short-read sequencing in 2020 was Illumina's DRAGEN 8 . It outperformed other pipelines 6,[9][10][11][12] , in particular in difficult-to-call genomic regions 13 . It is, however, unclear which pipeline should be used for secondary analysis of WGS data, i.e., from fastq to vcf, in terms of both speed and accuracy. The best choice is challenging because recent pipeline comparisons relied on synthetic-diploid www.nature.com/scientificreports/ and simulated data 14 , did not include the award-winning DeepVariant and/or DRAGEN pipelines 6,9,10,12,15 or focused on the comparison of different sequencing platforms 16 . Furthermore, several comparisons focused only on variant calling, the last step in the secondary analysis pipeline.
Since it is important to select the most appropriate secondary analysis pipeline for large-scale WGS efforts, we compared DRAGEN against GATK with BWA-MEM2 in the mapping and alignment steps (upstream pipeline), and DeepVariant and DRAGEN against GATK in the variant calling step (downstream pipeline, Fig. 1). To assess the performance of the pipelines, we used samples from the genome in a bottle (GIAB) consortium as the reference truth set. We successfully sequenced one GIAB sample (NIST ID HG002) 70 times in different sequencing runs and one trio (NIST IDs HG002, HG003 and HG004) three times. These sequences allowed for the comparison of different pipelines against truth sets for single nucleotide variations (SNVs), insertions and deletions as well as complex variants (Indels) in simple-to-map, difficult-to-map, coding, and non-coding regions.   Supplementary Fig. S1). The GATK Haplotypecaller in the DRAGEN mode took 189 ± 26 min to run, and 134 ± 20 min were required when GATK was run in the standard mode.
Called variants and Ti/Tv ratio. When DRAGEN was used in the alignment step, on average > 100,000 more filter passing variants were detected as compared to using GATK for mapping andalignment. Furthermore, when DRAGEN was used for variant calling, approximately 200,000 additional variants passed filtering as compared with DeepVariant or GATK in the variant calling ( Table 1).
The transition-to-transversion (Ti/Tv) ratio is an important measure to assess the quality of SNV calling. After stringent quality control, WGS studies are expected to have a Ti/Tv ratio around 2.0-2.2 17 . For our GIAB  Benchmarking for SNVs and Indels. Figure 3 displays F 1 scores, precision, and recall for SNVs (upper part A) and Indels (lower part B), respectively, for the 6 different pipelines after filtering based on chromosomes 20-22 of HG002. Both parts of the figure show substantially higher F 1 scores when DRAGEN was used upstream compared to GATK. DeepVariant and DRAGEN performed similarly in the variant calling step when mapping and alignment was done with the DRAGEN. F 1 scores and precision were slightly higher for SNVs with Deep-Variant compared to DRAGEN, which has its basis in higher precision. In contrast, the recall for SNVs was lower for DeepVariant than for DRAGEN. In the case of Indels, the F 1 score of the DRAGEN was higher than for DeepVariant, when the DRAGEN was used upstream. Supplementary Figure S2 confirms these findings using all autosomes for GIAB sample HG003. On this sample, the DRAGEN outperformed GATK with BWA-MEM2 upstream. The DRAGEN also slightly outperformed DeepVariant in the variant calling step on the F 1 score for both SNVs and Indels. However, the precision of DeepVariant was still higher for DeepVariant compared to DRAGEN variant calling for both SNVs and Indels.

Benchmarking for SNVs and Indels in simple-to-map (simple) and difficult-to-map (complex) regions.
Results were similar for difficult-to-map (complex) and simple-to-map (simple) regions based on chromosomes 20-22 of HG002 (Fig. 4). These regions have been defined for the truth set of the GIAB samples; for details, see Methods. For SNVs, F 1 scores were lower in complex regions, when GATK was used upstream compared to any DRAGEN pipeline upstream. These differences in the F 1 statistic were primarily caused by low recall values for GATK-based pipelines. However, even precision values were lower for SNVs in complex regions when GATK was used for mapping and alignment. In simple regions, precision was similar for all pipelines in case of SNVs, while recall was lower when GATK was used. Consequently, F 1 scores were higher for all DRAGEN-based pipelines than for all GATK-based pipelines. A more detailed display for short, medium, and long insertions and deletions by regions is provided in Supplementary Figs. S3 and S4, respectively. Benchmarking using precision, recall and F 1 scores for Indels of different size. Figure 6 shows F 1 scores, precision, and recall for insertions (upper part) and deletions (lower part) of different sizes. Again, pipelines that used DRAGEN for mapping and alignment outperformed comparable GATK pipelines. Differences increased with the size of the insertions or deletions. In the variant calling step, DRAGEN overall showed a better performance than DeepVariant.
Mendelian inheritance error fractions. The GIAB trio consisting in HG002, HG003, and HG004 was sequenced three times to estimate Mendelian inheritance error fractions. Figure 7 shows that Mendelian inheritance error fractions were lower for the in-built DRAGEN variant caller. Overall, the DRAGEN mapping and alignment outperformed GATK with BWA-MEM2 mapping and alignment.  Computational efficiency was another aspect of interest in our comparative study. The highest speed was observed for the DRAGEN. Our GIAB samples had an average coverage of approximately 35×, and the processing time per sample was 36 ± 2 min for both mapping/alignment and variant calling, which was substantially faster than GATK. Unfortunately, we were unable to provide comparisons for cost per sample because these Figure 5. Percentages of F 1 score, precision, and recall for coding and non-coding regions for single nucleotide variation (SNV) (upper 6 panels) and Indels (lower 6 panels) for the 6 pipeline combinations, based on chromosomes 20-22 of the genome in a bottle sample HG003. Labels on the x-axis were defined in detail in Fig. 2. D DRAGEN,  Others considered "computational time the most important advantage of the DRAGEN platform" 14 . However, our pipeline comparison showed the high accuracy of the DRAGEN in the mapping and alignment step. We consider this to be even more important because it affects downstream association and functional annotation analyses. The aspect of accuracy is not only demonstrated by the repeated sequencing of the GIAB sample HG002 and the GIAB sample HG003 (Supplementary Fig. S2) and their comparison with the truth set data, but also by the Mendelian inheritance errors detected in the repeatedly sequenced GIAB trio. On the one hand, DeepVariant had fewer Mendel errors than GATK-based variant calling on parent-offspring trios after GATK-based mapping and alignment, which is in line with the findings of Lin et al. 6 . On the other hand, when the DRAGEN was used for mapping and alignment, there was almost no difference between DeepVariant and GATK in the variant calling step. Both mapping and alignment approaches resulted in similar, specifically lowest Mendel errors when DRAGEN was used for variant calling. Results might differ when DeepTrio is used rather than DeepVariant for variant calling in trios. www.nature.com/scientificreports/ Furthermore, variants that pass from variant callers are generally further filtered during gvcf joint calling steps. In the present work we have focused on the pipeline from fastq to single sample gvcf. Future work should investigate the effect of the different joint calling and gvcf filtering approaches.
One limitation of our trio analysis is that all subjects were treated individually. Inheritance-exploiting variant callers, such as DeepTrio 18 were not taken into consideration. However, large WGS studies generally focus on the association analysis of unrelated subjects. Furthermore, we only used subjects of Ashkenazi Jewish descent. In case of differences in the detection of variation between subjects with different population genetic backgrounds, this could, in principle, lead to a bias. However, we consider this bias to be negligible.
As different sequencers may have distinct error profiles 19 , one may consider the use of a single NovaSeq 6000 sequencer a limitation in the presented sequencing experiment, and it would be of interest to expand the sequencing to different platforms. However, the expensive platforms HiSeq X Ten and NovaSeq 6000 have low error rates and low levels of variation 19 . In addition, the NovaSeq 6000 sequencer is the only machine currently recommended by Illumina for large scale WGS. The just announced NovaSeq X series is not available on the market yet, and the HiSeq X Ten platform is not available anymore. Furthermore, the NovaSeq 6000 has excellent agreement in called variants not only within a genotyping center but also between genotyping centers 20 . Finally, the sequences were generated on a single NovaSeq 6000 by only two lab technicians, the resulting sequencing data generated are as homogeneous as possible.
Another limitation might be that DeepVariant was trained by incorporating all GIAB samples but HG003 and all autosomal data but chromosomes 20-22 21 . For our evaluation, we restricted our analysis to chromosomes 20-22 of HG002 to prevent overfitting because these chromosomes were excluded from all training and model selection steps for all samples for DeepVariant. Sensitivity analyses involved GIAB sample HG003 (Supplementary Fig. S2), for which all data were not included in the training of DeepVariant. Finally, we also analyzed all chromosomes of HG002 (data not shown). In all three cases, the performance of DeepVariant differed only marginally.
Future studies might consider constructing a super learner in the variant calling step involving the two bestperforming variant callers in this study, DeepVariant and DRAGEN.

Conclusions
The DRAGEN mapper and aligner had higher accuracy than the GATK with BWA-MEM2 mapper and aligner. DeepVariant and DRAGEN performed similarly for SNV and Indel variant calling, and both outperformed GATK variant calling pipelines. The DRAGEN pipeline showed the lowest percentage of Mendelian inheritance errors and had the shortest execution times. Because of accuracy and speed, we recommend the use of the DRAGEN for secondary analysis of WGS data.
The library was constructed according to Illumina TruSeq DNA PCR Free Library Prep protocol HT (Illumina Inc., San Diego, CA, USA) for whole genome sequencing. Briefly, the protocol steps were: (1) fragmentation of 1 μg genomic DNA to 350 bp inserts by Covaris LE220-plus, (2) cleanup of fragmented DNA, (3) repair ends, (4) removal of large and small DNA fragments, (5) 3′-end adenylation and (6) adapter ligation. The resulting library was quantified and quality-assessed with the iSeq100 (Illumina). The GIAB samples were sequenced with the NovaSeq 6000 platform (Illumina) using S4 flow cells with 300 cycles (2 × 150 reads) and measured 2 × to reach an average coverage of 35×.
Variant calling pipelines. The raw sequencing files (base call file) were converted to fastq format and demultiplexed in a single step using Illumina's bcl2fastq program from DRAGEN 3.8.4 on a single DRAGEN www.nature.com/scientificreports/ computer, version 2 22 . First steps in secondary analysis, including mapping and alignment, sorting, duplicate marking, and base quality recalibration were done using the GATK pipeline (version 4.2.4.1) 23 and the DRA-GEN pipeline (version 3.8.4) 24 (Fig. 1). The GATK best practices workflow was applied 23 . Read trimming for the GATK pipeline was performed with BBDuk (version 38.90) using the following parameters: ktrim = r k = 23 mink = 11 hdist = 1 tpe tbo threads = 8 trimpolygright = 10 25 . For the DRAGEN pipeline, the integrated soft read trimer with default parameters was used. The human reference genome hs38DH was used for mapping, which contains the primary assembly of GRCh38 plus ALT contigs, additional decoy contigs and HLA genes 26 . BWA-MEM2 2.2.1 was employed for mapping and alignment 27 , which is faster than BWA-MEM 27 . In the GATK base quality score recalibration step, the genome was split into 64 fractions for higher computational efficiency (one fraction per core). In the variant calling step, 25 fractions were used. Variant calling for SNVs and Indels was performed using the GATK HaplotypeCaller not in the DRAGEN mode, the GATK HaplotypeCaller in the DRAGEN mode 28 , DeepVariant v1.1.0 7 , and DRAGEN 24 . Alignment and variant calling algorithms were combined to 6 different pipelines ( Fig. 1) Statistics. Precision, recall and F 1 scores were estimated for SNVs and Indels using all regions and regions stratified by difficult-to-map, simple-to-map and coding and non-coding regions. All comparisons were made against the GIAB truth set 31,32 , available from ftp.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/. Mappability of Indels was stratified by Indel size (1-5 bp, 6-15 bp, > 15 bp). Ti/Tv ratios were calculated to indicate potential sequencing error. The target sensitivity was varied and fixed to 99.75% for Indels and 99.95% for SNVs. Figures for varying target sensitivities are shown in Supplementary Fig. S6. Results from the 70 GIAB sample sequencing repetitions are summarized in boxplots. Performance metrics were calculated using hap.py v0.3.14 with vcfeval engine, available from github.com/Illumina/hap.py. The numbers of passed/failed variants, SNVs, insertions and deletions for each pipeline were calculated using RTG tools (version 3.12.1) 33 . The statistical analysis was performed with R (version 4.1.1) to calculate mean differences and 95% confidence intervals between runtimes (Supplementary Table S1) and performance metrics (Supplementary Table S2).

Genome regions for stratification analyses.
To compare the performance of the calling algorithms in simple-to-map, sometimes termed easy-to-map, and difficult-to-map regions of the human genome, we stratified the data in genome-specific regions by using complexandSVs_alldifficultregions and notin_complexan-dSVs_alldifficultregions 32 . Stratification for coding regions was based on notinrefseq_cds and refseq_cds stratifications. These annotations are available at github.com/genome-in-a-bottle/genome-stratifications.

Data availability
One trio dataset generated during the current study is available in the Sequencing Read Archive (SRA) repository, accession number: PRJNA907182. All data are available from the corresponding author on reasonable request for collaborative projects. www.nature.com/scientificreports/