Sharing genetic variants with the NGS pipeline is essential for effective genomic data sharing and reproducibility in health information exchange

Genetic variants causing underlying pharmacogenetic and disease phenotypes have been used as the basis for clinical decision-making. However, due to the lack of standards for next-generation sequencing (NGS) pipelines, reproducing genetic variants among institutions is still difficult. The aim of this study is to show how many important variants for clinical decisions can be individually detected using different pipelines. Genetic variants were derived from 105 breast cancer patient target DNA sequences via three different variant-calling pipelines. HaplotypeCaller, Mutect2 tumor-only mode in the Genome Analysis ToolKit (GATK), and VarScan were used in variant calling from the sequence read data processed by the same NGS preprocessing tools using Variant Effect Predictor. GATK HaplotypeCaller, VarScan, and MuTect2 found 25,130, 16,972, and 4232 variants, comprising 1491, 1400, and 321 annotated variants with ClinVar significance, respectively. The average number of ClinVar significant variants in the patients was 769.43, 16.50% of the variants were detected by only one variant caller. Despite variants with significant impact on clinical decision-making, the detected variants are different for each algorithm. To utilize genetic variants in the clinical field, a strict standard for NGS pipelines is essential.

Genome or exome sequencing using next-generation sequencing (NGS) technologies has now entered medical practice 1 . Genetic variant databases for clinical applications were built on numerous studies of human genetic variants affecting response to medications associated with diseases and phenotypes [2][3][4] . As guidelines for the interpretation of sequence variants have been established, clinical laboratories now perform genetic testing for therapeutic decision-making and disease prediction. Nonetheless, the construction of uniform standards for NGS pipelines is difficult because of various genetic testing techniques, different experimental goals, and numerous algorithms 5 . As a result, clinical laboratories and medical institutions have generated patients' genetic variants through different sequencing protocols and NGS pipelines, leading to genetic variants that are not interoperable.
The current gold standard for variant-calling pipelines is the Genome Analysis Toolkit (GATK) Best Practices Workflow pipeline using HaplotypeCaller, which is considered to have the highest accuracy for single nucleotide polymorphisms (SNPs) and small insertions and deletions 6,7 . However, the development of numerous NGS sequencing technologies, such as Illumina and BGI, has caused data-specific effects, making it difficult to build a uniform pipeline 8,9 . Data-specific effects cause false positive detection due to unexpected systematic error patterns in the HaplotypeCaller algorithm using GATK Best Practices 10 . Therefore, it is difficult to build NGS pipeline guidelines and make genetic variants interoperable in clinical practice.
The importance of reliable genetic data communication between hospitals and clinical genomic data sharing to improving genetic health care is widely recognized, and the practice has been encouraged by both professional societies and funding agencies 11 . Before sharing genetic variant data derived from raw sequencing data, the validity of the variant-calling pipeline result must be verifiable. However, different NGS pipelines among institutions produce different variant calling results despite the same raw sequencing data, causing serious problems in clinical decision-making and genetic variant sharing. Hence, diagnostic genetic tests used as a basis for clinical decision-making should be reproducible or replicable 12 . The deleteriousness of the called variants. To infer the importance of genetic variants, we annotated the deleterious values of the SIFT, PolyPhen, and CADD algorithms that predict the intolerance of the variant by the conservation between species. For variants called using GATK HC , MuTect2 and VarScan, 2224, 1960,  and 40 variants were annotated with SIFT, 2345, 2078, and 41 with PolyPhen, and 435,999, 307,152, and 16,719 with CADD, respectively (Fig. 2). Among the variants annotated using SIFT, 363 (16.32%), 342 (17.45%), and 9 (22.50%) deleterious variants were observed with scores < 0.05 for GATK HC, VarScan, and MuTect2, respectively. Of the variants with annotated PolyPhen scores, the numbers of deleterious variants with scores > 0.95 were 120 (5.12%), 109 (5.25%), and 1 (2.44%) for GATK HC, VarScan, and MuTect2, respectively. Among the  www.nature.com/scientificreports/ To visualize the distribution of differentially detected clinically significant variants, individual distributions of patients with mutations are presented in a Venn diagram. In Fig. 3, ClinVar is based on variants corresponding to drug_response, likely_pathogenic, pathogenic, protective, and risk_factor. Truncation is based on variations whose consequence is the loss of function. The SIFT score was 0.05 or less, the PolyPhen score was 0.85 or more, and the CADD score was 15 or more.
To characterize the differently called variants, we reviewed variants that included significant consequences, deleteriousness scores, and ClinVar annotations that GATK HC found but VarScan did not (Table 3). ABCA4 is an ATP-binding cassette (ABC) transporter (OMIM 601691; GenBank U88667). Diseases associated with ABCA4 include age-related macular degeneration and Stargardt disease 13,14 . Diseases associated with DHCR7 include Smith-Lemli-Opitz Syndrome and holoprosencephaly. There is much evidence associating the variant    8,9 . Therefore, we suggest that the entire pipeline throughout the variantcalling process, including raw sequencing data, should be shared to enhance the reproducibility of the genetic variants. All processing included in the NGS pipeline, such as the version of the programs, options, and additional files with each version, should be shared to reproduce or replicate the same genetic variant from the raw sequence. Of the genetic variants called by different NGS pipelines, we quantified how many important variants were missed, affecting clinical decision-making. As a result, we found that important variants affecting clinical decisions are found quite differently according to the variant-calling algorithm. Several studies suggest that the result of variant calling differs by NGS preprocessing and variant-calling pipeline 25,26 . Moreover, the result of variant calling is different for different sequencers, despite using the same raw sequence data and NGS pipeline. Nevertheless, establishing a guideline with a uniform NGS pipeline for a single best practice is difficult because the performance of NGS pipelines differs by sequencer, purpose of the sequencing, and characteristics of the sample 27 . Therefore, there is the risk of making a clinical decision with a genetic variant in an institution that does not perform NGS pipeline because the institution cannot reproduce the result of the variant calling. Hence, details of the NGS pipeline for the entire variant-calling process are essential.
To evaluate the significance of the variants called by three different variant caller algorithms, GATK Haplo-typeCaller, MuTect2, and VarScan, we used the consequence, deleteriousness score, and ClinVar classification. Consequences of variants, referred to as loss-of-function mutations, can be divided into truncation and nontruncation mutations. Truncation mutations have a profound impact on the loss of gene function. SIFT, PolyPhen, and CADD scores are algorithms that measure deleteriousness of genes based on conservation and protein structure. ClinVar annotated variants are clinically significant genetic variants categorized into pathogenic, drug response, risk factor, and more, which are important information in making clinical decisions. Truncation mutations, deleterious variants, and clinically significant variants have different results depending on the variant-calling algorithm, even though they are variants that have a large effect on gene function (Fig. 3). Thus, NGS pipelines that produce different variant calling results can have a significant impact on clinical decisions based on genetic variants.
Our study has some limitations. We only measured variant differences based on variant callers. From the read alignment algorithm to the final variant-calling process within the entire NGS pipeline, various factors can affect variant calling. We could not test all of them due to the combination explosion, but we focused on variant calling. A replication study of the genetic testing pipeline used in hospitals is needed. From the NGS pipeline information used in hospitals, we need to test whether the variant calling results can be reproduced from the same raw sequence data.
In conclusion, our results show that clinically important variants are differently called by variant callers, thus affecting clinical decisions. This means that variant calling outcomes are not reproducible without detailed NGS pipeline information. Therefore, we suggest that the pipeline throughout the variant-calling process, including raw sequencing data, should be shared for effective genetic variant sharing and clinical decision-making. Pre-processing of DNA resequencing data. The raw FASTQ files and paired sequence data were aligned to the human genome hg38 assembly using the Burrows-Wheeler Aligner, BWA program, version 0.7.12, and were transformed into a sequence alignment map (SAM) format 30 . Using SAMtools version 1.9, sequence data in SAM format was compressed into Binary Alignment Map (BAM) format by view command, and the aligned sequence reads were sorted with leftmost coordinates by sort command. Read groups are added to aligned sequence files using the' AddOrReplaceReadGroups' module in Picard. Next, SAMtools was used to prepare index referencing and BAM files 31 . After preparing these files, GATK version 3.8 was used to perform Realigner Target Creator and Indel Realigner to locally realign regions containing insertions and deletions to correct misaligned reads 6 . Base quality scores were adjusted using GATK BaseRecalibrator with the dbSNP build 138 and 1000-genome gold standard indels provided by the GATK Resource bundle standard files for working with human resequencing data (https ://softw are.broad insti tute.org/gatk/downl oad/bundl e) 32,33 . The sequencing target section was extracted using the bedtools intersect version 2.26 with indexing 34 . Finally, Picard MarkDuplicates v1.93 was used to identify duplications with the option to flag and remove duplicate reads.

Small variant detection.
After preprocessing the DNA sequencing data, we detected single nucleotide variants (SNVs) using three algorithms. VarScan and GATK HaplotypeCaller (HC) were used to find genetic variants between the sample DNA sequence compared with the reference sequence 35 . Somatic variants were called using GATK MuTect2 36 . Variants called by a mixture of germline and somatic variant calling tools were compared based on the assumption that NGS pipeline information was not properly shared during the communication process for the genetic variant of the patient. Reference genome databases, dbSNP build 138, and COSMIC, a source of commonly mutated genes, were used for the variant-calling argument.
Genetic variant annotation. The Ensembl Variant Effect Predictor (VEP) was used to determine the effect of genetic variants derived from the three variant callers, HC, MuTect2, and VarScan 37 . The mutation consequence, SIFT score, PolyPhen score, CADD score, and ClinVar annotations were determined to examine the effect of differently called variants on variant callers 3,38-40 . Consequences were divided into truncating and non-truncating mutations. While truncating mutations included nonsense mutations, frameshift deletions, frame shift insertions, and splice-site mutations, non-truncating mutations included missense mutations, inframe deletions, in-frame insertions, and nonstop mutations. To evaluate the significance of genetic variant effects, SIFT, PolyPhen, and CADD algorithms for predicting the deleteriousness of variants were used. SIFT score < 0.05, PolyPhen > 0.95, and CADD > 15 were defined as deleterious variants. The clinical significance of genetic variants was cataloged by making comparisons in ClinVar (http://www.ncbi.nlm.nih.gov/ClinV ar/).