Comprehensive benchmarking and guidelines of mosaic variant calling strategies

Rapid advances in sequencing and analysis technologies have enabled the accurate detection of diverse forms of genomic variants, represented as heterozygous, homozygous, and mosaic mutations. However, best practices for mosaic variant calling remain disorganized owing to the technical and conceptual difficulties of evaluation. Here we present a benchmark of 11 feasible mosaic variant detection approaches based on a systematically designed whole-exome-level reference standard that mimics mosaic samples, supported by 354,258 control positive mosaic single-nucleotide variants and insertion-deletion mutations and 33,111,725 control negatives. We identified not only the best practice for mosaic variant detection but also the condition-dependent strengths and weaknesses of current methods. Furthermore, feature-level evaluation and the combinatorial usage of features across multiple algorithms point the way toward both immediate and longer-term improvements in mosaic variant detection. Our results will guide researchers in selecting suitable calling algorithms and suggest future strategies for developers.

Supplementary Figure 3. Evaluation of SNV detection in paired samples. Sensitivity and the distribution of false positives of the eleven evaluated approaches for SNV detection are shown for three different sequencing depths (125×, 250×, and 500×), using 39 truth sets. Sensitivities in all possible combinations of expected VAF pairs were binned and shown as heat maps within each circle on the plane and axes. Points on the plane and on the x or y axis refer to shared and sample-specific variants, respectively. Observations along VAFs are shown on a log10 scale, and the dashed line marks 10%. Callers within the gray box used joint genotyping.

Supplementary Figure 5. Evaluation of INDEL detection in paired samples.
Sensitivity and the distribution of false positives of the eight evaluated approaches for INDEL detection are shown for three different sequencing depths (125×, 250×, and 500×), using 39 truth sets. Sensitivities in all possible combinations of expected VAF pairs were binned and shown as heat maps within each circle on the plane and axes. Points on the plane and on the x or y axis refer to shared and sample-specific variants, respectively. Observations along VAFs are shown on a log10 scale, and the dashed line marks 10%. Callers within the gray box used joint genotyping.

[1] Selection of ploidy in HaplotypeCaller and the ploidy effect
We used ploidy 20 and ploidy 200 in this benchmark based on the recommendation of a previous paper from the Brain Somatic Mosaicism Network (BSMN)1, which sets the ploidy to 20% of the overall sequencing coverage; our benchmark data ranged from 125× to 1,100×. We additionally tested two intermediate ploidy settings, 50 and 100, to inspect the effect of the ploidy value on mosaic SNV detection performance. We found a clear trade-off between sensitivity and specificity: higher ploidy options increased sensitivity but lost precision, and vice versa for lower ploidy options (Supplementary Figure 8), while the overall classification power (F1-score) remained similar. Therefore, the ploidy option can be set lower or higher than 20% of the overall coverage depending on the research purpose.
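The depth-to-ploidy heuristic above can be sketched in a few lines. The rounding to the nearest integer and the floor of 2 are our illustrative assumptions, not part of the BSMN recommendation (the benchmark itself rounded 25 and 220 down to the convenient values 20 and 200).

```python
def ploidy_for_depth(depth, fraction=0.2):
    """HaplotypeCaller --sample-ploidy set to `fraction` of the coverage.

    Follows the BSMN heuristic described above; the exact rounding
    (nearest integer, minimum diploid) is an illustrative choice.
    """
    return max(2, round(fraction * depth))

print(ploidy_for_depth(125))   # -> 25  (rounded down to 20 in the benchmark)
print(ploidy_for_depth(1100))  # -> 220 (rounded down to 200 in the benchmark)
```

As the benchmark shows, the exact value matters less than the sensitivity-precision trade-off it implies, so any value near 20% of coverage is a reasonable starting point.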
[2] The theoretical upper limits in performance evaluation
In performance evaluation, two types of false negatives (FN) can exist: (1) the true alternative alleles are present in the data, or (2) the alternative alleles are absent from the data. In case (2), the undetected sites are not truly false negatives arising from variant calling performance and should be distinguished from (1). To first investigate the positions of (2), we re-assessed the VAFs of the positions in the original cell line data to check whether the allele frequencies were sufficient for germline variants. All control positive sites (18,873) showed VAF > 25% (Supplementary Figure 9). Together with the robustness of the germline variant calling process in the original cell line sequencing2, this convinced us that all the FN calls were truly existing variants in the source of the benchmark dataset (cell line mixtures).
We then found that sequencing data of the cell line mixtures may not contain the alternative alleles even if the original cell lines did, particularly when only a small portion of DNA was mixed to generate low-VAF variants. For instance, 2% and 17% of the true positive sites had no alternative alleles for the 1% and 0.5% mixtures, respectively, at 1,100× coverage. No tool can call true variants at these sites, even in theory. Although undetectable, these FNs should be included in the final performance to represent real-world interpretation, because the same issue will be reproduced in all real analyses. Therefore, we calculated the theoretical upper limits of the calling performance for all evaluations of sensitivity and F1-score, to separate the FNs originating from miscalling from those that are theoretically undetectable (Supplementary Figure 10).
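The dropout effect can be approximated with a simple binomial model: if a variant ends up at expected VAF v in the mixture, the chance that a site covered by n independent reads carries zero alternative reads is (1 − v)^n. This is only an illustrative approximation, not the exact procedure behind Supplementary Figure 10, which counted actual alt-allele absence in the mixture data.

```python
def p_zero_alt_reads(vaf: float, depth: int) -> float:
    """Probability that no alternative allele is sampled at a site,
    assuming reads are independent draws at the expected VAF."""
    return (1.0 - vaf) ** depth

# Even at 1,100x, a very low VAF site (0.25%) has a noticeable dropout chance:
print(round(p_zero_alt_reads(0.0025, 1100), 3))  # -> 0.064
```

Under this toy model alone, a few percent of very-low-VAF sites would be theoretically undetectable at 1,100×; the observed rates (2% and 17%) also reflect the finite number of genome copies in the DNA mixture itself.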
[3] Generation of the in silico simulation dataset
We generated an in silico reference dataset by computational simulation to compare with the biologically generated ground truth data utilized in this study. The in silico mixture dataset is another version of the ground truth variant set, generated by mixing BAM files (instead of DNA) of the six cell lines. As in the biological reference standard, we generated 39 in silico mixtures with the same designated proportions (Supplementary Table 1). This is a much simpler and more cost-effective procedure than mixing DNA, because no actual sequencing of the 39 samples is required, and it is much easier to achieve the exact target mixture proportions.
However, when we assessed the error profiles of the in silico mixtures, we found that they (e.g., the positions and types of the sequencing errors) are confined to those in the original cell line sequencing, which greatly limits the variety of the noise (Supplementary Figure 11). Consequently, the overall evaluation cannot be conducted with the breadth of noise sources and levels faced in real-world settings. Therefore, we concluded that the mixture-derived reference standard is a more appropriate approach for a mosaic variant detection benchmark study.

Methods to generate the in silico dataset:
We collected the BAM files of the original source cell lines from the Sequence Read Archive under the accession code [PRJNA758606] and used Picard (v.2.2.2) DownsampleSam for downsampling and Samtools (v.1.14) merge for merging the BAM files. After generating the down-sampled BAM files according to the mixing ratios and combinations, 39 in silico reference datasets were generated.
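The per-file retention probability handed to Picard DownsampleSam can be derived from the designated mixing ratio and the source/target depths. A minimal sketch; the depths and ratio below are placeholders for illustration, as the real values come from the original cell line coverages and Supplementary Table 1.

```python
def downsample_probability(mix_fraction, target_depth, source_depth):
    """Fraction of reads to keep (Picard DownsampleSam P=) so that a source
    BAM contributes `mix_fraction` of a mixture sequenced to `target_depth`."""
    p = (mix_fraction * target_depth) / source_depth
    if p > 1.0:
        raise ValueError("source BAM is not deep enough for this mixture")
    return p

# e.g. a cell line contributing 10% of a 1,100x mixture from a 2,000x BAM:
print(round(downsample_probability(0.10, 1100, 2000), 6))  # -> 0.055
```

The down-sampled BAMs for one mixture are then combined with `samtools merge`, reproducing one of the 39 designated proportion combinations.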
To compare the error profiles in the reference data to those in the in silico simulation data, we generated pileups (Samtools mpileup v.1.14) for both datasets with default parameter settings to gather the raw allele counts. The unexpected alternative allele counts in each dataset were then collected at the reference-homozygous negative control sites whenever one or more alternative alleles were present. The Jaccard similarity of the errors in each dataset was then compared to all the other samples in a pairwise manner, yielding 741 comparisons (choosing two out of 39 samples).
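The pairwise comparison above boils down to a Jaccard index over sets of erroneous negative-control sites. A minimal sketch with hypothetical error sets; the (chrom, pos, alt) keying is our assumption about how error sites are identified:

```python
from itertools import combinations
from math import comb

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets of error sites, e.g. (chrom, pos, alt)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# 39 samples compared pairwise -> 741 comparisons, as stated in the text
assert comb(39, 2) == 741

errors = {  # toy per-sample error profiles
    "s1": {("chr1", 100, "T"), ("chr2", 5, "G")},
    "s2": {("chr1", 100, "T")},
}
for a, b in combinations(errors, 2):
    print(a, b, jaccard(errors[a], errors[b]))  # -> s1 s2 0.5
```

A high Jaccard index between in silico mixtures indicates that the same error sites recur across samples, which is the limitation reported in Supplementary Figure 11.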

Supplementary Figure 11. Comparison of error profiles in truth sets. (a)
Comparison of the error profiles between the mosaic reference standards and the in silico reference standards across 39 truth sets. The in silico reference standards were generated with identical combinations and mixing ratios. Unexpected alternative alleles at the non-variant negative control sites were calculated from the sequence data pileups, and their similarities were calculated by Jaccard index. M1, M2, and M3 refer to the three categories of the reference standards depending on the mixing combination.

[4] Defining the relative distance between samples in a developmental lineage tree
We demonstrated a new strategy that utilizes the relative distance in the developmental hierarchy among multiple (≥3) samples. By comparing the common ancestors of each pair of samples, we can define sample relatedness and their relative distance. First, we define the distance between a sample pair p1 = (A, B) in a phylogenetic tree using their most recent common ancestor c(A, B) (Supplementary Figure 12). We say a sample pair p1 = (A, B) is relatively proximal to a sample pair p2 = (A, C) if c(p2) = c(p1, p2) and c(p1) ≠ c(p1, p2). In other words, a sample pair is in a more proximal lineage if it branched out more recently. For example, two samples developed from the same germ layer (e.g., brain and skin, both from ectoderm) are more proximal than two from different germ layers (e.g., brain and skeletal muscle; ectoderm and mesoderm). Likewise, the relative distance of the samples in the three mosaic types (M1-M3) could be defined, wherein samples in a more proximal lineage indeed had more shared variants. Although cellular heterogeneity within a tissue (e.g., blood in tissue) can limit clarification of the origin of each tissue in the real world, a subset of well-characterized somatic mutations can still be utilized for reconstructing developmental lineages.
Supplementary Figure 12. Diagram to define the relative distance between samples. c(Sample1, Sample2) denotes the common ancestor of Sample1 and Sample2.
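The proximity rule above can be made concrete with a parent-pointer tree. The lineage below (zygote → germ layers → tissues) is a simplified illustration, not the actual lineage of the benchmark samples:

```python
PARENT = {  # toy developmental lineage; None marks the root
    "zygote": None,
    "ectoderm": "zygote", "mesoderm": "zygote",
    "brain": "ectoderm", "skin": "ectoderm",
    "skeletal_muscle": "mesoderm",
}

def ancestors(node):
    """Chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def mrca(a, b):
    """Most recent common ancestor c(a, b)."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("disconnected tree")

# brain/skin branched more recently than brain/skeletal_muscle,
# so (brain, skin) is the relatively proximal pair:
assert mrca("brain", "skin") == "ectoderm"
assert mrca("brain", "skeletal_muscle") == "zygote"
```

With c(p1, p2) computed this way, the pair whose own MRCA differs from the joint MRCA is the proximal one, matching the definition in the text.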
[5] The effect of capture bias on the exome-based benchmark result
To investigate the effect of capture bias on the performance evaluation, we first calculated the sequencing depths at the positive control positions of SNVs and INDELs in the 39 ground truth datasets (Samtools mpileup v.1.14). The average sequencing depth was 1,100× as calculated by Qualimap (v.2.2.1), but the depths at the positive controls deviated widely, resulting in lower averages (616.2× and 139.0× for SNVs and INDELs, respectively). We thus confirmed (1) variance across genomic positions and (2) consistently lower depth at INDEL sites (Supplementary Figure 13 and Supplementary Table 6). For INDELs, the fluctuation of the sequencing depths was especially high, with 121× at the first quartile and 684× at the third quartile (Supplementary Table 6). This is an intrinsic characteristic of capture-based whole-exome sequencing that reflects real-world performance and can affect overall accuracy.
In addition, we tested the extent to which coverage unevenness affects the performance evaluation. We divided the ground truth data of SNVs into three subsets based on base coverage: lowest 25%, middle 50%, and highest 25%, corresponding to <360×, 360-803×, and >803×, respectively. We then compared sensitivity, precision, and F1-score between the three subsets, observing that the effect of coverage variability depends on the VAFs (Supplementary Figure 14). For low-VAF variants (<10%), the overall performance (F1-score, rightmost column) was higher in the high-coverage regions. In contrast, we did not observe a significant effect for medium- and high-VAF variants. Overall, mosaic variant calling performance does vary within the exon regions of the same sequencing data, but the effect is limited to low-VAF variants. In the same context, we expect the reported variant calling performance to be similar or slightly higher in whole-genome or panel sequencing at the same target depth, because coverage is more uniform across the genome in WGS and within the targeted intervals in panel-seq, which reduces the portion of low-coverage regions compared to WES (Supplementary Figure 15). Therefore, assuming the same target depth, a slight increase in performance is expected in genome or panel sequencing.
Supplementary Figure 15. Coverage consistency in sequencing data targeting different sizes of genomic regions.
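The coverage-based stratification above reduces to two percentile cut-offs over the per-site depths. A minimal sketch with synthetic depths; the real cut-offs (360× and 803×) came from the actual depth distribution:

```python
def coverage_bins(depths):
    """Split site depths into lowest 25%, middle 50%, and highest 25%."""
    s = sorted(depths)
    n = len(s)
    lo_cut, hi_cut = s[n // 4], s[(3 * n) // 4]
    low = [d for d in depths if d < lo_cut]
    high = [d for d in depths if d >= hi_cut]
    mid = [d for d in depths if lo_cut <= d < hi_cut]
    return low, mid, high

low, mid, high = coverage_bins(list(range(1, 101)))  # synthetic depths 1..100
print(len(low), len(mid), len(high))  # -> 25 50 25
```

Sensitivity, precision, and F1-score are then computed separately within each bin, as in Supplementary Figure 14.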
[6] Possible noise in the ground truth data
To investigate the effect of noise introduced by the mixing procedure, we gathered the germline variants of the six cell lines: MRC5, RPE, CCD-18co, HBEC30-KT, THLE-2, and FHC. The merged number of germline variants was 62,719 within the target region of the whole-exome sequencing (SureSelect Human All Exon V6, Agilent Technologies, Inc., CA, USA). We calculated the distances between all positive controls (18,873) and the nearest germline variants in the source cell lines and observed an average of 1,772 bp (Supplementary Figure 16). Although the average distance was large, approximately 39% of the true variants were accompanied by germline variants within a read length (150 bp). Even though mutations can co-occur with germline or subclonal mutations within a read3, we acknowledge a potential over-representation arising from the use of multiple cell lines.
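The nearest-germline-variant distance per positive control can be computed with a bisect over sorted per-chromosome germline positions. A minimal sketch; the positions below are toy values, not data from the study:

```python
from bisect import bisect_left

def nearest_distance(pos, sorted_germline):
    """Distance from `pos` to the closest position in a sorted list."""
    i = bisect_left(sorted_germline, pos)
    candidates = sorted_germline[max(0, i - 1): i + 1]
    return min(abs(pos - g) for g in candidates)

germline = [100, 1_000, 5_000]   # toy sorted germline positions (one chrom)
positives = [130, 2_800, 5_149]  # toy positive-control positions
dists = [nearest_distance(p, germline) for p in positives]
within_read = sum(d <= 150 for d in dists) / len(dists)
print(dists, round(within_read, 2))  # -> [30, 1800, 149] 0.67
```

Applying this per chromosome over all 18,873 positive controls against the 62,719 merged germline variants yields the distance density shown in Supplementary Figure 16 and the ~39% within-a-read fraction.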
To systematically measure the potential effect, we calculated the sensitivities of all tools for only the variants with proximal (<150 bp) germline variants. For all tools except MT2 and MF, no significant differences were observed. In contrast, sensitivity was approximately 4.9% and 4.1% lower for MT2 and MF, respectively (Supplementary Table 7), implying a potential negative effect on these tools; we assume that the "clustered event" filtration strategy of MT2 makes these tools more sensitive to the presence of nearby germline variants. Again, 9-44% (27% overall) of subclonal variants are observed with nearby germline variants3, so the same decrease in sensitivity would be reproduced in real-world data; the effect attributable to our study design should therefore be only partial. Moreover, the overall conclusion and the final best practice were not affected, because MT2 and MF were already top-ranked with significant gaps from the others.
Supplementary Table 7. Difference of sensitivities for positive controls proximal to germline variants. A subset of positive controls from the 39 truth sets was selected where a control resides within a read length of a germline variant. The sensitivities of this subset were compared to the original sensitivities, and their differences are shown.
[7] The impact of populational frequency database utilization with germline variant-derived ground truth data
When evaluating tools that utilize populational frequency databases for filtering germline variants (MosaicHunter, DeepMosaic, and Mutect2), we had to exclude the entries of positive controls from those databases whenever they existed. The purpose of this modification was to prevent each tool from treating the true positive variants as known SNP positions. In doing so, we assumed that germline variant-derived true variants are equivalent to sporadically generated variants for the purpose of measuring variant calling accuracy.
To investigate whether this modification introduced bias into the benchmark result, we conducted the same benchmark analysis on a subset of the true variants that are not located at SNP sites: the source of these non-SNP true variants is the rare germline variants in the original cell lines. We built three subsets, one from each database ("panel_of_normal": 1000g_pon.hg38.vcf.gz by the Broad Institute; gnomAD: "hg38_gnomad211_gnomad.txt"; and dbSNP build 154), which retained 53%, 5%, and 0.8% of the original call set, respectively. We found that the evaluated performance was almost the same (Supplementary Figure 17), except for a few intervals with small numbers of data points, which shows that the germline-derived construction did not introduce a bias.
Moreover, we assessed whether restricting the true variant positions to non-SNP sites leads to another type of bias: mosaic variants that occur at known SNP sites have a chance of being filtered out. This may lead to overestimated sensitivity, as our true variant sets do not allow such filtration. However, we also assumed that the actual effect of the filtration is limited owing to the rarity of SNP sites with sufficiently high pAF (e.g., pAF > 0.001) compared to the genome size. To assess the number of possibly affected SNPs, we retrieved the sizes of the three pAF databases ("panel_of_normal", gnomAD, and dbSNP, in the same versions as above) stratified by allele frequency (Supplementary Table 8). We found that most of the entries were rare SNPs (pAF < 0.001). In total, 98,481,895 dbSNP and 51,447,633 gnomAD entries had pAF > 0.001. Thereby, we could calculate the probability that a random mosaic variant hits these sites as 3.1% and 1.6%, respectively (the entry counts divided by 3.2 Gb). In other words, even without allowing SNP regions, our benchmark dataset emulates >~97% of real-world cases, which makes the overall result reliable.
Supplementary Figure 17. Performance evaluation with the non-SNP-originated truth set. Three subsets of the positive controls, in which SNP-originated true variants were removed according to the three populational frequency databases, were used for evaluation. F1-scores of SNV performance are shown in eight VAF bins (<1%, 1%-2%, 2%-3%, 3%-4%, 4%-7.5%, 7.5%-9.6%, 9.6%-25%, and ≥25%) with the 1,100× data. PC, positive control used in the benchmark.
We further assume that the true effect is even smaller because (1) not all SNPs with pAF > 0.001 are filtered out (MosaicForecast uses the information only as an optional post-filter, while other tools use it as one of the supportive factors within classification models (DeepMosaic) or in the calculation of mosaic probabilities (MosaicHunter), along with more direct evidence from the sequencing data); and (2) the actual matches are even rarer, because tools consider the ref-alt allele pair for a match, which reduces the actual matches to roughly one-third. For the PoN, all matches are filtered regardless of pAF, and the matching probability was 0.08% (2,609,566 divided by 3.2 Gb).
Supplementary Table 8. Proportion of the positive controls matching the entries in populational allele frequency databases. The proportions of the positive controls in the 39 truth sets that matched the three populational allele frequency databases (dbSNP, gnomAD, and panel of normal; Methods) utilized by MosaicHunter, MosaicForecast, and Mutect2 are shown.
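The hit probabilities quoted above follow directly from the database entry counts divided by an approximate 3.2 Gb genome:

```python
GENOME_SIZE = 3.2e9  # approximate haploid genome length used in the text

db_entries = {
    "dbSNP (pAF > 0.001)": 98_481_895,
    "gnomAD (pAF > 0.001)": 51_447_633,
    "panel_of_normal (any pAF)": 2_609_566,
}

for name, n in db_entries.items():
    print(f"{name}: {100 * n / GENOME_SIZE:.2f}% chance of a random hit")
# -> dbSNP ~3.08%, gnomAD ~1.61%, PoN ~0.08%
```

These per-position probabilities are upper bounds; requiring the ref-alt allele pair to match, as the tools do, further reduces them by roughly a factor of three.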
[8] Performance evaluation with biological datasets and confirmation of the suggested recommendations
We conducted a performance evaluation of mosaic variant detection methods with additional, independent biological datasets to confirm the reproducibility of our benchmark results and recommendations (Table 1). Three independent biological datasets were selected, in which the answer sets were validated with orthogonal experiments. Utilizing multiple datasets was imperative, because the features of mosaic variants (e.g., variant allele frequency (VAF) and variant sharing) and the optimal detection methods can vary widely. Thus, we selected three previously published mosaicism studies with clearly validated answers, originating from different tissues and various study designs: deep-sequenced (~500×) multi-organ WES data (Kim et al., PLOS Genetics, 2022; BioData1)4, 250× WGS data of sperm and blood pairs (Breuss et al., Nature Medicine, 2020; BioData2)5, and 250× WGS data from a single neurotypical brain tissue generated by the Brain Somatic Mosaicism Network (Wang et al., Genome Biology, 2021; BioData3)1.
Supplementary Table 9. Independent biological datasets utilized for evaluation. Data types, tissues, original detection and validation methods, and the numbers of true and false answer sets are shown.
(1) VAF landscapes of the mosaic variants in the ground truth and biological datasets
In the same context as our Recommendation #1, the characteristics of mosaic variants can vary widely under each study design, which can directly affect detection performance. Among mosaic variant features, variant allele frequency (VAF) is one of the most critical factors that can alter the detection strategy. Thus, we first investigated the VAF distributions of the mosaic variants in this study (ground truth data) and in five independent mosaicism studies1,4-7, three of which were utilized for the additional evaluation (BioData1-3). As expected, the VAF distributions were highly dependent on the tissues that were sequenced (Supplementary Figure 18).
(4) Sample-specific mosaic variant detection in paired samples
Among the three biological datasets, we could attain sample-specific variants with validation results from BioData1, where the VAFs of the 25 true variants were distributed evenly from low (<10%) to high (>25%) (Supplementary Table 11). When an individual had more than two samples, the nine evaluated detection methods were applied to all possible sample pairs, and only the variants present in a single sample (and absent from all others) were collected. As a result, we confirmed that MT2 and STK2 showed the best F1-scores across all VAF ranges, including low VAF (<10%), as suggested in Recommendation #6 (Supplementary Figure 21). In other words, somatic methods that utilize joint genotyping supported by matched controls are currently the best strategy for sample-specific mosaic variant calling.

(5) Shared mosaic variant detection in paired samples
To evaluate shared variant detection performance with the biological datasets, we could apply BioData1 and BioData2, supported by 56 and 28 answer variants, respectively. The answer sets comprised true variants shared by two to six samples from an individual (Supplementary Table 12). Lastly, we recommended using MF with a rescue strategy for shared variant detection (Recommendation #8). The rescue strategy was suggested to improve the sensitivity of shared variant detection by rescuing misclassified calls, i.e., calls that were detected in only one of the samples and missed in the other. As expected, we confirmed that MF with the rescue strategy achieved remarkable performance as a single method on BioData2, covering all VAF areas, including the most challenging VL-VL area (very low in both samples, <5%) (Supplementary Figure 25). BioData1 could not be applied because no misclassified variants existed.
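A minimal sketch of the rescue logic described above: a call made in only one sample of a pair is kept as shared if the raw pileup of the other sample still shows supporting alternative reads. The data layout and the two-read cutoff are illustrative assumptions, not the exact MF post-processing:

```python
def rescue_shared(calls_a, calls_b, alt_depth_a, alt_depth_b, min_alt=2):
    """Return the shared call set after rescuing one-sided calls.

    calls_*:     sets of variant keys called per sample
    alt_depth_*: raw alternative read counts per variant key (from pileups)
    min_alt:     minimum alt reads to rescue a missed call (assumed cutoff)
    """
    shared = calls_a & calls_b
    for v in calls_a - calls_b:          # called only in A; check B's reads
        if alt_depth_b.get(v, 0) >= min_alt:
            shared.add(v)
    for v in calls_b - calls_a:          # called only in B; check A's reads
        if alt_depth_a.get(v, 0) >= min_alt:
            shared.add(v)
    return shared

calls_a = {"chr1:100A>T", "chr2:5G>C"}
calls_b = {"chr1:100A>T"}
print(sorted(rescue_shared(calls_a, calls_b, {}, {"chr2:5G>C": 3})))
# -> ['chr1:100A>T', 'chr2:5G>C']  (the one-sided call is rescued)
```

This is why no rescue was possible on BioData1: every true shared variant was either called in both samples or supported in neither, so the one-sided branch never fires.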

Supplementary Figure 13. Sequencing depths at SNV and INDEL positive controls in 1,100× exome data. The sequencing depths at all positive control SNVs and INDELs from the 39 truth sets are shown. Boxplots show the first and third quartiles with the median; whiskers represent the maximum and minimum.
Supplementary Table 6. Sequencing depths at SNV and INDEL positive controls. Three quartiles and the mean of sequencing depths at SNV and INDEL positive controls across the 39 truth sets are shown (1,100×).

Supplementary Figure 16. Distance between positive controls and the nearest germline variants in source cell lines. The densities of the log10-scaled distance were calculated from the 39 truth sets. Three quartiles are shown as boxplots with the minimum and maximum values as whiskers. The red dashed line marks 150 bp, the length of the sequencing reads utilized in this study.

For example, mosaic variants from Breuss et al. 2022 (a study in brain) were mostly enriched at low VAF (<5%), whereas those from Hsieh et al. 2020 (a study in heart) were distributed over a wider VAF range (5-25%). The only common characteristics were the general enrichment at low VAFs and the rapid decrease in the number of high-VAF variants. Since our benchmark covers diverse conditions and scenarios in mosaic variant calling, we selected the three biological datasets BioData1-3 (Kim et al. 2022, Breuss et al. 2020, and Wang et al. 2021) mentioned above. In this way, we secured not only validated variants with a varied range of VAFs but also variants covering various scenarios, such as balanced or imbalanced shared variants and sample-specific variants.
Supplementary Figure 18. The distribution of variant allele frequencies of true mosaic variants in the ground truth data and five other biological studies.
(2) Mosaic variant detection in a single sample

Recommendation #7:
For shared variant detection, although MosaicForecast and Mutect2 showed good overall accuracy, we want to highlight that the best-performing tool varied for each VAF range. We recommend also trying out the best tool for the expected VAF-pair area, if one exists.

Recommendation #8:
In the present situation, we recommend utilizing somatic callers (Mutect2 or Strelka2) for sample-specific variants and MosaicForecast with the rescue strategy for shared mosaic variant detection with paired samples.
Supplementary Figure 25. F1-score of MF with the rescue strategy in shared variant detection. Each VAF combinatorial area was based on four VAF ranges (very-low (VL): ≤ 5%, low (L): > 5% and ≤ 10%, medium (M): > 10% and ≤ 25%, and high (H): ≥ 25%).
The details of the three biological datasets are as follows. The multi-organ WES data (Kim et al., PLOS Genetics, 2022; BioData1)4 provided 130 validated positions, 117 of which were validated as true, with the majority shared by two or more organs. We also utilized 250× WGS data of sperm and blood pairs from eight individuals (Breuss et al., Nature Medicine, 2020; BioData2)5, with 57 true shared mosaic variants. Lastly, we acquired 250× WGS data from a single neurotypical brain tissue, followed by robust and extensive validation experiments, conducted by the Brain Somatic Mosaicism Network (BSMN) (Wang et al., Genome Biology, 2021; BioData3)1; from this dataset, we obtained 43 true positives and 357 false positives derived from various error sources. Detailed information on the three studies and their answer sets is described in Supplementary Table 9.