Supplementary Figure 10: The brain WGS-trained model could be applied to detect mosaics with a wide range of VAFs based on simulated dataset. | Nature Biotechnology


From: Accurate detection of mosaic variants in sequencing data without matched controls

Supplementary Figure 10

(a) Simulated mosaic mutations at different allele fractions were generated in the 300× BAM file from the HapMap sample NA12878, and the BAM file was then down-sampled to 50-250×. The observed allele fraction distributions (green) at the different read depths are concordant with the expected allele fraction distributions (red; binomial sampling using the observed read depths and expected allele fractions).

(b) Simulated mosaics spanning a wide range of VAFs (0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4) were generated in the 250× data of NA12878 to evaluate whether the model can detect mosaics across this range. False sites were a mix of germline heterozygous variants and ‘repeat’ variants phased to have >3 haplotypes, taken from MuTect2-PoN calls on the original BAM file of NA12878. The brain WGS-trained model of MosaicForecast was applied to call mosaic variants from the combined set of simulated mosaics and false sites, and the calls were used to generate the ROC curves.

(c) Simulated mosaic variants with expected allele fractions of 0.02, 0.03, 0.05, 0.1 and 0.3 were randomly selected and mixed at a ratio of 4:4:4:2:1, following a realistic allele fraction distribution of early-embryonic mosaics, and down-sampled to 50-250× to mimic real early-embryonic mosaic mutations in non-tumor tissues.

(d) VAF distribution of false sites in the down-sampled BAM files of NA12878 (down-sampled from the original BAM file without the simulated mosaics). To generate the set of false sites, sites were first called by MuTect2-PoN from the 50-250× BAM files. Sites with VAF <0.02 or ≥0.4 (as calculated by MuTect2), sites tagged as ‘str_contraction’, ‘t_lod_fstar’ or ‘triallelic_site’, and sites present in the panel of normals were excluded. The remaining sites phased by MosaicForecast as ‘hap=2’ and ‘hap>3’ were then mixed as the false sites used to calculate the ROC curves.
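The expected allele fraction distributions in panel (a) are described as binomial sampling from the observed read depths and the expected allele fractions. The following is a minimal sketch of that idea, not the authors' code: the function name `expected_vafs`, its parameters and the fixed seed are illustrative assumptions, and each depth is assumed to be a positive integer.

```python
import random


def expected_vafs(depths, allele_fraction, seed=0):
    """Return one simulated allele fraction per site.

    For each site, the alt-read count is drawn from
    Binomial(depth, allele_fraction) and divided by the depth.
    Hypothetical helper for illustration; depths must be > 0.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    vafs = []
    for depth in depths:
        # One Bernoulli trial per read: is this read an alt-allele read?
        alt_reads = sum(1 for _ in range(depth) if rng.random() < allele_fraction)
        vafs.append(alt_reads / depth)
    return vafs


if __name__ == "__main__":
    # Example: 1,000 sites at a uniform depth of 250 and an expected VAF of 0.05.
    simulated = expected_vafs([250] * 1000, 0.05, seed=1)
    print(min(simulated), max(simulated))
```

In practice the `depths` list would hold the observed per-site read depths from a down-sampled BAM file, so the spread of the simulated fractions widens as depth falls.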
