Accurate detection of mosaic variants in sequencing data without matched controls

Abstract

Detection of mosaic mutations that arise in normal development is challenging, as such mutations are typically present in only a minute fraction of cells and there is no clear matched control for removing germline variants and systematic artifacts. We present MosaicForecast, a machine-learning method that leverages read-based phasing and read-level features to accurately detect mosaic single-nucleotide variants and indels, achieving a multifold increase in specificity compared with existing algorithms. Using single-cell sequencing and targeted sequencing, we validated 80–90% of the mosaic single-nucleotide variants and 60–80% of indels detected in human brain whole-genome sequencing data. Our method should help elucidate the contribution of mosaic somatic mutations to the origin and development of disease.

Fig. 1: Framework of MosaicForecast to detect mosaic SNVs from bulk sequencing data.
Fig. 2: Comparison among algorithms.
Fig. 3: Impact of read depth on sensitivity and detection of mosaic indels.

Data availability

The WGS data are available at the National Institute of Mental Health Data Archive (https://nda.nih.gov/study.html?id=644).

Code availability

MosaicForecast is implemented in Python and R. The source code, documentation and examples are available at https://github.com/parklab/MosaicForecast/.

Acknowledgements

This work was supported by National Institutes of Health grants (nos. U01MH106883, R01NS032457, T32HG002295 and T32GM007753); by the Harvard Ludwig Center; and by a Horizon 2020 grant (no. 703543). We thank C. Chen, H. Gold, C. Chu, V. Viswanadham and G. Nelson for their helpful comments.

Author information

Y.D. developed the algorithm and performed the analysis, under supervision by P.J.P. M.K. generated call sets from MuTect2 and GATK haplotype callers. R.E.R. and R.D. evaluated candidate sites with targeted sequencing, supervised by C.A.W. I.C.C., L.J.L., A.G., C.B. and M.K. helped to refine the algorithm and contributed to editing of the manuscript. Y.D. and P.J.P. wrote the manuscript. All authors discussed the results and contributed to the final manuscript.

Correspondence to Peter J. Park.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 MuTect2 has high sensitivity in detecting mosaics.

(a) MuTect2 detected more simulated mutations than GATK-HC-p2, GATK-HC-p5 and MosaicHunter at different allele fractions and read depths in non-RepeatMasker regions. Variants with >0.4 VAF were excluded from the call sets shown. (b) MuTect2 detected more real mosaics than the other tools considered, as confirmed by single cell sequencing from three individuals (1465, 4638 and 4643). These include sites in both RepeatMasker and non-RepeatMasker regions.

Supplementary Figure 2 A panel-of-normals strategy is effective in removing germline variants and recurrent technical artifacts.

(a) Among the variants present in the PoNs, 82.3% were present in ≥3 individuals and are thus most likely technical artifacts or common variants. Furthermore, although 17.7% of the recurrent variants were called only twice by MuTect2 in the 75 individuals, about half of them have ≥3 alt alleles in ≥3 individuals (checked using samtools mpileup), indicating that these are most likely artifacts. (b) The MosaicForecast-Refine model trained with our brain WGS data was applied to call mosaics from the MuTect2 call set (using data from the three individuals with single cell sequencing data available). Different variant filtering strategies were used: not using a panel-of-normals (left), filtering variants called in >2 samples (middle) and filtering variants called in ≥2 individuals (right). Called variants were evaluated using single cell sequencing data and IonTorrent data. Removing recurrent variants did not result in the loss of any true variants, and the validation rate was ~10% higher.
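
The recurrence filter compared in panel (b) can be sketched in a few lines of Python. This is only an illustration of the idea, not the MosaicForecast code: the function name `filter_recurrent`, the `(chrom, pos, ref, alt)` variant key and the `min_individuals` parameter are assumptions introduced here.

```python
from collections import defaultdict


def filter_recurrent(calls, min_individuals=2):
    """Remove variants that were called in at least `min_individuals` individuals.

    `calls` maps an individual ID to a set of variant keys
    (chrom, pos, ref, alt). This layout is hypothetical.
    """
    # Record which individuals carry each variant.
    carriers = defaultdict(set)
    for individual, variants in calls.items():
        for variant in variants:
            carriers[variant].add(individual)

    # A variant seen in several individuals is treated as recurrent.
    recurrent = {v for v, inds in carriers.items() if len(inds) >= min_individuals}

    # Keep only the non-recurrent calls for each individual.
    return {ind: {v for v in variants if v not in recurrent}
            for ind, variants in calls.items()}


# Toy call sets for three individuals (made-up coordinates).
calls = {
    "1465": {("chr1", 100, "A", "G"), ("chr2", 500, "C", "T")},
    "4638": {("chr1", 100, "A", "G"), ("chr3", 900, "G", "A")},
    "4643": {("chr4", 42, "T", "C")},
}
filtered = filter_recurrent(calls, min_individuals=2)
```

Under these assumptions, `min_individuals=2` corresponds to the strictest setting in the caption (variants called in ≥2 individuals), and `min_individuals=3` to filtering variants called in >2 samples.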

Supplementary Figure 3 Candidate mosaic variants and nearby germline SNVs were used to do local haplotype-phasing.

Representative IGV plots showing (a) a non-diploid region phased with two nearby germline SNVs, (b) reads phased to three haplotypes using a potential mosaic variant and a nearby germline SNV, (c) a diploid region phased with a potential heterozygous germline variant and a nearby germline SNV and (d) reads phased to >3 haplotypes using a potential false positive variant and a nearby germline SNV. Red triangles indicate candidate mosaics, and blue triangles indicate nearby germline SNVs.
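
The haplotype counting behind these plots can be illustrated with a small Python sketch. It assumes each read spanning both sites has already been reduced to a pair (allele at the candidate site, allele at the germline SNV); the function names and the `min_reads` support threshold are hypothetical and are not taken from MosaicForecast.

```python
from collections import Counter


def count_haplotypes(read_alleles, min_reads=2):
    """Count distinct allele combinations supported by at least `min_reads` reads.

    `read_alleles` is an iterable of (candidate_allele, germline_allele)
    tuples, one per read covering both sites. `min_reads` is an assumed
    threshold that ignores combinations seen in a single stray read.
    """
    support = Counter(read_alleles)
    return sum(1 for n in support.values() if n >= min_reads)


def label_site(n_haplotypes):
    """Map a haplotype count onto the labels used in the figure captions."""
    if n_haplotypes == 2:
        return "hap=2"   # consistent with a germline-heterozygous variant
    if n_haplotypes == 3:
        return "hap=3"   # consistent with a mosaic variant
    if n_haplotypes > 3:
        return "hap>3"   # more combinations than a diploid region allows
    return "unphased"


# Toy example: the alt allele appears on only part of the 'G' haplotype.
reads = [("ref", "A")] * 10 + [("ref", "G")] * 8 + [("alt", "G")] * 3
label = label_site(count_haplotypes(reads))
```

In this toy input the alternative allele is found on a subset of the reads carrying the germline G allele, so three combinations are supported and the site is labeled `hap=3`.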

Supplementary Figure 4 Evaluation of candidate mosaics in single cell sequencing data.

(a) The four types of variants have different VAF characteristics among multiple single cells. Although single cell sequencing data can suffer from amplification-induced allelic imbalance and allele dropout caused by amplification failure, germline-heterozygous variants are typically present in most cells, and their average allele fraction among multiple cells is about 50%. In contrast, mosaic variants would be present in only a subset of cells, refhom sites would be absent from all cells, and germline variants within CNV/repeat regions would typically be present in multiple cells, with an average allele fraction among mutant cells much lower than 50%. (b) IGV plots for reads covering a real mosaic variant site in 10 single cells. This variant has ~17% VAF in the WGS bulk sequencing data. (c) An example showing a heterozygous variant judged by single cell sequencing data. (d) An example showing a CNV/repeat variant (likely to be within regions with read-mapping problems) in single cells. (e) Comparison of average VAFs of single cell lineage-judged germline-heterozygous variants and single cell lineage-judged repeat variants (calculated with 20 single cells from 1465). Germline-heterozygous variants had much higher average VAFs among single cells than CNV/repeat variants.

Supplementary Figure 5 Validation rate for the Phasing Prediction Model and different read-level features for different phasable sites in the principal component analysis (PCA) space.

(a) Orthogonal validation rates (IonTorrent, single cells and trio information) for candidate mosaics phased to have three haplotypes (left bar), non-phasable candidate mosaics predicted by MosaicForecast-Phase as ‘hap=3’ (middle bar) and after further removing sites within the 19 Mb non-diploid region (right bar). The validation rates for variants within non-RepeatMasker regions (left) and RepeatMasker regions (right) are shown separately. (b) Scatterplot of phasable sites in 2D PCA space. 1,000 sites were randomly selected from all phasable sites to make this plot. (c) Contributions of the read-level features to the PCA components (in percentage): (var.cos2 × 100)/(total cos2 of the component). The squared cosine “cos2” indicates the contribution of a variable to the squared distance of the observation to the origin and shows the importance of a variable for a given observation. Features that are closer to the center are less important. Refer to Supplementary Table 1 for descriptions of the read-level features. (d) The weights for the top-ranked principal components are shown. The first five components, which capture the most variation in the dataset, explain ~50% of the total variance.
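
The percentage contribution defined in panel (c) is simple arithmetic, shown here as a short Python sketch. The function name `contributions` and the toy cos2 values are made up for illustration; only the formula (var.cos2 × 100)/(total cos2 of the component) comes from the caption.

```python
def contributions(cos2):
    """Convert squared cosines into percentage contributions per component.

    `cos2[i][k]` is the squared cosine of feature i on component k.
    Returns a matrix of the same shape in which each column sums to 100.
    """
    n_components = len(cos2[0])
    # Total cos2 of each component, summed over all features.
    totals = [sum(row[k] for row in cos2) for k in range(n_components)]
    return [[row[k] * 100 / totals[k] for k in range(n_components)]
            for row in cos2]


# Three hypothetical features on two components.
cos2 = [
    [0.6, 0.1],
    [0.3, 0.1],
    [0.1, 0.2],
]
contrib = contributions(cos2)
```

Because each column is normalized by its own total, a feature's contribution is always relative to the other features on the same component, which is why the values can be read as percentages.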

Supplementary Figure 6 Inclusion of read-level features in the multinomial logistic regression model enables the refinement of genotypes.

(a) Projection of the experimentally evaluated phasable sites onto the PCA space defined using all read-level features. True mosaics within RepeatMasker regions and non-RepeatMasker regions were projected onto similar positions in the PCA space. (b) The first five PCA components, which capture the most variation, were explained by different read-level features. For example, genotyping likelihoods, difference of base query positions for ref/alt alleles, difference of read mapping positions and difference of read mapping qualities for ref/alt reads were the most important features contributing to PC1. The squared cosine “cos2” indicates the contribution of a variable to the squared distance of the observation to the origin and shows the importance of a variable for the given observation. Color intensity and the size of the circle are proportional to cos2. (c) The multinomial logistic regression model we built by including the first five PCA components could effectively correct mislabeled phasable sites, as suggested by experimental validations. (d) Orthogonal validation rates of candidate mosaics predicted by the multinomial logistic regression (left bar), non-phasable candidate mosaics predicted by the MosaicForecast-Refine model (middle bar), and after further removing “clustered sites” within the 19 Mb non-diploid regions (right bar). The validation rates for variants in non-RepeatMasker regions (left) and RepeatMasker regions (right) are shown separately. (e) Top-ranked feature importance of read-level features of the RandomForest model (MosaicForecast-Refine).

Supplementary Figure 7 Candidate mosaics called by different tools evaluated as ‘refhom’ by single cell sequencing data were further evaluated by IGV-plot and IonTorrent.

(a) Variants called by different tools and evaluated as ‘refhom’ by single cell sequencing data were further separated into groups called by only one tool or by ≥2 tools. 24 variants were further evaluated with IonTorrent deep sequencing. While a large proportion of MosaicForecast calls were true mosaics, none of the variants called only by MuTect2 or GATK were true. (b) Before IonTorrent sequencing, mosaics called by each tool were extensively evaluated by IGV plotting. The IGV plots indicate that most variants called only by MuTect2 or GATK genotypers were within hard-to-map regions showing a number of mismatches and were unlikely to be true mosaics.

Supplementary Figure 8 MosaicForecast exhibits high predictive power to detect mosaics and is especially advantageous in detecting mosaics with low allele fractions.

(a) Real mosaic SNVs detected by different tools from three individuals (1465, 4638, 4643) with 250× WGS data, validated by multiple single cells as well as targeted sequencing. MosaicForecast detected more mosaic SNVs with lower allele fractions (AF≤0.05) than GATK HaplotypeCallers and MosaicHunter. (b) Among the candidate mosaics called by MosaicForecast, PCR-based samples have many more G>T mutations, which are known to be correlated with oxidative DNA damage, especially at sites in haploid chromosomes (chrX/Y in males). Haploid chromosomes contribute 50% less DNA for library preparation and sequencing, which is likely to be correlated with the bias. (c) Of the IonTorrent-validated mosaics detected by MosaicForecast, MosaicHunter or GATK HaplotypeCallers were able to detect only <65%, and a large fraction of low-allele-fraction mosaics (AF≤0.05) were missed by these tools.

Supplementary Figure 9 Read-based phasing analysis shows that MuTect2, GATK HaplotypeCallers and MosaicHunter have a higher false positive rate than MosaicForecast.

According to our evaluations in Fig. 1c, most variants with ‘hap=2’ should be germline-heterozygous variants, while most variants with ‘hap>3’ should be false positives. Call sets from MuTect2 and the GATK haplotype callers all had a large fraction of variants phased to have 2 or >3 haplotypes. MosaicHunter has a ‘linkage filter’ that removes all variants completely linked with one allele of nearby SNPs; this is why MosaicHunter detects fewer phasable sites and no sites with ‘hap=2’. MosaicHunter does not call variants from RepeatMasker regions, while all other tools call mosaics genome-wide.

Supplementary Figure 10 The brain WGS-trained model could be applied to detect mosaics with a wide range of VAFs based on simulated dataset.

(a) Simulated mosaic mutations at different allele fractions were generated in the 300× bam file from the HapMap sample (NA12878), and the bam file was then down-sampled to 50–250×. The observed allele fraction distributions (green) at different read depths are concordant with the expected allele fraction distributions (red, binomial sampling using the observed read depths and expected allele fractions). (b) Simulated mosaics with a wide range of simulated VAFs (0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4) were generated in the 250× data of NA12878 to evaluate whether the model can detect mosaics across a wide range of VAFs. False sites were a mix of germline-heterozygous variants and ‘repeat’ variants phased to have >3 haplotypes from MuTect2-PoN calls of the original bam file of NA12878. The brain-WGS trained model of MosaicForecast was applied to call mosaic variants from the dataset with a mix of simulated mosaics and false sites and used to generate the ROC curves. (c) We randomly selected and mixed simulated mosaic variants with expected allele fractions of 0.02, 0.03, 0.05, 0.1 and 0.3 following a realistic allele fraction distribution of early-embryonic mosaics (4:4:4:2:1) and down-sampled to 50–250× to mimic real early-embryonic mosaic mutations in non-tumor tissues. (d) VAF distribution of false sites in the down-sampled bam files of NA12878 (down-sampled from the original bam file without the simulated mosaics). To generate a set of false sites, sites were first called by MuTect2-PoN from the 50–250× bam files. Sites with <0.02 VAF or ≥0.4 VAF (calculated by MuTect2), sites tagged as ‘str_contraction’, ‘t_lod_fstar’ or ‘triallelic_site’ and sites present in the panel-of-normals were excluded. Sites phased by MosaicForecast as ‘hap=2’ and ‘hap>3’ were then mixed to form the false sites used to calculate the ROC curves.
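
The expected allele fraction distribution in panel (a) and the 4:4:4:2:1 mixture in panel (c) can be sketched with binomial sampling in plain Python. This is a toy model under stated assumptions, not the simulation pipeline used in the paper: the function names, the fixed read depth per site and the random seed are all introduced here for illustration.

```python
import random


def simulate_vaf(expected_af, depth, rng):
    """Draw one observed VAF by binomial sampling of alt reads at a given depth."""
    # Each of the `depth` reads carries the alt allele with probability expected_af.
    alt_reads = sum(rng.random() < expected_af for _ in range(depth))
    return alt_reads / depth


def sample_mixture(n_sites, depth, rng,
                   afs=(0.02, 0.03, 0.05, 0.1, 0.3),
                   weights=(4, 4, 4, 2, 1)):
    """Pick expected AFs in the 4:4:4:2:1 ratio and simulate an observed VAF for each."""
    chosen = rng.choices(afs, weights=weights, k=n_sites)
    return [(af, simulate_vaf(af, depth, rng)) for af in chosen]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
sites = sample_mixture(1000, 250, rng)
```

Lowering `depth` in this sketch widens the spread of observed VAFs around each expected value, which is the effect the down-sampled panels illustrate.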

Supplementary Figure 11 Models trained using sequencing data at one read depth could be applied to call mosaics from sequencing data at different read depths.

(a) Models trained using WGS brain data at 50–250× read depths were applied to call variants from 50–250× testing sets constructed with the HapMap sample (NA12878). As described above, the dataset contained simulated mosaics following a realistic allele fraction distribution of early-embryonic mosaics, as well as a mix of germline and other false mutations called by MuTect2-PoN. The models were robust across different read depths, as indicated by the ROC curves. (b) In the three individuals with single cell sequencing data available, reads for all candidate sites called in the 250× data were down-sampled to 50×, 100×, 150× and 200×, and all read-level features for non-phasable sites were extracted from the sampled reads. The brain-WGS RF models, trained with phasable sites at 50–250× read depths, were applied to non-phasable sites in the three individuals, and variants were evaluated with single cell sequencing data and IonTorrent sequencing. The models were robust across different read depths, as indicated by the high validation rates obtained, and tended to perform close to optimally when applied to test sets with similar read depths.
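
The caption does not say how reads were down-sampled. One common approach, sketched here purely as an assumption, is to keep each read independently with probability equal to the ratio of target to original depth. The function name and the representation of reads as a plain list are hypothetical.

```python
import random


def downsample(reads, original_depth, target_depth, rng):
    """Keep each read with probability target_depth / original_depth."""
    if target_depth > original_depth:
        raise ValueError("target depth cannot exceed original depth")
    keep_probability = target_depth / original_depth
    return [read for read in reads if rng.random() < keep_probability]


rng = random.Random(1)              # fixed seed for repeatability
reads = list(range(25000))          # stand-in read identifiers
sampled = downsample(reads, original_depth=250, target_depth=50, rng=rng)
```

With this scheme the retained read count varies slightly from run to run, so the realized depth is close to, but not exactly, the target.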

Supplementary Figure 12 The MosaicForecast model trained using MuTect2 calls from the brain WGS data augments other software’s abilities to pick up true mosaics.

When the brain-WGS trained RF model of MosaicForecast (based on MuTect2-PoN calls) was applied to identify mosaics in the call sets of other software tools, all of their validation rates increased substantially. We used a leave-one-out cross-validation strategy when making predictions (i.e., no variants from 1465 were used in the RF model when making predictions on 1465). Variants from GATK HaplotypeCallers and MosaicHunter were evaluated with single cell sequencing and IonTorrent sequencing, while additional variants detected with samtools mpileup were evaluated with single cell sequencing data alone.
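
The leave-one-out strategy described here holds out all variants from one individual at a time. A minimal Python sketch of that split follows; the function name and the dictionary layout of each variant record are assumptions, and model training itself is left out.

```python
def leave_one_individual_out(variants):
    """Yield (held_out, train, test) splits, one per individual.

    `variants` is a list of dicts that each carry an 'individual' key
    (a hypothetical record layout). No variant from the held-out
    individual appears in the corresponding training set.
    """
    individuals = sorted({v["individual"] for v in variants})
    for held_out in individuals:
        train = [v for v in variants if v["individual"] != held_out]
        test = [v for v in variants if v["individual"] == held_out]
        yield held_out, train, test


# Toy records for the three individuals named in the captions.
variants = [
    {"individual": "1465", "site": "s1"},
    {"individual": "1465", "site": "s2"},
    {"individual": "4638", "site": "s3"},
    {"individual": "4643", "site": "s4"},
]
splits = list(leave_one_individual_out(variants))
```

Splitting by individual rather than by variant avoids leaking information between sites that share the same sample and sequencing run.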

Supplementary Figure 13 Prediction and evaluation of mosaic deletions.

(a) Scatterplot of all phasable deletions in the 2D PCA space of read-level features. The ‘hap=3’ sites are well distinguished from other phasable sites. (b) Contributions of the read-level features to the PCA components (in percentage). Refer to Supplementary Fig. 5 for a description of the variable importance. (c) Weights of top-ranked principal components. The first five components, which capture the most variation in the dataset, explain ~50% of the total variance. (d) Representative IGV plots showing a validated mosaic deletion. To detect the mosaic deletions, we designed three pairs of primers, and the three resulting sets of PCR products were sequenced using IonTorrent. The variant under consideration was present only in the case and absent from the control sample (or present in the case at much higher proportions). (e-f) Representative IGV plots showing two validated mosaic deletions in the single cell sequencing data.

Supplementary Figure 14 PCA of the read-level features computed for the MuTect2 Insertions.

(a) Scatterplot of all phasable insertions in the 2D PCA space of read-level features. The “hap=3” sites are well distinguished from “hap=2” sites, but hard to distinguish from “hap>3” sites. (b) Contributions of the read-level features to the PCA components (in percentage). Refer to Supplementary Fig. 5 for a description of the variable importance. (c) Weights of top-ranked principal components. The first five components, which capture the most variation in the dataset, explain ~50% of the total variance.

Supplementary Figure 15 Evaluation of complex mosaic events in single cell sequencing data.

Representative IGV plots showing two validated mosaic events in the single cell sequencing data, including (a) a multi-nucleotide variant with two consecutive base substitutions and (b) clumped variants with a nearby SNP and an insertion.

Supplementary information

Supplementary Materials

Supplementary Figs. 1–15

Reporting Summary

Supplementary Table 1

Description of 31 read-level features

Supplementary Table 2

MuTect2-PoN call set and experimental evaluations

Supplementary Table 3

Orthogonal evaluation of candidate mosaics called by different tools in single-cell sequencing, targeted sequencing or trio sequencing data

Supplementary Table 4

Extrapolated genotypes with multinomial regression model for all phasable SNPs from brain WGS data, based on haplotype number and PCA components of read-level features

Supplementary Table 5

Genotype predictions of mosaic SNPs in brain WGS data based on the phasing prediction model

Supplementary Table 6

‘Clustered’ regions likely to be nondiploid genomic regions

Supplementary Table 7

Genotype predictions and evaluations of nonphasable sites from brain WGS data based on the MosaicForecast-Refine model

Supplementary Table 8

Lineage trees constructed with true mosaics in single-cell sequencing data were used to evaluate candidate sites

Supplementary Table 9

Candidate mosaics called by MosaicForecast evaluated by targeted deep sequencing

Supplementary Table 10

Genotype predictions of brain WGS data with different read depths of 50–200×

Supplementary Table 11

The RF model trained with MuTect2 calls could augment other tools’ ability to pick up true mosaics, based on validations

Supplementary Table 12

All phasable sites of mosaic deletions in brain WGS data

Supplementary Table 13

Genotype predictions of mosaic deletions in brain WGS data based on the phasing prediction model as well as the validation results

Supplementary Table 14

Genotype predictions and validations of mosaic insertions in brain WGS data based on the phasing prediction model

Cite this article

Dou, Y., Kwon, M., Rodin, R.E. et al. Accurate detection of mosaic variants in sequencing data without matched controls. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-019-0368-8
