INTRODUCTION

A growing body of evidence implicates somatic mosaicism in the etiology of many human genetic disorders, including both cancer and Mendelian conditions.1,2,3,4,5,6,7,8 If a pathogenic single-nucleotide variant (SNV) or copy-number variant (CNV) occurs during any of the ~10^16 mitotic postzygotic cell divisions, the resulting different cell populations can manifest clinically.9 If present in the parental germline cells, the variant can be transmitted to the offspring.10,11,12,13,14

Exome sequencing (ES) has been used extensively in both clinical settings and research studies; however, to date, only a few reports have described more in-depth analyses of somatic mosaicism. Recently, Wright et al. analyzed the trio ES data of 4293 probands mainly with developmental disorders and identified ~3% causative variants exhibiting postzygotic mosaicism.15 We have analyzed a cohort of ~12,000 samples submitted for clinical ES and identified clinically relevant somatic mosaic variants in ~1.5% of probands.16

In 2014, we described low-level (<10%) parental somatic mosaicism for CNV deletions detected in 4 of 100 unrelated families,17 and more recently, we presented accurate methods for detection and validation of mosaic CNVs.18,19 Corroboratively, SNV studies in multisibling families using genome sequencing revealed that in parental germline, 3.8% of SNVs were mosaic, resulting in 1.3% of variants being shared by siblings.20,21 Notably, the level of somatic mosaicism in the parental blood samples has been shown to positively correlate with the overall recurrence risk.20,21,22 In ES data, parental mosaicism was detected in 0.3–0.5% of the analyzed family trios.15,16 Most recently, Breuss et al. reported that autism risk in offspring could be assessed through quantification of male sperm mosaicism, further indicating the correlation between the level of mosaicism and disease recurrence risk.23

Here, we have studied ES data of almost 2000 unrelated trios from the Baylor-Hopkins Center for Mendelian Genomics (BHCMG) cohort at Baylor College of Medicine (BCM) and from Baylor Genetics (BG) Laboratories at BCM. We describe a new approach to identify low (<10%) and very low (<1.0%) level somatic mosaicism in the parents and provide a classification tool enabling more accurate assessment of the level of somatic mosaicism in ES samples.

MATERIALS AND METHODS

Ethics statement

The research studies at BHCMG were approved by the Institutional Review Board (IRB) for Human Subject Research at BCM under the protocol H-29697. All analyzed samples were coded. All studied BG samples were de-identified using the IRB waiver protocols H-41191 and H-42680. To study different somatic tissues, written informed consent was obtained from nine participants or their legal guardians. The research was IRB approved at BCM under the protocol H-28088.

Baylor-Hopkins Center for Mendelian Genomics data set

ES was performed previously on a research basis in 7790 individuals enrolled in BHCMG at BCM to accelerate the discovery of a variant allele and contributory genetic locus underlying a wide range of Mendelian conditions (http://bhcmg.org/, accessed June 2019). To study low-level parental somatic mosaicism, we have selected ES data with the complete BAM (reads were mapped to GRCh37.p13) and VCF files from 823 family trios included in the BHCMG cohort. DNA samples were processed according to the protocols previously described.24 In addition, all variants identified by the Mercury pipeline (v3.2)25 were also annotated using Variant Effect Predictor (VEP, v96)26 that incorporates GENCODE release 19 for gene annotations. Average read depth across analyzed samples was ~90× with > 95% having 20× base coverage.

Selection criteria for the search of candidate mosaic variants and quality control

To identify low-level parental somatic mosaic variants, we have performed a two-step filtering (Fig. 1). First, we have analyzed the VCF files to select variants for which probands were found to be heterozygous. To this end, we calculated the variant allele fraction (VAF, defined as the proportion of alternate allele reads relative to the total number of reads at the variant position) for each variant. In our recent study, we showed that more than 95% of apparent de novo autosomal SNVs and X-linked SNVs in females have a VAF between 36% and 64% by next-generation sequencing (NGS) analysis.16 Here, to eliminate genotype calls erroneously classified as heterozygous, we have used stricter criteria and removed variants with a VAF below 30% or above 70%. In addition, we have required that variants with VAF between 30% and 70% in the probands were not simultaneously reported by the Atlas2 variant caller (v1.4.3)27 in the parental samples or, if detected in the parents, had a VAF below 10%. Second, variants with a total depth of coverage below 20× in any sample from a given trio were excluded from further analyses. Subsequently, for each selected SNV, we have retrieved pileup information from the proband and parental BAM files, which enabled us to obtain more precise data on read depth and VAF in these samples. To further narrow the list of candidate mosaic events, we have required that all variants have a minor allele frequency (MAF) <0.01% in gnomAD (v2.1) (unpublished data) and <0.015% in the BHCMG data set, and are not located within repetitive sequences or segmental duplication regions, as identified by the genomic superDups track,28 or within pseudogenes (except one unique DNA region within a segmental duplication for which we were able to design polymerase chain reaction [PCR] primers) from the University of California–Santa Cruz Genome Browser (https://genome.ucsc.edu/). To remove likely false positive (FP) events (i.e., technical artifacts), we have excluded variants that occur in the top 5% of trios with the highest number of mosaic candidates.
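The thresholds described above can be illustrated with a minimal Python sketch. The record layout (`SampleCounts`, `TrioCall`) and the function `is_mosaic_candidate` are hypothetical names introduced only for illustration and do not reproduce the pipeline used in this study; only the numeric cutoffs are taken from the text. As a simplifying assumption, the sketch requires at least one alternate read in at least one parent and does not model the removal of the top 5% of trios.

```python
from dataclasses import dataclass


@dataclass
class SampleCounts:
    """Read counts at one variant position in one sample (hypothetical layout)."""
    alt_reads: int
    total_reads: int

    @property
    def vaf(self) -> float:
        # VAF = alternate allele reads / total reads at the position
        return self.alt_reads / self.total_reads if self.total_reads else 0.0


@dataclass
class TrioCall:
    """One variant observed in a proband and both parents (hypothetical layout)."""
    proband: SampleCounts
    mother: SampleCounts
    father: SampleCounts
    gnomad_maf: float          # allele frequency as a fraction, e.g. 0.0001 = 0.01%
    cohort_maf: float          # allele frequency in the internal cohort
    in_segdup_or_repeat: bool  # overlaps repeats, segmental duplications, or pseudogenes


def is_mosaic_candidate(call: TrioCall,
                        min_depth: int = 20,
                        het_range: tuple = (0.30, 0.70),
                        parent_max_vaf: float = 0.10,
                        gnomad_max: float = 0.0001,
                        cohort_max: float = 0.00015) -> bool:
    """Apply the depth, VAF, frequency, and region filters to one trio call."""
    # Depth of at least 20x in every member of the trio
    if any(s.total_reads < min_depth
           for s in (call.proband, call.mother, call.father)):
        return False
    # Proband must look heterozygous (VAF between 30% and 70%)
    if not het_range[0] <= call.proband.vaf <= het_range[1]:
        return False
    parents = (call.mother, call.father)
    # Parental VAF must stay below 10% ...
    if any(p.vaf >= parent_max_vaf for p in parents):
        return False
    # ... and (assumption) at least one parent must carry an alternate read
    if not any(p.alt_reads > 0 for p in parents):
        return False
    # Rare in gnomAD (<0.01%) and in the internal cohort (<0.015%)
    if call.gnomad_maf >= gnomad_max or call.cohort_maf >= cohort_max:
        return False
    return not call.in_segdup_or_repeat


example = TrioCall(proband=SampleCounts(45, 100),
                   mother=SampleCounts(2, 90),
                   father=SampleCounts(0, 95),
                   gnomad_maf=0.0, cohort_maf=0.0,
                   in_segdup_or_repeat=False)
print(is_mosaic_candidate(example))
```

In this hypothetical example the proband VAF is 45%, the mother carries 2 alternate reads out of 90 (VAF of about 2.2%), and the father carries none, so the call is retained as a candidate.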

Fig. 1: Candidate mosaic variant selection in Baylor-Hopkins Center for Mendelian Genomics (BHCMG) cohort.

VCF files from 823 trios from the BHCMG cohort were used to identify variants that are likely heterozygous in probands and have zero or low coverage in one of the parental samples. In the second step, for each selected variant, pileup data from corresponding BAM files was retrieved. This information, along with external annotations (e.g., gnomAD allele frequency [AF]), was used to further narrow the list of mosaic candidates. SNV single-nucleotide variant.

Baylor Genetics Laboratories data set

We analyzed family trio ES data from approximately 15,000 patients enrolled in clinical diagnostic studies. Average depth of coverage was ~100× with >70% of reads aligned to target, >95% of target bases covered at >20×, and >85% of target bases covered at >40×. Since ES data in BG have been preprocessed using a different analytical pipeline than in the BHCMG cohort, we modified the mosaic SNV candidate selection accordingly. We have used three different data subsets, as presented in Supplementary Fig. 1. The first subset of parental mosaic variants was derived from the analysis of 3175 apparent de novo heterozygous SNVs in the probands selected previously in the process of clinical analysis. The second subset consists of approximately 1000 trios for which joint VCF files were generated on the Illumina DRAGEN 2 platform. We focused on unique rare variants that occurred in only one family. We also removed any variants that overlapped segmental duplications. Similar to the approach used for the BHCMG cohort, we required a depth of at least 20 reads in each parent, evidence of a heterozygous state in the proband with a VAF of 30%–70%, and 0 < VAF < 10% in one parental sample (homozygous reference state in the other parental sample). In the next step, only clinically relevant variants with a read depth ≥50× have been selected, followed by manual analyses of the pileup data of parental samples. An additional nine samples (third subset) were included after being flagged by the BG directors as suspected somatic mosaic cases during manual analyses of the pileup data.

Exome sequencing QC

As a quality control (QC) measure, each DNA sample undergoing ES in either the BHCMG or BG cohort is analyzed in parallel by a coding single-nucleotide polymorphism (cSNP) array (Illumina Human Exome-12v1 array) to ensure correct sample identification and to assess sequencing quality. This approach ensures greater than 99% concordance between both methods.29 When contamination above 5% is detected, the sequencing data are further investigated and the sample is resequenced if needed.

DNA extraction

Initial ES in the BHCMG and BG cohorts was performed on the blood samples in greater than 95% of cases. In the remainder of cases, it was saliva. For validation experiments, peripheral blood DNA was extracted using the Gentra Puregene Blood kit (Qiagen, Germantown, MD, USA). For the selected cases from the BG cohort, at least five hairs with follicles were collected, and DNA was extracted using the QIAamp DNA Investigator Kit (Qiagen). Saliva was collected using the ORAgene Discover OGR-500 kit (DNA Genotek, Ottawa, Canada). Buccal cells were collected using the ORAcollect OC-175 kit (DNA Genotek). Both saliva and buccal cell DNA were extracted using the prepIT-L2P (DNA Genotek). DNA from urine was extracted using the Quick-DNA Urine Kit (Zymo Research, Irvine, CA, USA). All procedures followed the manufacturer’s instructions.

Validation of candidate mosaic variants using molecular methods

To validate putative parental somatic mosaicism of the selected variants, we have used three different molecular techniques: amplicon-based NGS, droplet digital PCR (ddPCR), or blocker displacement amplification (BDA).

Amplicon-based NGS

PCR primers targeting the putative mosaic variants were designed using BatchPrimer3 v1.0 and Primer3 v. 0.4.0 tools. The tested parental samples were amplified by PCR using recombinant Taq DNA Polymerase (ThermoFisher Scientific, Waltham, MA, USA). Each 150-µl reaction contains 1× Taq Buffer with (NH4)2SO4, 1.5 mM MgCl2, 0.2 mM dNTPs, 0.5 µM forward and reverse primer, 3.75 U of Taq polymerase, and 200 ng of DNA. The PCR products were purified by QIAquick PCR Purification Kit (Qiagen) according to the manufacturer’s instructions. Concentration of the purified PCR amplicons was quantified by Qubit dsDNA BR Assay (ThermoFisher Scientific) using the Qubit 4 Fluorometer (ThermoFisher Scientific). The purified amplicons of 300–338 bp were sequenced using the HiSeq 2500 platform (Illumina, San Diego, CA, USA) with 300-bp paired-end (PE) reads at BGI (San Jose, CA, USA) or using the HiSeq X system (Illumina) with PE150 reads at CloudHealth Genomics (Shanghai, China). Integrative Genomics Viewer (IGV, v2.3) software30 was used to analyze the data, as well as in-house developed scripts implemented in the R programming language.

Droplet digital PCR

DNA oligo primers as well as variant and wild type specific FAM or HEX labeled probes targeting the potential mosaic variants were designed and purchased from IDT (Coralville, IA, USA). In each 20-µl reaction, 10 µl of ddPCR Supermix for Probes (No dUTP) (Bio-Rad, Hercules, CA, USA), 0.5 µM forward and reverse primer, 4 units of HindIII-HF restriction enzyme (New England Biolabs, Ipswich, MA, USA), and 100 ng of DNA were added. For each family, the proband’s DNA sample was utilized as a positive control and an unrelated wild type DNA from a blood sample was used as a negative control. A no template control was used to confirm that no DNA contamination was present in the starting reagents and workflow. The ddPCR reactions were carried out using the QX200 AutoDG Droplet Digital PCR System (Bio-Rad) and analyzed with QuantaSoft Analysis Pro software v1.7.4 (Bio-Rad) (http://www.bio-rad.com/webroot/web/pdf/lsr/literature/QuantaSoft-Analysis-Pro-v1.0-Manual.pdf) according to the manufacturer’s protocols. Each parental sample was run in at least triplicate.

Blocker displacement amplification

To determine the VAF in parental DNA, 12 samples were tested using BDA with the probands’ DNA samples as positive controls. BDA principles were previously described in detail by Wu et al.31 Quantitative PCR (qPCR) assays were performed with the use of PowerUp SYBR Green Master Mix (ThermoFisher Scientific) with 400 nM of each primer, 4 µM of blocker, and 10 ng of DNA per well. The amplification of GC-rich fragments was carried out with the addition of betaine (Sigma Aldrich, St. Louis, MO, USA) at a final concentration of 1 M. Reactions in the total volume of 10 µl were performed using CFX96 Touch Real-Time PCR Detection System (Bio-Rad). Each reaction was repeated at least twice. The qPCR products from two experiments were purified, Sanger sequenced, and analyzed using the ApE software (v2.0) (https://jorgensen.biology.utah.edu/wayned/ape/; https://openwetware.org/wiki/ApE_-_A_Plasmid_Editor_(software_review)).31

RESULTS

BHCMG cohort

Computational analyses

We obtained 309,221 genotype calls fulfilling the initial inclusion criteria. After removal of the low-quality sequencing samples and variants with MAF > 0.01%, we found 3156 apparent de novo variants in 768 probands. In the parental samples, 71 candidate SNVs, previously undetected by routine ES algorithms, met all filtering criteria (Fig. 1). Their VAFs ranged from 0.17% to 9.0%, with an average of 2.8%. Forty-two mosaic candidates absent in gnomAD had one alternate read supporting the variant allele, whereas the remaining 29 variants had two or more alternate reads. Among the 71 putative mosaic SNVs, 37 are exonic, including missense (n = 23), synonymous (n = 13), and nonsense (n = 1) variants. In addition, we have also selected variants mapping to the noncoding regions (n = 33) or at the splice site (n = 1).

Molecular verification of the candidate variants

Of the 71 mosaic candidates predicted using our computational approach, we evaluated 48 (68%) variants in the available DNA samples using at least one molecular method, i.e., amplicon-based NGS (n = 48), BDA (n = 12), or ddPCR (n = 18) (Supplementary Table 1). We have verified positive somatic mosaicism in 16 (33%) samples (Table 1, Fig. 2). The precision (TP/[TP + FP], where TP is the number of true positives and FP is the number of false positives) in the group of variants with two or more alternate reads at the variant position was 63.6% (14 of 22). Furthermore, when VAF was greater than 5% in the ES data, the prediction of somatic mosaicism was more reliable in that 7 of 8 (87.5%) SNVs were confirmed as mosaic events (Supplementary Fig. 2). The precision among candidates having a single read supporting the variant allele was 7.7% (2 of 26). To delineate additional predictors of true mosaicism in the group of candidate variants with a single alternate read, for each genomic position of a putative mosaic SNV, we have retrieved the pileup information from the remaining 7788 ES samples. For each variant, we have calculated the FracSupp value, defined as the fraction of samples having at least one alternate read at the position of the given candidate mosaic event. We have hypothesized that the presence of reads supporting an alternate allele at a given genomic position in the multiple samples from the BHCMG cohort may represent technical artifacts or recurrent sequencing errors rather than the true mosaic variants. Interestingly, we have found that in the group of variants with a single alternate read, the two candidates confirmed as TP mosaic events had significantly lower FracSupp value (Wilcoxon rank sum test, p = 0.046) than the remaining 24 FP events (Supplementary Fig. 3). In two subjects, VAFs measured by different methods (including ES) varied significantly between 6.4% and 19.4% in BAB5936 and between 1.2% and 20.5% in WPW160 (Table 1, Fig. 2).
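The two summary quantities used above, precision and FracSupp, can be written out as a short Python sketch. The function names `precision` and `frac_supp` and the input layout (one alternate-read count per cohort sample at the candidate position) are assumptions made for illustration; the counts passed to `precision` are those reported in the text, whereas the list passed to `frac_supp` is invented toy data.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("precision is undefined when no candidates were tested")
    return true_positives / total


def frac_supp(alt_read_counts: list) -> float:
    """Fraction of samples with at least one alternate read at a position.

    alt_read_counts holds one entry per cohort sample (excluding the trio
    under study): the number of alternate reads seen at the candidate position.
    """
    if not alt_read_counts:
        raise ValueError("at least one sample is required")
    supporting = sum(1 for n in alt_read_counts if n >= 1)
    return supporting / len(alt_read_counts)


# Candidates with two or more alternate reads: 14 confirmed of 22 tested
print(round(100 * precision(14, 22 - 14), 1))   # 63.6
# Candidates with a single alternate read: 2 confirmed of 26 tested
print(round(100 * precision(2, 26 - 2), 1))     # 7.7
# Toy data: 2 of 5 samples show an alternate read at the position
print(frac_supp([0, 0, 1, 0, 2]))               # 0.4
```

Under the hypothesis stated above, a high FracSupp value at a position suggests a recurrent artifact, so a candidate with a single alternate read would be more credible when its FracSupp is close to zero.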

Table 1 Parental low-level mosaicism rates in BHCMG cohort measured using ES, amplicon-based NGS, ddPCR, and BDA.
Fig. 2: Variant allele fraction (VAF) estimated using four different molecular methods: exome sequencing (ES), amplicon-based next-generation sequencing (NGS), blocker displacement amplification (BDA), and droplet digital polymerase chain reaction (ddPCR).

If there are no results for a particular validation method, we indicated that it was either not tested (NT) or that validation did not succeed due to technical failure (TF). In most cases, estimated VAFs were consistent among different experimental methods.

Impact of potential cross-sample contamination

A potential cross-sample contamination is another limiting factor in the detection of mosaicism in ES data that can lead to an increased number of false positives. All ES data used in this study passed quality control (see “Materials and Methods”); however, to confirm the lack of significant cross-sample contamination and to measure the actual level of contamination more accurately, we have processed the BHCMG samples that underwent orthogonal validation for mosaicism using the GATK CalculateContamination software. We found that on average, each sample yielded contamination of 1%, ranging between 0% and 5% (Supplementary Fig. 4) with no significant difference between the cohorts of samples that passed or failed validation. We did not observe any significant contamination (i.e., larger than 5%); however, in 15 samples, we found contamination levels higher than 1% (which was used as expected background noise cutoff in previous work32).

BG cohort

We have analyzed the apparent de novo SNVs detected in the probands. In the parental blood samples, we have selected 46 potentially mosaic exonic SNVs, including missense (n = 33), nonsense (n = 4), frameshift (n = 7), synonymous (n = 1), and untranslated region (UTR) (n = 1) variants. In addition, we have selected eight intronic variants, including six splice site variants. We have examined these variants for somatic mosaicism using amplicon-based NGS (n = 54) or ddPCR (n = 6). In the 45 samples having pileup data (from 58 labeled as DS1 or DS2 in Supplementary Fig. 1), the precision was 17.7% (8 of 45). In the subgroup of variants with two or more alternate reads at the variant position, the precision was 43.7% (7 of 16), whereas among candidates having a single read supporting the variant allele it was only 3.4% (1 of 29). In nine studied samples that were flagged by BG directors (DS3) as potential mosaic, three (33.3%) were confirmed as mosaic (Table 2).

Table 2 Parental low-level mosaicism rates in BG cohort measured using ES and amplicon-based NGS.

Distribution of VAFs among different somatic tissues

We had previously detected a mosaicism level (calculated as VAF) greater than 10% in the whole-blood samples from three parents: M1.1, M3.1, and M8.2.16 To study somatic mosaicism in other tissues in these individuals, we have assessed their levels using amplicon-based NGS. For parent M1.1, in four tested tissues, the levels of mosaicism were estimated as 27.3%, 23.7%, 29.5%, and 40.2% in whole-blood, buccal, fibroblast, and hair samples, respectively. For parent M3.1, 3.3% mosaicism was detected in the buccal sample, 16.7% in the saliva sample, and 17.6% in the blood, whereas no evidence of this variant was found in the hair sample. For parent M8.2, we have identified similar levels of mosaicism in the blood (13.2%), buccal (14.2%), saliva (17.7%), and urine (15.8%) samples, with the exception of low-level mosaicism in the hair (2.5%) (Fig. 3). To expand the tissue distribution study, we have also included six previously published probands with somatic mosaicism greater than 10% in their blood samples.16 The most outlying VAFs were observed in the hair tissue, where the level of mosaicism was significantly higher in the hair than in the blood in three cases, and significantly lower in five cases. We have also found that in six of nine cases, VAFs observed in at least one nonblood tissue were higher than VAFs estimated for blood samples (either by ES or amplicon-based NGS) (Fig. 3).

Fig. 3: Distribution of variant allele fractions (VAFs) among six different tissues: blood, saliva, buccal, skin fibroblast, hair, and urine.

Analyses were performed for nine individuals, including three unaffected parents and six affected probands. In the case of blood tissue, VAF was estimated based on both exome sequencing (ES) (labeled as “Blood_ES”) and amplicon-based next-generation sequencing (NGS) data (labeled as “Blood”). In six of nine cases, there was at least one tissue for which VAF was estimated to be higher than VAF in blood.

DISCUSSION

While recent advances in NGS techniques enable the detection of mosaic variants more precisely than Sanger sequencing, the identification of low- and very low–level somatic mosaicism in ES data remains challenging. Variants with VAFs lower than 10% are typically not detected using standard ES variant calling pipelines. To overcome these limitations, we have developed a more sensitive computational screening tool and have verified its robustness in the family trio ES data set using three independent experimental molecular methods.

The performance of NGS methods depends primarily on the read depth at a given base pair. Theoretically, these methods could detect mosaic variants with a single alternate read (VAF = 1/N, where N is the total read coverage at the variant position). However, experimental data have shown that a mosaic fraction can be detected only if it is greater than the sequencing error rate generated at various steps of NGS, including library preparation, PCR amplification, and sequencing.15,33 The error rate of routine ES ranges between ~0.1% and 1.0% and cannot be significantly reduced even using the ultradeep sequencing in amplicon-based NGS.33,34 Recent studies have shown that joint analyses of library-level replicates can reduce the false positive signals and facilitate a robust identification of mosaic variants with higher sensitivity and specificity.35
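The relationship between read depth, error rate, and the number of alternate reads can be made concrete with a simple binomial model. The following Python sketch is illustrative only: it assumes that errors occur independently on each read with a single per-read error rate, which is a simplification, and the function name `prob_alt_reads_from_error` is hypothetical. The depth of 90 reads and the error rate of 0.1% are taken from the average coverage and the lower bound of the error range quoted in this article.

```python
from math import comb


def prob_alt_reads_from_error(depth: int, min_alt_reads: int,
                              error_rate: float) -> float:
    """P(X >= min_alt_reads) for X ~ Binomial(depth, error_rate).

    Gives the chance that sequencing error alone produces at least
    min_alt_reads alternate reads at a position covered by depth reads.
    """
    if depth < 0 or min_alt_reads < 0 or not 0.0 <= error_rate <= 1.0:
        raise ValueError("invalid arguments")
    # Probability of observing fewer than min_alt_reads erroneous reads
    p_fewer = sum(
        comb(depth, k) * error_rate ** k * (1.0 - error_rate) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


# Smallest VAF representable at 90x depth is 1/N
print(1 / 90)
# Chance of >= 1 and >= 2 error-derived alternate reads at 90x, 0.1% error
print(prob_alt_reads_from_error(90, 1, 0.001))
print(prob_alt_reads_from_error(90, 2, 0.001))
```

Under these assumptions, a single error-derived alternate read is expected at roughly 9% of positions, whereas two or more arise at well under 1% of positions, which is in line with the much lower validation rate observed for single-read candidates.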

To remove variants that were erroneously called as heterozygous in the probands, we have used conservative filtering criteria (based on the fixed VAF thresholds, i.e., 30% < VAF < 70%). In case of detection of parental mosaicism, the additional rationale of using this filter is that highly skewed VAF observed in the proband may indicate the existence of technical biases in a given locus, which increases the chance that a candidate mosaic event in the parental sample is not real. Although this approach helped us to reduce the number of false positives, it may also result in underdetection of variants in regions with depth of coverage (DP) < 50×, in which the VAF of true heterozygous events may fall outside the 30–70% range. Therefore, in other applications, such as de novo variant calling, one should consider using less stringent filters for the heterozygous state, e.g., p value based on the binomial distribution of VAF that is dependent on DP and allows higher variability of VAF in poorly covered regions.
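A depth-aware alternative to fixed VAF thresholds, as suggested above, can be sketched with an exact two-sided binomial test against an expected VAF of 50%. This is a minimal illustration and not a method used in this study; the function names and the significance threshold `alpha = 0.01` are assumptions chosen for the example.

```python
from math import comb


def het_binomial_pvalue(alt_reads: int, depth: int) -> float:
    """Two-sided exact binomial p value for H0: VAF = 0.5.

    Because Binomial(depth, 0.5) is symmetric, the two-sided p value is
    twice the smaller tail, capped at 1.
    """
    if depth <= 0 or not 0 <= alt_reads <= depth:
        raise ValueError("require depth > 0 and 0 <= alt_reads <= depth")
    k = min(alt_reads, depth - alt_reads)
    # P(X <= k) under Binomial(depth, 0.5), computed with exact integers
    tail = sum(comb(depth, i) for i in range(k + 1)) / 2 ** depth
    return min(1.0, 2.0 * tail)


def consistent_with_het(alt_reads: int, depth: int, alpha: float = 0.01) -> bool:
    """True if the read counts do not reject the heterozygous state at level alpha."""
    return het_binomial_pvalue(alt_reads, depth) >= alpha


# Same VAF of 25%, different depth
print(consistent_with_het(5, 20))     # True: plausible heterozygote at 20x
print(consistent_with_het(50, 200))   # False: unlikely heterozygote at 200x
```

In this example, a VAF of 25% would be rejected by the fixed 30–70% filter at any depth, whereas the binomial criterion retains it at 20× and rejects it at 200×, allowing more VAF variability in poorly covered regions.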

It is challenging to distinguish whether the value reported by GATK CalculateContamination, which was greater than 1% in 15 samples, was caused by real cross-sample contamination or by an increased number of technical artifacts. The reason for this is that the background noise level depends on multiple factors such as DNA polymerase, sequencing and alignment errors, index hopping, or incomplete trimming of the adapters,36 and it may vary between sequencing experiments. Interestingly, other investigators32 who detected signs of contamination in a significant fraction of their analyzed cohort were able to identify a source of contamination only in 17% of samples with the reported contamination >1%. The abovementioned issues further underline the importance of using orthogonal molecular validation methods to confirm low-level somatic mosaicism in parental samples, and to remove most of the potential technical and biological biases.

Using our computational pipeline in the ES data set, we were able to identify and orthogonally validate 27 somatic mosaic variants with low- and very low–level somatic mosaic VAFs in the parents from two cohorts. Our approach enabled detection of mosaic variants with VAF > 5% with high precision (>85%), whereas identification of variants with lower VAFs turned out to be more challenging, with a precision of ~28%. Our data confirm that the presence of a single alternate read in an ES data set is usually an insufficient predictor of somatic mosaicism and more likely denotes a false positive event.15 Our results also indicate that the improvement of precision in the group of candidates with a single alternate read is possible by using additional predictors for filtering, such as the FracSupp value (i.e., the fraction of samples from the BHCMG cohort having at least one alternate read at the position analyzed) (Supplementary Fig. 3).

The real frequency of mosaicism can be biased by technical limitations. For example, too high or too low GC content, predicted probe dimerization, or the presence of runs of consecutive nucleotides at the SNV site can substantially affect nucleotide discrimination, precluding testing of some variants using ddPCR. An insufficient amount of DNA was the main limiting factor for variant validation using BDA and ddPCR (Supplementary Table 1). Thus, studies using larger data sets are needed to confirm the utility of our approach.

As somatic mosaic variants may occur at different developmental stages, their distribution may vary substantially among different somatic tissues. However, larger-scale studies of the distribution of mosaicism in different tissues representing the three primary germ layers have not been performed systematically. Growing evidence indicates that whole blood, which is typically tested in the clinical diagnostics setting, may not be the optimal tissue to search for somatic mosaicism.37 A pool of whole blood cells may grow at a relatively faster rate and lead to clonal expansion, especially in older subjects.38 Therefore, mosaic variants in the blood are more likely to be under- or overrepresented, particularly if the variant influences cell survival or growth. We and others have observed that VAFs in nonblood tissues were usually higher than those in blood samples, suggesting that tissues other than blood (e.g., those exhibiting different VAFs) may be more suitable for testing somatic mosaicism. Our correlation analyses showed that VAFs identified in hair follicles are the least correlated with VAFs assessed in other somatic tissues (Supplementary Fig. 5). However, given that in some cases not all six types of parental or proband tissue were available for screening, the real intertissue distribution of mosaic variants may be unrecognized. Further studies in larger cohorts are needed to estimate the mosaic ratios across different tissues.

In most cases, the levels of parental somatic mosaicism measured using three orthogonal molecular experimental methods were comparable, whereas only in a few samples did the levels vary significantly. The highest consistency of mosaic fraction was observed between BDA and ddPCR results, confirming our previous observations that these methods can be alternatively used for the accurate quantitation of low-level mosaicism. BDA and ddPCR are both more sensitive than NGS-based approaches. BDA was proven to reliably detect variants with VAF as low as 0.1%.31,39 We were able to validate very low-level somatic mosaicism in sample UT0133 with VAF assessed as 0.3% using BDA, 0.3% using ddPCR, and 0.5% using amplicon-based NGS. In the BG samples where the VAFs calculated based on the PCR amplicon NGS data were less than 1.0%, we have elected not to interpret them as real events as they were not verified by any other orthogonal molecular method (Supplementary Table 1).

In conclusion, we describe a customized computational pipeline that enables robust and accurate identification of low- and very low–level parental somatic mosaic variants in ES data that are not detected using standard NGS data processing methods. We show that the number of alternate reads in the parental sample positively correlates with the likelihood of confirming the parental mosaicism in the validation studies. Knowing that a suspected de novo variant may actually be present in a mosaic state in one of the parents is critical in providing an accurate estimate of recurrence risk.