Assessment of the gene mosaicism burden in blood and its implications for immune disorders

There are increasing evidences showing the contribution of somatic genetic variants to non-cancer diseases. However, their detection using massive parallel sequencing methods still has important limitations. In addition, the relative importance and dynamics of somatic variation in healthy tissues are not fully understood. We performed high-depth whole-exome sequencing in 16 samples from patients with a previously determined pathogenic somatic variant for a primary immunodeficiency and tested different variant callers detection ability. Subsequently, we explored the load of somatic variants in the whole blood of these individuals and validated it by amplicon-based deep sequencing. Variant callers allowing low frequency read thresholds were able to detect most of the variants, even at very low frequencies in the tissue. The genetic load of somatic coding variants detectable in whole blood is low, ranging from 1 to 2 variants in our dataset, except for one case with 17 variants compatible with clonal haematopoiesis under genetic drift. Because of the ability we demonstrated to detect this type of genetic variation, and its relevant role in disorders such as primary immunodeficiencies, we suggest considering this model of gene mosaicism in future genetic studies and considering revisiting previous massive parallel sequencing data in patients with negative results.

The distribution and effect of somatic genetic variants in disease has been studied mostly in cancer. However, in the past years, they have also been identified in a wide spectrum of syndromes including neurological disorders as schizophrenia 1 , autism spectrum disorder 2 , Alzheimer [3][4][5][6] or Huntington disease 7 , coronary heart disease and stroke 8 and kidney diseases such as the Alport syndrome [9][10][11] . In fact, at least theoretically, all monogenic diseases could be originated by a postzygotic mutation and the resulting somatic mosaicism. In the field of immunerelated diseases, a remarkable number of somatic variants have been described in monogenic autoinflammatory diseases [12][13][14][15][16][17][18][19][20] , and a recent work has shown its important contribution to these disorders and other primary immunodeficiencies (PIDs) 21 .
Understanding the relative abundance of somatic variants in health is critical to design efficient tools for mosaicism detection in disease studies. Different studies have measured the presence of somatic variation in normal tissues, most assessing the presence of mutations in cancer-driver genes, such as NOTCH1 mutations, which undergo expansion through positive selection [22][23][24] . They reported the colonization of the tissue by mutant clones increasing with age and exposure to mutagenic agents (sun radiation, tobacco). Other studies, based on single cell 25 or transcriptome analysis 26 revealed tissue-specific patterns of somatic variant distribution, as well as negative selection of functional variants in non-cancer samples.
The detection of somatic variants from massive parallel sequencing (MPS) data presents some difficulties. Standard variant calling methods are based on the presence of germline heterozygous mutations in about 50% of the sequencing reads, and may fail to detect somatic variants in allelic imbalance and lower frequencies. Most Samples. The present study included both unique and matched samples from peripheral blood (PB), oral mucosa (OM) and urine (UR) for 12 individuals: (i) 11 unrelated PID patients carrying a pathogenic and previously described somatic variant, and ii) one of the descendants with the same pathogenic variant in germline status (Table 1). In eight individuals, the only analysed sample was PB (S2, S4a, S6, S8, S9, S10 and S11) or OM (S4). In four individuals, we analysed samples from paired tissues: from PB and OM in three patients (S1a-S1b, S3a-S3b and S5a-S5b) and, in the remaining patient, from PB and UR (S7a-S7b).
All of the PID mutations are missense single nucleotide variants (SNVs), and are the disease causing mutation either in the proband or in its offspring, where they are germline variants. The range of variant allele frequencies (VAFs) for the somatic variants previously estimated by ADS 21 ranges from 2.3 to 34.8%.
For patient S5 we included additional samples from urine, oral mucosa, whole blood (before and after anti-IL-1 treatment), and different cell type populations previously isolated by flow cytometry 20 : neutrophils, monocytes, B cells, T CD4 + cells and T CD8 + cells (all pre-treatment).  40 (https:// sourc eforge. net/ proje cts/ varsc an/ files/). FreeBayes and HaplotypeCaller are purely germline callers. SomVarIUS is a caller designed to detect somatic variants in unpaired samples. The rest of them support a single mode and a paired mode. Although in our study we were not analysing cancer samples, we tested the behaviour of variant callers' paired mode in this context with the matched PB-OM and PB-UR samples. We used default parameters for all the callers except for VarScan2, where we lowered the allele frequency threshold of 20% and set the p-value to 1 to retrieve all the possible calls. For HaplotypeCaller, we first used the default ploidy parameter of 2 and next we considered other ploidy values: 4, 5, 6 and 10.
We performed ADS with rhAmpSeq from Integrated DNA Technologies (IDT, Coralville, USA) to validate the candidate somatic variants. We sequenced every selected position to a mean coverage > 20,000X in a NextSeq Illumina platform in a High Output 2 × 150 paired-end cycles run. The confirmed in blood plus 19 additional candidate somatic variants in S5 were analysed for validation in different tissues and cell population samples. They were sequenced in a MiSeq v3 run (2 × 300) to a final depth > 155,000X. We used BWA-mem version 0.7.16a-r1181 to map the fastq files to the human reference genome hg38 (UCSC). We then used pysam version 0.15.2 (https:// github. com/ pysam-devel opers/ pysam) to count the number of reads supporting every allele, requiring a minimum mapping quality of 20 to calculate VAFs.

Results
Detection of somatic pathogenic variants from WES in PID patients. We performed WES in all DNA samples to a mean coverage of 245X (Table 1). The total number of genetic variants differs among the different callers ( Supplementary Fig. S1), mostly because of VarDict and VarScan2, the two callers with relaxed allelic imbalance parameters, which called more than 200,000 variants each. These two callers also show high heterogeneity across samples, which correlates with sequencing depth, as expected in MPS experiments. The amount of overlapping variants across the different callers is uneven, especially for SomVarIUS, due to the low number of variants it calls. The number of concordant variants between VarDict and VarScan2 is also low, probably because VarDict calls 3-4 times the number of indels of Varscan2 and because of discrepancies calling low frequency variants ( Supplementary Fig. S2). Figure 1 shows which known causal somatic variants (Table 1) are detected by each software. FreeBayes and HaplotypeCaller have the lowest detection ratios. For the rest, the ability of detection is similar and seems to depend on the frequency of the mutations, along with the coverage of the sample and the mapping quality. The S1a causal variant has not been called by any software, but visual inspection of the mapped reads revealed that none of them supported the alternative allele ( Supplementary Fig. S3). Excluding it, VarDict and VarScan2 were able to detect all the causal variants. To increase the power of detection of HaplotypeCaller, we explored the effect of modifying the ploidy parameter. We used ploidy 2 (default), 4, 5, 6 and 10 in order to call variants with lower frequencies than expected in a germline scenario. This parameter is normally tuned when working with organisms with ploidies different than 2. For instance, decaploid plants have been reported 48,49 , and genotypes 0/0/0/0/0/0/0/0/0/1 are possible. This way, the increase of the ploidy parameter makes HaplotypeCaller more sensible to low frequency variants. The percentage of detected variants increased sequentially with the ploidy parameter, although some remained undetected. HaplotypeCaller seems to be sensitive to mapping quality as in the case of the ELANE region ( Supplementary Fig. S4), where a variant with moderate frequency is not detected by this caller. Interestingly, we lost one variant using ploidy 10 while it was previously detected with ploidies 5 and 6 due to memory reasons ( Fig. 1, expanded in Supplementary Fig. S5).
Next, we assessed the performance of the five variant callers including a paired mode in the four cases with available paired samples (S1, S3, S5 and S7), where the same variant is present in two tissues with different frequencies. As a general trend, there is no improvement of the detection rate when using the paired mode compared to the single mode, probably because of the small differences in allele frequency between tissues. The use of one or the other paired sample as cancer/healthy tissue does not seem to affect the capacity of detection. Again, VarDict and VarScan2 showed the best detection ratios (Supplementary Table S1). called, a set of different filters is commonly applied to reduce the number of false positives. This is a crucial issue in the study of monogenic syndromes, where the aim is moving from the approximately 20,000 genetic variants identified in a typical WES to one or a few candidate variants. Relaxing or disabling the VAF filters to increase the ability to detect causal somatic variants, as we did in this study, produces an important increase of the number of mutations per individual, making this process highly recommended. We evaluated the ability to identify the known pathogenic variants after applying the standard filters to the variants called by VarDict and VarScan2, the most successful programs in calling them (Fig. 1). We started by intersecting the two VCF files for every individual, given that in all cases the true variants were retained by both of them. Next, we applied a set of additional filters sequentially (see below), checking in every step if the causal variant was retained or filtered out (Table 2). First, we filtered out SNPs located 6 bp around indels. Second, as suggested previously 50 , we restricted our analysis to the 1000 Genomes Project strict mask filter. Third, we required the positions to be covered by, at least, 50 reads (DP > 50) and to show a minimum quality value of 25 (QUAL > 25). Fourth, we only kept loss of function and missense variants. Fifth, we applied a stringent population allele frequency threshold of 0.001 in gnomAD. With a high probability, a somatic variant will be absent in the population because of its de novo nature, although the possibility of having a recurrent mutation cannot be excluded. Sixth, following the recommendations in the literature, we kept variants with a likely damaging predicted effect (CADD > 15 44 ) and a high evolutionary conservation score, as an indicator of its functional importance (GERP > 2 51 ). Seventh, we required at least three reads supporting the alternative allele (VD ≥ 3) in every call. Finally, we used the list of 333 genes of the International Union Of Immunological Societies (IUIS, updated in February 2018) 52 as a set of candidate genes for PIDs. Excluding the causal somatic variant of sample S1a, which was not detected in the sequencing process, 13 out of the 14 somatic mutations were included in the final list of candidate variants. The remaining one (S6), was filtered out because of a GERP value lower than 2.
Mosaicism abundance detection in whole blood. As mentioned above, the consideration of genetic variants deviating from the approximate expectation of 50% read frequency increases substantially the number www.nature.com/scientificreports/ of called variants. In the previous analyses we assessed how many of the true causal variants in 11 PID samples were detected. Now we wonder what proportion of the called variants in these samples corresponds to real postzygotic mutations, and not to sequencing, mapping or calling errors. We restricted the analysis to coding variants, more prone to have a functional impact and to be related to monogenic disorders. For this, we applied the following filters to select the variants more plausible to be validated as true: we intersected the SNPs called by VarDict and VarScan2, removed SNPs located 6 bps around indels, applied 1000G strict mask, required a minimum depth of 50 and a minimum quality of 25, removed variants classified as common in dbSNP and those shared among samples in the study, removed SNPs located within homopolymers, and removed SNPs in positions where the mappability was not perfect. We also performed a binomial test to exclude potential heterozygous mutations, to estimate the possibility of the observed number of reads supporting the alternative allele given the total number of reads. We finally required a minimum number of reads supporting the alternative allele of 7, due to the large number of variants below this threshold in our dataset ( Supplementary Fig. S6). After this filtering, we moved from the approximately 250,000 variants called per individual to around 40. (Fig. 2), representing a total of 461 candidate somatic variants (Supplementary Table S2) Figure 3. VAF of validated somatic variants in S5 patient per tissue and cell type. Green is used for the group of variants with higher VAF (around 24%), red for those with intermediate VAF (around 4%) and blue for those with low VAF (around 1.5%, the only group present in B cells). Of note, there is one variant in the X chromosome whose frequency has been divided by 2 in order to visualize it grouped with the others. www.nature.com/scientificreports/ missense variant in ODF2 (S1), SHISHA2 (S2), STRIP1 (S3) and IL2RG (S11), and one synonymous variant in CACNAS1 (S4) and ROBO4 (S11). Of note, in patient S5 we validated a total of eleven variants: seven missense, being one of them the causal variant in NLRP3, and four synonymous. The twelve variants seemed to cluster in two frequency groups: one with variants of about 25% (including the pathogenic variant) and other with variants about 4.5% (Supplementary Table S2).

Cell type distribution of somatic variants in S5 patient. Given the high number of validated somatic
variants in patient S5, we expanded the analysis selecting nine additional candidate genetic variants. These variants were analysed for validation, along with the twelve previously confirmed, both in the whole blood sample and different cell populations separated by flow cytometry 20 (Table 3). We also added a whole blood DNA extraction obtained after the anti-IL-1 treatment this patient received. In this experiment, the average coverage per position was 158,000X (max = 484,219, min = 16,689, sd = 80,940). We considered that a somatic variant was validated in a given cell type or tissue when the proportion of reads supporting the alternative allele was above 0.30%, a value close to the average error type of sequencing by synthesis technologies, which also varies with features such as sequence context or the specific nucleotide change 53,54 . Six of the nine new genetic variants were validated, with one (chr7:157,614,060) being a germline variant according to its frequency (Table 3). Overall, we detected 17 somatic variants in this patient, 16 protein coding and one intronic (Table 3), now clustered in three groups with similar VAFs around 24%, 4.5% and 1.5% in whole-blood pre-treatment (Fig. 3) and cell type distribution. VAFs changes across different cell types and tissues are coordinated within each group, being the two main groups only present in the myeloid line as well as in urine and cell mucosa, but absent in the lymphoid line. In general, we found higher allele frequencies in monocytes and lower in oral mucosa. The presence of the somatic variants in oral mucosa and urine was produced by leukocyte infiltration, which was detected by flow cytometry 20 . On the other hand, the lowest VAF group of variants are detected in myeloid cells and B cells, but not in T cells. The VAF of all the somatic variants is reduced in the whole-blood sample after the anti-IL-1 treatment (Whole blood 3 post, in Table 3). This decrease is more important for the variants restricted to the myeloid line, and it is likely observed because of the increased proliferation of inflammatory cells, which is now controlled with the treatment 20 .

Discussion
We performed WES of DNA samples from patients with PIDs, carrying variable degree of gene mosaicism and assessed the ability to detect the somatic causal genetic variants by using different tools. Among the eight variant callers tested, VarDict and VarScan showed the higher detection rates of the causal somatic variants. The rest of the callers designed for somatic variant detection (MuTect2, SomVarIUS and Strelka2) mainly showed some limitations with the lower frequency variants at lower coverage. FreeBayes and HaplotypeCaller, designed for germline variant detection, failed to detect most of the somatic mutations. However, the performance of HaplotypeCaller increases when modifying the ploidy parameter, devised for non-diploid organisms and which allowed retrieving variants with less frequency than the expected 50% in the germline. Of interest, the efficiency of the five callers including a paired mode did not increase when using paired samples, probably because of the small frequency difference between the two samples carrying the same mutation.
Allele frequency is the main limitation for calling a somatic variant, with the risk of non-capturing the mutation because of its low frequency and/or insufficient coverage. To capture these low frequency variants, sequencing depths should ideally be higher than the commonly average depths achieved in WES studies (60-100X). However, the average coverage value might not be informative enough on the sequencing performance for all genomic regions, given the non-uniformity of the capture process. The use of new metrics including this information has been proposed 55 , which should help to reduce false-negative results. As an example, the NOD2 region is clearly captured more efficiently than the NLRP3 region in our study (Table 1). On the other side, only a few reads supporting the alternative allele seems enough to detect the variant, with as few as 3 (out of 128) for the S10 variant or 7 (out of 97) for the S1b variant (Table 1). Thus, an increase of the sequencing depth to 100-200X is recommendable in cases in which somatic variation is suspected. Higher coverage facilitates the detection of very low frequency variants, but increases the risk of enlarging the list of candidate variants because of approaching the error rate of MPS technologies 56 .
Genetic studies usually implement a set of filters to reduce the number of candidate variants to the causal one or to a small group. This process is a trade-off between reducing the number of false positives (either sequencing or mapping artefacts, and non-causal variants) and false negatives (called but filtered true causal variants). At the risk of missing the causal variant, these filters are essential to determine, at least, a reduced list of candidate genes for monogenic syndromes. In the case of studies like this, where the relaxation of allele frequency thresholds generates a list of up to hundreds of thousands of variants per sample ( Supplementary Fig. S1), this step can be especially critical. After applying commonly used filtering parameters both for sequencing and biological features, only the causal variant in one patient was discarded because of low conservation score (GERP for S6 causal variant: -8.07). In the case of applying more stringent filters, two more variants (S1b and S5a-b) would be missed due to GERP score vale lower than 4 57,58 . On the other side, only S6 causal variant would not pass a CADD threshold of 20.
The final number of candidate genetic variants exceeds by about ten times the number of variants in studies analysing germline variants. Considering the IUIS list of 333 candidate genes for PIDs, this is still quite high, with approximately 0.5 variants per gene in each individual. Therefore, it seems recommendable to restrict the analysis to a reduced set of candidate genes according to the clinical phenotype of each patient. Alternatively, the use of some gene features could also help to reduce the list of candidate variants if there is not any a priori clear candidate. Several gene indexes have been developed to measure their possible contribution to human disease. www.nature.com/scientificreports/ Among them, haploinsufficiency predictions could seem useful for identifying candidate genes in a somatic variant disease model expecting to follow a dominant inheritance pattern. However, all the genes with somatic causal variants included in this study show haploinsufficiency values below the consensus threshold of 0.5, with NLRP3, a gene that is proven to be mutated in different autoinflammatory diseases 59 , showing the highest value of 0.465. In contrast, NLRP3 has been reported as a gene with a high level of intolerance to functional variation (RVIS = − 0.95, in the top 9.38% of genes) 21 .
It is important to consider that exome sequencing was performed in DNA samples obtained from peripheral blood. Therefore, only somatic variants present in the major cell populations in blood can be detected. Neutrophils represent more than half of the nucleated blood cells (55-75%) in healthy individuals, while lymphocytes represent around 20% (from which T cells are ~ 70%, B cells are just ~ 20%, and NK cells ~ 10%) 60 . Thus, for early postzygotic mutations, the capacity of detection will most probably not be affected by the cell type implicated in the disorder, since the variant will have similar frequencies in all cell populations. In contrast, for later onset mutations restricted to particular lineages, the mutation will only be detectable if present in the major cell populations of the analysed tissue. Therefore, for immune disorders, the probability of detecting a causal variant from whole-blood extraction analysis will be much higher in those produced by alteration in the myeloid cells, such as in autoinflammatory disorders, than in the lymphoid cells. This fact can partially explain the larger number of reported cases in autoinflammatory disorders 21 compared to other PIDs, as well as the lack of success in the identification of somatic variants in lymphoid immunodeficiencies such as CVID 61 . In these latter situations, it is expected that a big proportion of somatic causal variants would only be detectable if the analysis is restricted to particular cell types. Thus, cell subsets isolation can be essential to the identification and/or the validation of somatic genetic variants in these less represented cell types.
Beyond the detection of the known causal variants, the detected load of coding variants per exome was very low. Except for S5, all the individuals carry none or only one somatic variant additionally to the causal variant. The vast majority of candidate variants were false positives, even if they passed the mapping and quality filters. Comparing our results to other studies is not straightforward because of the differences in the methodologies used and the scanned VAFs, as well as the conceptual approach and targeted regions (see "Introduction"). A whole-genome sequencing (WGS) data analysis of 11,262 blood samples revealed a median number of three mosaic mutations for younger individuals, increasing after 35-45 years of age, and considering 20 somatic variants as the threshold for clonal expansion, that affecting 12.5% of the individuals 62 . Although the minimum detectable VAF of the study was limited because of the 34.8X mean coverage, the results seem concordant with the low number of somatic variants described in our WES deep sequencing approach. In addition to scanning a wide range of VAFs, we validated our results by ADS, which confirmed the low number of somatic coding variants detectable in blood. At a finer level, the total number of somatic variants per cell has been estimated in single-cell studies 25,26 , although most of this variation would remain undetected when the whole tissue is analysed. In fact, when much lower frequencies have been scanned (VAF ≥ 0.0001), it has been shown that clonal haematopoiesis is present in up to 97% of middle-aged people 63 . However, in absence of positive selection on a given mutation, only those that occurred earlier would reach detectable frequencies.
We identified a particular patient with an excess of validated variants compared to the others. S5 is the oldest individual of our dataset (64 years old), although another individual of similar age was also included in this study. Especially for the higher VAF group of five variants (which includes the causal one in NLRP3), the frequency pattern is quite uniform, except for one of the variants in chromosome X (chrX:71,537,899), with lower frequency in monocytes. The presence of the genetic variants in the lowest frequency cluster in cells of the myeloid lineage and in B cells, but not in T cells, could be explained by its origin in adult hematopoietic stem cells generating multineage outputs 64 . Because of the seemingly aggrupation in three different clusters of frequencies and cell type distribution, we propose simultaneous occurrence and clonal expansion as the most parsimonious explanation. However, none of the genes with somatic variants in S5 (Table 3) seems to be related with cellular proliferation that could be linked to an adaptive advantage of a clone of cells, and we also discarded the presence of additional candidate variant in DNMT3A, TET2 and ASXL1 genes, known to be implicated in hematologic malignancies 8,65 . In fact, in the aforementioned study of WGS of 11,262 individuals 62 only 12.6% of the cases of clonal haematopoiesis had detectable cancer driver mutations. Thus, on the rest of cases as well as for S5, clonal haematopoiesis could be produced by genetic drift, as suggested in simulation analysis 66 . In contrast, a recent study 67 proposes positive selection being the major driving force of clonal haematopoiesis, and that it would take more than 2000 years for a mutation to reach a VAF > 1% by only drift. However, our results do not seem to fit to this explanation, because of the abovementioned gene location as well as the presence of synonymous and intronic variants.
Finally, although we believe that our study contributes to the understanding of the burden of functional somatic mutations in blood and provides some practical advice on its detection, we would like to acknowledge some limitations of our approach. Allele frequency and sequencing depth are the two main limiting factors to detect a somatic variant as shown in our case by the failure to detect a variant with VAF < 3%. Also, the number of genetic variants depends on the selected software, that show a limited level of overlapping among them. In this sense, we recommend an inclusive strategy by using the less stringent callers or parameters, followed by a filtering strategy based on sequencing and mapping features. However, even by using stringent filters, the capacity of detection of causal variants will be mostly limited to previously known candidate or related genes, given the excessive number of variants when considering the whole exome. Gene functional relevance or mutation tolerance indexes could be used to reduce the number of candidate genes, but they also show limited applicability. Of importance, we also acknowledge the limitations derived from the small size of our cohort which, while allowing the study of somatic variant discovery, makes it difficult to draw conclusions in terms of dynamics of somatic variation.

Conclusions
The detectable genetic load of somatic coding variants in blood is low. A moderate increase of the commonly achieved depths in exome sequencing analyses can be enough to detect most of these variants at frequencies above the technology error rate, for which we recommend using variant callers sensitive to low VAF. Of importance, the high proportion of false positives makes mandatory their validation which will also provide a better estimation of the VAF. Given both the feasibility of this approach and the reported contribution of gene mosaicism to PIDs 21 , we think that this model should be considered in future sequencing studies. It can be of special interest for those disorders related to major cell populations in blood, such as autoinflammatory diseases. We also suggest reanalysing data of undiagnosed patients, especially those where the inheritance pattern in the pedigree and/or the clinical features of the patient might fit this model. Because of the high number of possible somatic variants called per individual, even after applying stringent filters, it is advisable to restrict the analysis to a set of candidate genes defined according to the clinical phenotype. Finally, our results are in agreement with the existence of clonal haematopoiesis produced by drift, and that can be related to non-cancer disorders.

Data availability
The datasets generated during and analysed during the current study are available in the European Nucleotide Archive (