Detection of somatic mosaicism in non-proliferative cells is a new challenge in genome research; however, the accuracy of current detection strategies remains uncertain due to the lack of a ground truth. Herein, we present a set of ultra-deep whole-exome sequencing (WES) data based on reference standards generated by cell line mixtures, providing a total of 386,613 mosaic single-nucleotide variants (SNVs) and insertion-deletion mutations (INDELs) with variant allele frequencies (VAFs) ranging from 0.5% to 56%, as well as 35,113,417 non-variant and 19,936 germline variant sites as negative controls. Owing to the progressive mixing of cell lines with established genotypes, the whole reference standard set mimics the cumulative aspect of mosaic variant acquisition, such as in the early developmental stage, ultimately unveiling 741 possible inter-sample relationships with respect to variant sharing and asymmetry in VAFs. We expect that our reference data will be essential for optimizing the current use of mosaic variant detection strategies and for developing algorithms to enable future improvements.
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.16970041
Background & Summary
After conception, postzygotic mutations continuously occur throughout life in humans, causing somatic mosaicism in an individual1,2. The variant type, time of origination, and locations of the mosaic mutations result in unique mosaic patterns in a combinatorial manner and further affect phenotypes, including various noncancerous diseases3,4,5,6,7,8,9,10,11,12. Several efforts have, thus, been made to identify the mutational landscape and mechanisms underlying the mosaic mutations13,14,15,16,17.
From a technical aspect, the accurate detection of mutations is at the core of mosaicism research. To date, conventional bulk sequencing has mainly been exploited by utilizing or modifying variant detection algorithms developed for calling clonal variants, such as cancer mutations6,18,19. However, successful application to mosaicism has been obstructed by many challenges, such as low variant allele frequencies (VAF < 10%)14,17,20,21 and ambiguity in the use of a control (e.g., variants can exist in control samples by shared lineages in development)14,17. More fundamentally, there is a severe lack of platforms or materials, known as reference standards, that can be used to measure the detection accuracy of given algorithms22, which amplifies the confusion regarding the optimal use of tools or algorithms and their reliability. Constructing a reference standard is, thus, a critical first step and serves as the basis for analytical validation and benchmarks for germline and somatic mutations23,24,25,26,27,28,29,30. Securing a reference standard for mosaic mutations is therefore urgently needed to enable more advanced research.
Herein, we generated robust, large-scale, and cell line mixture-based reference standards using 386,613 single-nucleotide variants (SNVs) and insertion-deletion mutations (INDELs) as positive controls and 35,133,353 negative control positions. The workflow for generating the standard materials and for variant site identification is displayed in Fig. 1. The overall idea for the construction aligns with our previous study31, as unique germline variants among independent genotypes serve as mosaic variants when mixed in the desired proportions. Initially, six normal cell lines (MRC5, RPE, CCD-18co, HBEC30-KT, THLE-2, and FHC) were prepared and sequenced (1,100× WES) to identify a set of mutually exclusive germline variants. We confirmed that each of those germline variants was present in only one cell line, with explicit reference homozygous genotypes in the other five (see Methods). With MRC5 employed as an internal reference, each of the five remaining cell lines (RPE, CCD-18co, HBEC30-KT, THLE-2, and FHC) had its own unique set of variants, and these sets were called V1 to V5, respectively (Fig. 1a; see Table 1 for the full list). When mixed with MRC5 in different proportions, these unique variants are presented as mosaic mutations at designated VAFs.
The mixing procedure was systematically designed to cover a wide range of VAFs and various variant sharing scenarios (Fig. 1b). Importantly, common (i.e., acquired before the lineage separation of two samples) and lineage-specific (i.e., acquired after the lineage separation) variants compose an internal hierarchical structure of mosaic genotypes in an organism, mimicking the cumulative aspect of mosaic variant acquisition from early (e.g., developmental stage) to late (e.g., recent). RPE was mixed into the internal reference (MRC5) at three different ratios (8, 19.2, and 56%) to enable the presentation of the variants in RPE (V1) at six different VAFs (4, 8, 9.6, 19.2, 28, and 56%), depending on the zygosity (hetero- or homozygous). Similarly, CCD-18co (V2) and HBEC30-KT (V3) were added into the MRC5/RPE mixture at four and six different ratios, respectively. Finally, THLE-2 (V4) and FHC (V5) were added into the MRC5/RPE/HBEC30-KT mixture at two and three different ratios, respectively (Fig. 1b upper). After the procedure, three final classes of products were generated: M1 (the mixture MRC5/RPE/CCD-18co), M2 (MRC5/RPE/HBEC30-KT/THLE-2), and M3 (MRC5/RPE/HBEC30-KT/FHC). M1 contains the variant sets V1 and V2; M2 contains V1, V3, and V4; and M3 contains V1, V3, and V5, whose VAFs vary according to the mixing ratios within the classes. Of the 12 (3 in RPE × 4 in CCD-18co), 36 (3 in RPE × 6 in HBEC30-KT × 2 in THLE-2), and 54 (3 in RPE × 6 in HBEC30-KT × 3 in FHC) possible products in classes M1–M3, 9, 12, and 18 were selected in consideration of redundancy and covering efficiency, and subsequently subjected to ultra-high coverage (1,100×) whole-exome sequencing (WES; see Table 2 for the full list). Overall, 9,657, 7,566, and 11,606 positive control variants were included in M1–M3, respectively, with a wide range of VAFs (0.5–56%), particularly focusing on low frequencies (<10%) (Table 2).
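The relationship between mixing ratio, zygosity, and designated VAF can be illustrated with a minimal sketch. It assumes a diploid, copy-number-neutral site at which every other cell line in the mixture is homozygous reference; the function name `expected_vafs` is hypothetical and is not taken from the published scripts.

```python
def expected_vafs(fraction_pct):
    """Expected VAFs (in percent) for a variant unique to one cell line.

    fraction_pct is that cell line's share of the DNA mixture, in percent.
    Assumes a diploid, copy-number-neutral site where every other cell
    line in the mixture is homozygous reference.
    """
    heterozygous = fraction_pct / 2  # one of two alleles carries the variant
    homozygous = fraction_pct        # both alleles carry the variant
    return heterozygous, homozygous


# RPE mixing ratios quoted in the text (percent of the mixture).
rpe_ratios = [8.0, 19.2, 56.0]
v1_vafs = sorted({vaf for ratio in rpe_ratios for vaf in expected_vafs(ratio)})
print(v1_vafs)  # [4.0, 8.0, 9.6, 19.2, 28.0, 56.0]

# Number of possible products per class, from the counts of mixing ratios.
possible_products = {"M1": 3 * 4, "M2": 3 * 6 * 2, "M3": 3 * 6 * 3}
print(possible_products)  # {'M1': 12, 'M2': 36, 'M3': 54}
```

The six values reproduce the V1 VAFs listed above, and the products reproduce the 12, 36, and 54 possible mixtures.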
Two different types of reference standards are required to enable complete measurement of mosaic detection accuracy, which differ based on the definition of negative controls. Unlike conventional somatic mutations, calling of mosaic variants is susceptible to two different types of errors: (1) calling non-variant sites (e.g., reference allele) and (2) calling germline variants, the latter of which is caused by the unreliability of controls (e.g., variants shared in control samples). Therefore, we provide two different versions of the final sets—set A and set B (Fig. 1b lower). Set A is the sequencing data of the original materials, M1–M3, which uses 35,113,417 non-variant sites as negative controls. Set B is processed data, where the sequencing data (BAM) of non-variant sites are replaced by those of the internal reference (MRC5) to contain 19,936 germline variants; this is because the original germline compositions of MRC5 are altered in set A by the mixing procedure. Accordingly, testing should be carried out in both sets. The final list of negative controls is presented in Table 3.
Finally, our reference standards allow testing under various realistic biological scenarios by mimicking the structure of multiple lineages in the accumulation of mosaic mutations. There are 741 possible ways to select two of the thirty-nine reference data (9 M1, 12 M2, 18 M3), each of which provides distinct inter-sample relationships of variant sharing and their VAF distributions, providing truth sets for shared and nonshared mosaic variant detection. For example, M1 and M2 share the variant set V1 in varied VAF pairs with respect to the selection of the data, whereas V2 is unique in M1, and V3 and V4 are unique in M2. Likewise, M2 and M3 share V1 and V3. In this regard, M2 and M3 are considered closer in the lineage as they have a more recent common ancestor, which can be exploited in more advanced algorithms. The target VAFs display the tendency to decrease in later mutations1,32,33. Exceptions caused by the asymmetric doubling of cells and active replication of stem cells or progenitor cells are also considered3,16. Owing to these features, our data constitute one of the most comprehensive, versatile, and robust reference standards ever constructed for variant analysis.
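As a quick check of the pair count and the sharing structure, the following sketch enumerates all sample pairs and derives the shared and class-specific variant sets from the class compositions given above. The sample labels (e.g., `M1-1`) follow the naming used in the text; the helper `shared_and_unique` is hypothetical.

```python
from itertools import combinations
from math import comb

# Variant sets contained in each class and number of sequenced mixtures.
CLASS_VARIANT_SETS = {
    "M1": {"V1", "V2"},
    "M2": {"V1", "V3", "V4"},
    "M3": {"V1", "V3", "V5"},
}
CLASS_SIZES = {"M1": 9, "M2": 12, "M3": 18}

samples = [f"{cls}-{i}" for cls, n in CLASS_SIZES.items() for i in range(1, n + 1)]
pairs = list(combinations(samples, 2))
print(len(samples), len(pairs), comb(39, 2))  # 39 741 741


def shared_and_unique(class_a, class_b):
    """Return (shared, only_in_a, only_in_b) variant sets for two classes."""
    a, b = CLASS_VARIANT_SETS[class_a], CLASS_VARIANT_SETS[class_b]
    return a & b, a - b, b - a


shared, only_m1, only_m2 = shared_and_unique("M1", "M2")
print(sorted(shared), sorted(only_m1), sorted(only_m2))
# ['V1'] ['V2'] ['V3', 'V4']
```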
Sample collection and preparation
Six immortalized normal cell lines (MRC5, RPE, CCD-18co, HBEC30-KT, THLE-2, FHC) were chosen for the construction of the reference standards, after confirming their stable genotypes with neutral ploidy (see Technical Validation). FHC and THLE-2 cells were purchased from the American Type Culture Collection (ATCC). RPE was purchased from Lonza Bioscience. MRC5 and CCD-18co were purchased from the Korea Cell Line Bank. HBEC30-KT is a transformed cell line of HBEC with two genetic alterations (CDK4, hTERT)34, and its genomic DNA is available upon request. The absence of mycoplasma contamination in all cell lines was verified using the e-Myco VALiD Mycoplasma PCR Detection Kit (LiliF Diagnostics). Cell line authentication was performed using the PowerPlex 18D System (Promega, Cosmogenetech Co., Ltd.) to detect 17 short tandem repeat (STR) loci. The resulting STR profiles were cross-compared and matched with deposited STR information. Since an STR profile was not provided for RPE, which we purchased from Lonza, we report its STR analysis results along with those of the other cell lines in Online-only Table 1.
All cell lines were cultured in a humidified environment in the presence of 5% CO2 at 37 °C. FHC cells were grown in DMEM:F12 (Gibco) with 25 mM HEPES (Gibco), 0.005 mg/mL insulin, 0.005 mg/mL transferrin, 100 ng/mL hydrocortisone, 20 ng/mL human recombinant EGF (Thermo Fisher), 10 ng/mL cholera toxin, 10% fetal bovine serum (Gibco), and 1% penicillin–streptomycin (Invitrogen). THLE-2 cells were grown in BEBM (Lonza) supplemented with BEGM Bronchial Epithelial SingleQuots Kit (excluding GA-1000, Lonza), 10% fetal bovine serum, and 1% penicillin–streptomycin. RPE cells were grown in RtEBM (Lonza) supplemented with RtEGM SingleQuots Supplement Pack (Lonza) and 1% penicillin–streptomycin. MRC5 cells were grown in MEM (Gibco) with 25 mM HEPES, 25 mM NaHCO3, 10% fetal bovine serum, and 1% penicillin–streptomycin. CCD-18co cells were grown in DMEM with L-glutamine (300 mg/L, Gibco), 25 mM HEPES, 25 mM NaHCO3, 10% fetal bovine serum, and 1% penicillin–streptomycin. HBEC30-KT cells were grown in ACL4 media comprising RPMI 1640 medium supplemented with 0.02 mg/mL insulin, 0.01 mg/mL transferrin, 25 nM sodium selenite, 50 nM hydrocortisone, 10 mM HEPES, 1 ng/mL EGF, 0.01 mM ethanolamine, 0.01 mM O-phosphorylethanolamine, 0.1 nM triiodothyronine, 2 mg/mL BSA, 0.5 mM sodium pyruvate, 2% fetal bovine serum, and 1% penicillin–streptomycin.
To achieve the target ratios, mixing was carried out at a DNA level based on the pre-calculated quantities (see Table 2 for final mixture ratios). Genomic DNA was extracted using a QIAamp DNA Mini Kit, according to the manufacturer’s instructions (QIAGEN). A total of 39 mixtures were generated by mixing the genomic DNAs from the six cell lines (see Summary for the procedure). After mixing the genomic DNAs according to the pre-calculated quantities on ice, the mixtures were briefly vortexed, centrifuged, and stored at −20 °C.
Whole exome sequencing
Exome capture was carried out for six cell lines and 39 mixtures using SureSelect Human All Exon V6 (Agilent Technologies, Inc., CA, USA). To minimize duplicate reads in ultra-deep sequencing, sequencing libraries were constructed two (cell lines) to four (mixtures) times for each sample. The quantities of the constructed libraries were evaluated using the 2100 Bioanalyzer System (Agilent Technologies, Inc.). WES was conducted for the six initial cell lines and 39 mixtures using the Illumina NovaSeq 6000 (Theragen Bio Inc.), with a targeted read depth of 1,100×.
Processing of the sequencing data
WES reads in FASTQ data were merged and preprocessed using fastp35 (0.20.0) to trim overrepresented sequences, such as poly G and adaptors. Reads with low complexity (<30%) were filtered out. The overall sequencing quality was inspected using FastQC (version 0.11.7). All passed reads were aligned to the GRCh38 reference genome using BWA-MEM36 (0.7.17). Post-processing, including read group addition, marking of PCR duplicates, fixation of mate information, and recalibration of base quality scores, was applied according to the recommendations of GATK best practices using PICARD (2.23.1) and GATK (4.1.8). We also realigned and left-aligned INDELs with GATK (3.8.1 and 4.1.5, respectively) to synchronize INDEL representation in genotyping. Qualimap 237 (2.2.1) was used to calculate the sequencing coverage. The overall sequencing quality information of the six cell lines and thirty-nine mixtures (set A) is shown in Online-only Table 2, including the average sequencing coverage, mapping quality, GC content, and filtering results during the quality control.
Genotyping of cell lines
Genotyping of the six cell lines was carried out for autosomal chromosomes, except chr5 (excluded because of the copy number variation (CNV) identified in HBEC30-KT; see Technical Validation), using two robust germline variant callers: Strelka238 (2.9.10) and DeepVariant39 (1.0.0). These callers were chosen because they showed high accuracy (e.g., F1 scores) for detecting germline SNVs and INDELs26,40. Mutually exclusive SNVs and INDELs (i.e., variants that exist in only one cell line out of six) were marked as variant sets (V1–V5, see Summary) and were further considered as mosaic variants after mixing.
For SNVs, mutually exclusive variants were collected using the following criteria: (1) variants that were called by both callers and passed the default filtration; (2) variants that were called in only one of the cell lines, with the other five cell lines being genotyped as reference homozygous (i.e., no-call is not allowed); and (3) variants with no signs of copy number alteration (|log2 copy number ratio| < 0.3 from cnvkit41). For INDELs, similar criteria were applied with an additional rescuing procedure, where single calls (out of two callers) were manually inspected using the Integrative Genomics Viewer42 (IGV) because of the low concordance among callers26. Finally, mutually exclusive variants that passed all criteria in RPE, CCD-18co, HBEC30-KT, THLE-2, and FHC were called V1, V2, V3, V4, and V5, respectively (see Summary). At the same time, positions confirmed as reference homozygous (rather than no-call) by both germline callers in all six cell lines were collected as candidates for negative controls. In addition, genotyping of the internal reference (MRC5) was conducted and its variants were listed for further processing.
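The SNV selection criteria can be summarized in a short sketch. It is illustrative only and makes several assumptions that are not in the text: genotypes are encoded as `"ref"`, `"het"`, `"hom"`, or `None` (no-call) and already represent the consensus of both callers after default filtering; a single per-site log2 copy number ratio is supplied (how the ratio is aggregated across cell lines is not specified); and the function and label names are hypothetical.

```python
# Mapping of each non-reference cell line to its variant set (from the text).
CELL_LINE_TO_SET = {
    "RPE": "V1",
    "CCD-18co": "V2",
    "HBEC30-KT": "V3",
    "THLE-2": "V4",
    "FHC": "V5",
}
ALL_LINES = ["MRC5", *CELL_LINE_TO_SET]


def classify_site(genotypes, log2_cn_ratio):
    """Classify one site from the consensus genotypes of the six cell lines.

    Returns "V1".."V5" for a mutually exclusive variant,
    "negative_candidate" for a site that is homozygous reference in all
    six lines, or None if the site does not qualify.
    """
    # Criterion 2: no-calls are not allowed in any cell line.
    if any(genotypes.get(line) is None for line in ALL_LINES):
        return None
    carriers = [line for line in ALL_LINES if genotypes[line] in ("het", "hom")]
    if not carriers:
        return "negative_candidate"
    # The variant must be present in exactly one non-reference cell line.
    if len(carriers) != 1 or carriers[0] == "MRC5":
        return None
    # Criterion 3: no sign of copy number alteration.
    if abs(log2_cn_ratio) >= 0.3:
        return None
    return CELL_LINE_TO_SET[carriers[0]]


site = {line: "ref" for line in ALL_LINES}
site["THLE-2"] = "het"
print(classify_site(site, log2_cn_ratio=0.05))  # V4
```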
Finalizing reference standard sets
Genotypes of the 39 mixtures (within M1, M2, and M3) were theoretically pre-fixed by the genotypes of the six cell lines and their mixture compositions. To finalize the reference standard sets, we conducted a series of post-filtration procedures to remove sites that significantly deviated from the expected coverage and VAFs, particularly from extrinsic and systematic errors. The procedures were applied to two different sets: set A and set B (see Summary) (Fig. 1c).
Reference standard with non-variant sites as the negative control (set A)
Set A consists of the sequencing data of the 39 mixtures themselves, with the reference homozygous sites identified from the genotyping of the six cell lines serving as negative controls. Therefore, the finalization of set A only required a few additional filtration steps.
Preprocessed sequencing data were used for the final confirmation of positive control sites based on two filtration criteria: (1) sequencing coverage and (2) variant coverage. Regarding sequencing coverage, raw allele counts were calculated in all targeted positions using SAMtools43 mpileup (1.10), ignoring soft or hard clips. For each variant site, the mean coverage of the 39 samples was calculated, and low coverage sites (<40×) were removed; these sites should theoretically be variant positions but cannot be used as positive controls because of the low sequencing coverage. The threshold (40×) for sequencing coverage was determined to secure the number of positive controls as well as the quality of the reference data. With one alternative allele at a 40× position, the smallest VAF that can be generated is 2.5%, and for all variant sets (V1–V5), the proportion of designated VAFs larger than 2.5% among the total in each variant set is at least 50% (V1: 100%, V2: 55%, V3: 100%, V4: 50%, V5: 50%). Regarding variant coverage, for each variant v, variant coverage was defined as (number of samples that actually harbored v)/(number of samples designed to harbor v). Variants with low variant coverage (<20%) were considered to be affected by low sequencing efficiency and were, thus, removed. For non-variant (negative control) sites, positions with an average coverage of <20× were removed. Moreover, non-variant positions with more than three high-quality (BQ ≥ 30) alternative alleles were filtered out to prevent any interference from experimental or systematic bias (e.g., small subclones generated in the original cell lines), as opposed to sequencing artifacts. Consequently, the sequencing artifacts that remain in the non-variant negative controls are expected at VAFs under 10%, the range in which accurate detection of mosaic variants is hampered20.
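The set A filters described above can be sketched as follows. The thresholds (40×, 20%, 20×, three BQ ≥ 30 alternative alleles) come from the text; everything else is an assumption for illustration: a sample is taken to "harbor" a variant when it shows at least one alternative read, the high-quality alternative allele count is passed in as a single pre-aggregated number, and all function names are hypothetical.

```python
def min_detectable_vaf(depth):
    """Smallest non-zero VAF at a given depth: one alternative read."""
    return 1 / depth


def variant_coverage(alt_counts_in_designed_samples):
    """Fraction of samples designed to harbor a variant that actually do.

    Assumption: a sample harbors the variant if it has >= 1 alternative read.
    """
    counts = alt_counts_in_designed_samples
    return sum(1 for c in counts if c > 0) / len(counts)


def keep_positive_control(depths, alt_counts_in_designed_samples,
                          min_mean_depth=40, min_variant_coverage=0.20):
    """Apply the sequencing coverage and variant coverage filters."""
    mean_depth = sum(depths) / len(depths)
    return (mean_depth >= min_mean_depth
            and variant_coverage(alt_counts_in_designed_samples)
            >= min_variant_coverage)


def keep_negative_control(depths, high_quality_alt_count,
                          min_mean_depth=20, max_high_quality_alt=3):
    """Keep a non-variant site with enough depth and at most three
    high-quality (BQ >= 30) alternative alleles."""
    mean_depth = sum(depths) / len(depths)
    return (mean_depth >= min_mean_depth
            and high_quality_alt_count <= max_high_quality_alt)


print(min_detectable_vaf(40))                             # 0.025
print(keep_positive_control([50] * 39, [0, 0, 1, 2, 0]))  # True
print(keep_negative_control([900] * 39, 5))               # False
```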
Reference standard with germline variants as the negative controls (set B)
Unlike set A, set B requires an additional process to replace the germline variant sites of the mixtures with those of the internal reference (MRC5). First, we generated thirty-nine baseline BAM files for set B by down-sampling the MRC5 BAM file to 1,100× 39 times, each with a random seed, using PICARD DownsampleSam (2.23.1). Then, all reads covering the positive control positions in each of the thirty-nine set A data (e.g., V1 and V2 positions for M1 data) were extracted using bedtools44 (2.28.0). At the same time, MRC5 reads at the same positions were removed from the down-sampled baseline data. Finally, we merged the reads extracted from each of the thirty-nine set A data with the corresponding down-sampled MRC5 data from which the reads in exactly the same regions had been removed. Before the replacement, we verified that the sequenced fragment length, GC content, and base quality were comparable between the two types of data, namely the WES reads of MRC5 and those of the 39 mixtures. Consequently, with the replacement, mosaic variants and germline variants of MRC5 coexist within set B.
Post-filtration similar to that performed for set A was applied to set B. First, the same sequencing coverage filtration was applied. Second, the VAF at each germline variant site was assessed to filter out heterozygous sites that violate the beta-binomial distribution [α = 74, β = 76, calculated from MRC5 heterozygous single-nucleotide polymorphisms (SNPs); two-tailed, p < 0.01] and homozygous sites with VAF < 0.9, to account for over-dispersion and capture bias in WES. Lastly, variant coverage was calculated to remove germline variants that were missing in any of the mixture samples (variant coverage < 1).
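A minimal sketch of the germline VAF filter follows, using the α = 74 and β = 76 parameters given above. The text does not state how the two-tailed p-value was computed; the sketch assumes twice the smaller tail probability, capped at 1, and the function names are hypothetical.

```python
from math import exp, lgamma


def _log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, a, b):
    """Beta-binomial probability of k alternative reads out of n."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + _log_beta(k + a, n - k + b) - _log_beta(a, b))


def two_tailed_p(k, n, a=74.0, b=76.0):
    """Assumed definition: 2 * min(P(X <= k), P(X >= k)), capped at 1."""
    lower = sum(betabinom_pmf(i, n, a, b) for i in range(0, k + 1))
    upper = sum(betabinom_pmf(i, n, a, b) for i in range(k, n + 1))
    return min(1.0, 2 * min(lower, upper))


def keep_heterozygous_site(alt_reads, depth, p_threshold=0.01):
    """Keep a heterozygous germline site unless p < 0.01."""
    return two_tailed_p(alt_reads, depth) >= p_threshold


def keep_homozygous_site(alt_reads, depth, min_vaf=0.9):
    """Keep a homozygous germline site unless VAF < 0.9."""
    return alt_reads / depth >= min_vaf


print(keep_heterozygous_site(545, 1100))  # True: VAF close to 74/150
print(keep_heterozygous_site(110, 1100))  # False: VAF of 0.1 is far too low
print(keep_homozygous_site(1050, 1100))   # True: VAF above 0.9
```

With these parameters the expected heterozygous VAF is 74/150 ≈ 0.493, slightly below 0.5, consistent with the capture bias the filter is meant to accommodate.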
The raw WES FASTQ files of 6 cell lines and 39 mixtures are available from the Sequence Read Archive under the accession code [PRJNA758606]45. Thirty-nine pairs of set A and set B are also available in BAM file format to be readily applied for evaluation of methods. Positive and negative controls of mosaic reference standards are available in GitHub46. The expected VAFs and compositions of positive controls in each sample are presented in Table 2.
Validation of normal cell line stability
We used six normal immortalized cell lines for stability and reproducibility, as they do not continuously acquire small and large variants during cell culture, unlike cancer cell lines. The distribution of heterozygous SNPs detected using Strelka2 and annotated with gnomAD (v2.1.1) showed a single peak at VAF 0.5 in all six cell lines, demonstrating the monoclonality of the materials (Fig. 2a). As positive controls were constructed by mixing independent cell lines, it was important to validate their diploid genotypes. We therefore examined copy number and found that all six cell lines appeared to be copy number neutral overall, except for the sex chromosomes and the entire chromosome 5 of HBEC30-KT, as commonly observed47 (Fig. 2b). The unique germline variants used for the positive controls were selected from copy number neutral regions through CNV analysis (Methods).
Sequencing quality validation
We validated the 45 WES datasets generated in this study, comprising the sequencing reads of 6 cell lines and 39 mixtures. We calculated the percentage of bases with phred-scaled base quality over 30, obtaining an average value of 93.93% and a minimum of 91.82% among all data. The average GC content was 49.87%, with a maximum of 51.27%, indicating a very low rate of bias during library preparation. FastQC and Qualimap were also applied to validate multiple quality metrics of the sequenced reads. Bases at read ends maintained a steadily high base quality over 30. Data of both cell lines and mixtures showed high coverage, with more than 1,100× on average (Fig. 2c). We provided WES data with high coverage and quality for cell lines as well as set A to collect reliable germline variants and remove somatic variants with high VAF, which could serve as confounding factors when selected as positive controls. The mean mapping quality and the percentage of high-quality bases (BQ ≥ 30) of set A are shown in Fig. 2d. We also compared multiple features of reads from the 39 set A data and the MRC5 data, which were merged when generating set B. No significant differences were found, indicating that set B is unlikely to carry bias from its two different sources (Fig. 2e).
Quality validation of positive and negative controls
First, to validate the quality of positive controls, we investigated the correlations between the expected VAFs of the design and the observed VAFs in set A. Both SNVs and INDELs across the entire range of VAFs showed a high Pearson correlation coefficient between expected VAFs and the median value of observed VAFs among all positions with the same expected VAF (r = 0.97, p < 2.2e-16 and r = 0.91, p < 2.2e-16, respectively, shown in log10 scale in Fig. 3a). This suggests that secure collection of germline variants (utilized as positive controls) within high coverage data (1,100×) could eliminate the possible ambiguity in the reference data, which can originate from sub-clonal mutations acquired during cell culture. Thereafter, we assessed the distribution of germline negative controls in set B. The distribution of heterozygous and homozygous SNPs and INDELs is shown in Fig. 3b. The length of INDELs in positive controls and germline negative controls demonstrated a similar distribution, indicating that they could be comparably applied to variant callers for performance evaluation (Fig. 3c). The counts of INDELs were similar between the two groups, and most INDELs had a length smaller than 5 base pairs. Finally, we identified the quantitative and qualitative aspects of non-variant negative controls in set A. The raw alternative alleles were counted using SAMtools mpileup.
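The expected-versus-observed comparison can be sketched as below: observed VAFs are grouped by their expected VAF, the median of each group is taken, and the Pearson coefficient is computed between the expected values and the medians. The input values are made-up toy numbers for illustration, not data from this study, and the sketch computes the correlation on untransformed values, which is an assumption (the text only states that Fig. 3a is drawn in log10 scale).

```python
from collections import defaultdict
from math import sqrt
from statistics import median


def median_observed_by_expected(sites):
    """Group (expected_vaf, observed_vaf) pairs and take group medians."""
    by_expected = defaultdict(list)
    for expected, observed in sites:
        by_expected[expected].append(observed)
    expected_levels = sorted(by_expected)
    medians = [median(by_expected[e]) for e in expected_levels]
    return expected_levels, medians


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy example only: (expected VAF %, observed VAF %) per site.
toy_sites = [
    (1.0, 0.9), (1.0, 1.1), (1.0, 1.0),
    (4.0, 3.8), (4.0, 4.1), (4.0, 4.0),
    (8.0, 8.3), (8.0, 7.9), (8.0, 8.0),
]
levels, medians = median_observed_by_expected(toy_sites)
print(levels, medians)  # [1.0, 4.0, 8.0] [1.0, 4.0, 8.0]
r = pearson_r(levels, medians)
```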
Notably, approximately one-third of the total target positions (a median of 10,202,428 across the 39 reference data) were found to have more than one unexpected alternative allele in the non-variant positions (negative controls of set A) in our ultra-high depth data (1,100×). In other words, abundant artifacts (unexpected alternative alleles produced during the sequencing process) were captured, owing to the multiple independent high-coverage sequencing runs of the biological reference standards. Since detecting mosaic variants with low allele frequencies is extremely challenging, investigating those sites, which contain various read features, would yield meaningful information for their accurate detection. For instance, Fig. 3d shows those sites within chromosome 1 of a randomly selected sample (M2–5) with their base qualities and VAFs. They had a wide range of base quality, from 0 to 80, and artifacts were concentrated at VAFs near 0.001 with a base quality of zero. However, a notable number of artifacts were found with high base quality, and the detrimental effect of these artifacts is assumed to be greater in data with low sequencing depth.
Each pair of reference data, namely, set A and set B, can be applied to detection methods, and the resultant variant calls and their properties can be assessed via a comparison with the list of positive and negative controls provided in GitHub46. This enables evaluation of the true positive calls as well as both types of false positives, based on the two types of negative controls: artifacts from set A and germline variants from set B. We recommend exploiting the abundant reference data provided for robust evaluation. Although a remarkable number of mosaic variants with varied VAFs (especially lower than 10%) can be provided by means of cell line mixture-based reference standards, each dataset contains variants at a limited number of expected VAFs (e.g., M1-1 has mosaic variants at four expected allele frequencies: 1%, 2%, 4%, and 8%). Hence, it is essential to select data with an unbiased VAF distribution for a given application. The variant compositions as well as allele frequencies of the complete set of samples are shown in Table 2.
The provided reference data can be utilized for versatile analyses of mosaicism detection. For example, down-sampling of the ultra-deep WES data (1,100×) will reveal the detection accuracy at a lower depth of interest, showing how the sequencing coverage affects the performance of given methods. Variants with diversified VAFs in the provided data also help to reveal the thresholds of sequencing coverage for detecting low-VAF variants. In addition, the accuracy of shared and sample-specific mosaic variant detection can be assessed under varied inter-sample VAF relationships, which offers opportunities to evaluate and develop new detection algorithms for shared and sample-specific variants. For instance, the thirty-nine reference datasets allow the assessment of up to 741 combinations by selecting two samples. Likewise, shared variant analysis among three or more samples is possible in an even larger number of cases. A confident set of controls supports robust evaluations, and consequently, the reference data provide valuable opportunities for analyzing various aspects that should be considered in mosaic variant calling.
The scripts used for constructing reference standards are available in a public repository GitHub46 (https://github.com/Yonsei-TGIL/Mosaic-Reference-Standards.git) and are accompanied by markdowns for a step-by-step description.
Thorpe, J., Osei-Owusu, I. A., Avigdor, B. E., Tupler, R. & Pevsner, J. Mosaicism in Human Health and Disease. Annu Rev Genet 54, 487–510, https://doi.org/10.1146/annurev-genet-041720-093403 (2020).
Martincorena, I. & Campbell, P. J. Somatic mutation in cancer and normal cells. Science 349, 1483–1489, https://doi.org/10.1126/science.aab4082 (2015).
Breuss, M. W. et al. Autism risk in offspring can be assessed through quantification of male sperm mosaicism. Nat Med 26, 143–150, https://doi.org/10.1038/s41591-019-0711-0 (2020).
D’Gama, A. M. & Walsh, C. A. Somatic mosaicism and neurodevelopmental disease. Nat Neurosci 21, 1504–1514, https://doi.org/10.1038/s41593-018-0257-3 (2018).
Freed, D. & Pevsner, J. The Contribution of Mosaic Variants to Autism Spectrum Disorder. PLoS Genet 12, e1006245, https://doi.org/10.1371/journal.pgen.1006245 (2016).
Lim, E. T. et al. Rates, distribution and implications of postzygotic mosaic mutations in autism spectrum disorder. Nat Neurosci 20, 1217–1224, https://doi.org/10.1038/nn.4598 (2017).
Rodin, R. E. et al. The landscape of somatic mutation in cerebral cortex of autistic and neurotypical individuals revealed by ultra-deep whole-genome sequencing. Nat Neurosci 24, 176–185, https://doi.org/10.1038/s41593-020-00765-6 (2021).
de Kock, L. et al. High-sensitivity sequencing reveals multi-organ somatic mosaicism causing DICER1 syndrome. J Med Genet 53, 43–52, https://doi.org/10.1136/jmedgenet-2015-103428 (2016).
Park, J. S. et al. Brain somatic mutations observed in Alzheimer’s disease associated with aging and dysregulation of tau phosphorylation. Nat Commun 10, 3090, https://doi.org/10.1038/s41467-019-11000-7 (2019).
Singh, S. M., Castellani, C. A. & Hill, K. A. Postzygotic Somatic Mutations in the Human Brain Expand the Threshold-Liability Model of Schizophrenia. Front Psychiatry 11, 587162, https://doi.org/10.3389/fpsyt.2020.587162 (2020).
Serra, E. G. et al. Somatic mosaicism and common genetic variation contribute to the risk of very-early-onset inflammatory bowel disease. Nat Commun 11, 995, https://doi.org/10.1038/s41467-019-14275-y (2020).
Zhu, M. et al. Somatic Mutations Increase Hepatic Clonal Fitness and Regeneration in Chronic Liver Disease. Cell 177, 608–621 e612, https://doi.org/10.1016/j.cell.2019.03.026 (2019).
Abyzov, A. et al. One thousand somatic SNVs per skin fibroblast cell set baseline of mosaic mutational load with patterns that suggest proliferative origin. Genome Res 27, 512–523, https://doi.org/10.1101/gr.215517.116 (2017).
Bae, T. et al. Different mutational rates and mechanisms in human cells at pregastrulation and neurogenesis. Science 359, 550–555, https://doi.org/10.1126/science.aan8690 (2018).
Ju, Y. S. et al. Somatic mutations reveal asymmetric cellular dynamics in the early human embryo. Nature 543, 714–718, https://doi.org/10.1038/nature21703 (2017).
Moore, L. et al. The mutational landscape of normal human endometrial epithelium. Nature 580, 640–646, https://doi.org/10.1038/s41586-020-2214-z (2020).
Huang, A. Y. et al. Distinctive types of postzygotic single-nucleotide mosaicisms in healthy individuals revealed by genome-wide profiling of multiple organs. PLoS Genet 14, e1007395, https://doi.org/10.1371/journal.pgen.1007395 (2018).
Martincorena, I. et al. Tumor evolution. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886, https://doi.org/10.1126/science.aaa6806 (2015).
Manheimer, K. B. et al. Robust identification of mosaic variants in congenital heart disease. Hum Genet 137, 183–193, https://doi.org/10.1007/s00439-018-1871-6 (2018).
Dou, Y., Gold, H. D., Luquette, L. J. & Park, P. J. Detecting Somatic Mutations in Normal Cells. Trends Genet 34, 545–557, https://doi.org/10.1016/j.tig.2018.04.003 (2018).
McConnell, M. J. et al. Intersection of diverse neuronal genomes and neuropsychiatric disease: The Brain Somatic Mosaicism Network. Science 356, https://doi.org/10.1126/science.aal1641 (2017).
Hardwick, S. A., Deveson, I. W. & Mercer, T. R. Reference standards for next-generation sequencing. Nat Rev Genet 18, 473–484, https://doi.org/10.1038/nrg.2017.44 (2017).
Krishnan, V. et al. Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC Bioinformatics 22, 85, https://doi.org/10.1186/s12859-020-03934-3 (2021).
Cornish, A. & Guda, C. A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference. Biomed Res Int 2015, 456479, https://doi.org/10.1155/2015/456479 (2015).
Chen, Z. et al. Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency. Sci Rep 10, 3501, https://doi.org/10.1038/s41598-020-60559-5 (2020).
Chen, J., Li, X., Zhong, H., Meng, Y. & Du, H. Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers. Sci Rep 9, 9345, https://doi.org/10.1038/s41598-019-45835-3 (2019).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol 37, 555–560, https://doi.org/10.1038/s41587-019-0054-x (2019).
Zhao, S., Agafonov, O., Azab, A., Stokowy, T. & Hovig, E. Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci Rep 10, 20222, https://doi.org/10.1038/s41598-020-77218-4 (2020).
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 38, 1347–1355, https://doi.org/10.1038/s41587-020-0538-8 (2020).
Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 32, 246–251, https://doi.org/10.1038/nbt.2835 (2014).
Kim, J. et al. The use of technical replication for detection of low-level somatic mutations in next-generation sequencing. Nat Commun 10, 1047, https://doi.org/10.1038/s41467-019-09026-y (2019).
Youssoufian, H. & Pyeritz, R. E. Mechanisms and consequences of somatic mosaicism in humans. Nat Rev Genet 3, 748–758, https://doi.org/10.1038/nrg906 (2002).
Fernandez, L. C., Torres, M. & Real, F. X. Somatic mosaicism: on the road to cancer. Nat Rev Cancer 16, 43–55, https://doi.org/10.1038/nrc.2015.1 (2016).
Sato, M. et al. Human lung epithelial cells progressed to malignancy through specific oncogenic manipulations. Mol Cancer Res 11, 638–650, https://doi.org/10.1158/1541-7786.MCR-12-0634-T (2013).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760, https://doi.org/10.1093/bioinformatics/btp324 (2009).
Okonechnikov, K., Conesa, A. & Garcia-Alcalde, F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32, 292–294, https://doi.org/10.1093/bioinformatics/btv566 (2016).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods 15, 591–594, https://doi.org/10.1038/s41592-018-0051-x (2018).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 36, 983–987, https://doi.org/10.1038/nbt.4235 (2018).
Cooke, D. P., Wedge, D. C. & Lunter, G. A unified haplotype-based method for accurate and comprehensive variant calling. Nat Biotechnol 39, 885–892, https://doi.org/10.1038/s41587-021-00861-3 (2021).
Talevich, E., Shain, A. H., Botton, T. & Bastian, B. C. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Comput Biol 12, e1004873, https://doi.org/10.1371/journal.pcbi.1004873 (2016).
Robinson, J. T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24–26, https://doi.org/10.1038/nbt.1754 (2011).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079, https://doi.org/10.1093/bioinformatics/btp352 (2009).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842, https://doi.org/10.1093/bioinformatics/btq033 (2010).
NCBI BioProject https://identifiers.org/ncbi/bioproject:PRJNA758606 (2021).
Yoo-Jin Ha, J. K., Kim, J. & Kim, S. Yonsei-TGIL/Mosaic-Reference-Standards: (v1.0.1). Zenodo https://doi.org/10.5281/zenodo.5338953 (2021).
Ramirez, R. D. et al. Immortalization of human bronchial epithelial cells in the absence of viral oncoproteins. Cancer Res 64, 9027–9034, https://doi.org/10.1158/0008-5472.CAN-04-3703 (2004).
This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A2C2008050), the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (HI14C1324), and the Lung Cancer SPORE P50 (CA070907).
J.D.M. receives licensing fees from the NIH and UTSW for distributing human cell lines.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.
Cite this article
Ha, YJ., Oh, M.J., Kim, J. et al. Establishment of reference standards for multifaceted mosaic variant analysis. Sci Data 9, 35 (2022). https://doi.org/10.1038/s41597-022-01133-8