Establishment of reference standards for multifaceted mosaic variant analysis

Detection of somatic mosaicism in non-proliferative cells is a new challenge in genome research, however, the accuracy of current detection strategies remains uncertain due to the lack of a ground truth. Herein, we sought to present a set of ultra-deep sequenced WES data based on reference standards generated by cell line mixtures, providing a total of 386,613 mosaic single-nucleotide variants (SNVs) and insertion-deletion mutations (INDELs) with variant allele frequencies (VAFs) ranging from 0.5% to 56%, as well as 35,113,417 non-variant and 19,936 germline variant sites as a negative control. The whole reference standard set mimics the cumulative aspect of mosaic variant acquisition such as in the early developmental stage owing to the progressive mixing of cell lines with established genotypes, ultimately unveiling 741 possible inter-sample relationships with respect to variant sharing and asymmetry in VAFs. We expect that our reference data will be essential for optimizing the current use of mosaic variant detection strategies and for developing algorithms to enable future improvements.

those germline variants to be unique in only one cell line with explicit reference homozygous genotypes in the other five (see Methods). When MRC5 was employed as an internal reference, each of the five remaining cell lines (RPE, CCD-18co, HBEC30-KT, THLE-2, and FHC) had a unique set of variants among all, and were called V1 to V5, respectively ( Fig. 1a; see Table 1 for the full list). When mixed with MRC5 in different proportions, these unique variants are presented as mosaic mutations at designated VAFs.
Two different types of reference standards are required to enable complete measurement of mosaic detection accuracy, which differ based on the definition of negative controls. Unlike conventional somatic mutations, calling of mosaic variants is susceptible to two different types of errors: (1) calling non-variant sites (e.g., reference allele) and (2) calling germline variants, the latter of which is caused by the unreliability of controls (e.g., variants shared in control samples). Therefore, we provide two different versions of the final sets-set A and set B (Fig. 1b  lower). Set A is the sequencing data of the original materials, M1-M3, which uses 35,113,417 non-variant sites as negative controls. Set B is processed data, where the sequencing data (BAM) of non-variant sites are replaced by those of the internal reference (MRC5) to contain 19,936 germline variants; this is because the original germline compositions of MRC5 are altered in set A by the mixing procedure. Accordingly, testing should be carried out in both sets. The final list of negative controls is presented in Table 3.
Finally, our reference standards allow testing under various realistic biological scenarios by mimicking the structure of multiple lineages in the accumulation of mosaic mutations. There are 741 possible ways to select two within thirty-nine reference data (9 M1, 12 M2, 18 M3), each of which provides distinct inter-sample relationships of variant sharing and their VAF distributions, providing a truth sets for shared and nonshared mosaic variant detection. For example, M1 and M2 share the variant set V1 in varied VAF pairs in respect to the selection of the data, whereas V2 is unique in M1, and V3 and V4 are unique in M2. Likewise, M2 and M3 share V1 and V3. In this regard, M2 and M3 are considered closer in the lineage as they have a more recent common ancestor, which can be exploited in more advanced algorithms. The target VAFs display the tendency to decrease in later mutations 1,32,33 . Exceptions caused by the asymmetric doubling of cells and active replication of stem cells or progenitor cells are also considered 3,16 . Owing to these features, our data constitute one of the most comprehensive, versatile, and robust reference standards ever constructed for variant analysis.

Methods
Sample collection and preparation. Six immortalized normal cell lines (MRC5, RPE, CCD-18co, HBEC30-KT, THLE-2, FHC) were chosen for the construction the reference standards, after confirming their stable genotypes with neutral ploidy, (see Technical Validation). FHC and THLE-2 cells were purchased from the American Type Culture Collection (ATCC). RPE was purchased from Lonza Bioscience. MRC5 and CCD-18co were purchased from the Korea Cell Line Bank. HBEC30-KT is a transformed cell line of HBEC with two genetic alterations (CDK4, hTERT) 34 , and its genomic DNA is available under request. The absence of mycoplasma contamination in all cell lines was verified using the e-Myco VALiD Mycoplasma PCR Detection Kit (LiliF Diagnostics). Cell line authentication was performed using the PowerPlex 18D System (Promega, Cosmogenetech Co., Ltd.) to detect 17 short tandem repeat (STR) loci. The resulting STR profiles were cross-compared and matched with deposited STR information. Since STR profile for RPE, which we purchased from Lonza, was not provided, we attached its STR analysis results along with other cell lines in Online-only Table 1.
To achieve the target ratios, mixing was carried out at a DNA level based on the pre-calculated quantities (see Table 2 for final mixture ratios). Genomic DNA was extracted using a QIAamp DNA Mini Kit, according to the manufacturer's instructions (QIAGEN). A total of 39 mixtures were generated by mixing the genomic DNAs from the six cell lines (see Summary for the procedure). After mixing the genomic DNAs according to the pre-calculated quantities on ice, the mixtures were briefly vortexed, centrifuged, and stored at −20 °C.  www.nature.com/scientificdata www.nature.com/scientificdata/ Whole exome sequencing. Exome capture was carried out for six cell lines and 39 mixtures using SureSelect Human All Exon V6 (Agilent Technologies, Inc., CA, USA). To minimize duplicate reads in ultra-deep sequencing, sequencing libraries were constructed two (cell lines) to four (mixture) times for each sample. The quantities of the constructed libraries were evaluated using the 2100 Bioanalyzer Systems (Agilent Technologies, Inc). WES was conducted for the six initial cell lines and 39 mixtures using Illumina NovaSeq. 6000 (Theragen Bio Inc.), with targeted read depth of 1,100×.
processing of the sequencing data. WES reads in FASTQ data were merged and preprocessed using fastp 35 (0.20.0) to trim overrepresented sequences, such as poly G and adaptors. Reads with low complexity (<30%) were filtered out. The overall sequencing quality was inspected using FastQC (version 0.11.7). All passed reads were aligned to the GRCh38 reference genome using BWA-MEM 36 (0.7.17). Post-processing, including read group addition, marking PCR duplicates, fixation of mate information, and recalibration of base quality score was applied according to the recommendations of GATK best practices using PICARD (2.23.1) and GATK (4.1.8). We also realigned and left-aligned INDELs with GATK (3.8.1 and 4.1.5, respectively) to synchronize INDEL expression in genotyping. Qualimap 2 37 (2.2.1) was used to calculate the sequencing coverage. The overall sequencing quality information of six cell lines and thirty-nine mixtures (set A) is shown in the Online-only Table 2, including the average sequencing coverage, mapping quality, GC contents, and filtering results during the quality control. For SNVs, mutually exclusive variants were collected using the following criteria: (1) variants that were called in both callers and passed the default filtration; (2) variants that were called in only one of the cell lines, with the other five cells being genotyped reference homozygous (i.e., no-call is not allowed); and (3) variants with no signs of copy number alteration (log2 copy number ratio < |0.3| from cnvkit 41 ). For INDELs, similar criteria were applied with an additional rescuing procedure, where single calls (out of two callers) were manually inspected using the Integrative Genomics Viewer 42 (IGV) for the low concordance among callers 26 . Finally, mutually exclusive variants that passed all criteria in RPE, CCD-18co, HBEC30-KT, THLE-2, and FHC were called V1, V2, V3, V4, and V5, respectively (see Summary). At the same time, positions confirmed as reference homozygous (rather than no-call) by both germline callers in all six cell lines have been collected as candidates for negative control. Also, genotyping of the internal reference (MRC5) was conducted and listed for further processes.
Finalizing reference standard sets. Genotypes of the 39 mixtures (within M1, M2, and M3) were theoretically pre-fixed by the genotypes of the six cell lines and their mixture compositions. To finalize the reference standard sets, we conducted a series of post-filtration procedures to remove sites that significantly deviated from the expected coverage and VAFs, particularly from extrinsic and systematic errors. The procedures were applied to two difference sets: set A and set B (see Summary) (Fig. 1c).

Reference standard with non-variant sites as the negative control (set A).
Set A is basically the sequencing data of the 39 mixtures themselves with reference homozygous sites as negative controls that are identified from the genotyping of the six cell lines. Therefore, the finalization of set A only required a few additional filtration steps.
Preprocessed sequencing data were used for the final confirmation of control positive sites based on two filtration criteria: (1) sequencing coverage and (2) variant coverage. Regarding sequencing coverage, raw allele counts were calculated in all targeted positions using SAMtools 43 mplileup (1.10), ignoring soft or hard clips. For each variant site, the mean coverage of the 39 samples was calculated, and low coverage sites (<40×) were removed; these sites should theoretically be variant positions but cannot be used as positive controls because of the low-sequencing coverage. The threshold (40×) for sequencing coverage was determined to secure the number of positive controls as well as the quality of the reference data. With one alternative allele in 40× position, the smallest VAF that can be generated would be 2.5%, and for all variant sets (V1-V5), the proportion of designated VAFs larger than 2.5% among the total in each variant set exceed 50% (V1: 100%, V2: 55%, V3: 100%, V4: 50%, V5: 50%). Regarding variant coverage, for each variant v, variant coverage was defined as (number of samples that actually harbored v)/(number of samples designed to harbor v). Variants with low variant coverage (<20%) were considered to be affected by low-sequencing efficiency and were, thus, removed. For non-variant (negative control) sites, positions with an average coverage of <20× were removed. Moreover, non-variant positions with more than three high-quality (BQ ≥ 30) alternative alleles were filtered out to prevent any interference from experimental or systematic bias (e.g., small subclones generated in the original cell lines), rather than sequencing artifacts. Consequently, sequencing artifacts are projected in VAFs under 10% in negative controls non-variant negative controls, where accurate detection of mosaic variants is hampered 20 .
Reference standard with germline variants as the negative controls (set B). Unlike set A, set B requires an additional process to replace germline variant sites of mixtures with those of internal reference (MRC5). First, we generated thirty-nine baseline-bam files for set B, by down sampling the MRC5 bam file into 1,100×, with random seed for 39 times using PICARD DownsampleSam (2.23.1). Then, all reads embedding the positive control positions in each of thirty-nine of set A (e.g., V1 and V2 positions for M1 data), were extracted using www.nature.com/scientificdata www.nature.com/scientificdata/ bedtools 44 (2.28.0). At the same time, MRC5 reads in the same positions were removed from the down-sampled baseline data. Finally, we merged the extracted reads from each of the thirty-nine set A with the down-sampled MRC5 data where the reads in the exactly same regions were removed. Before the replacement, we verified that the sequenced fragment length, GC content, and quality of bases were comparable for the two types of data, WES reads of MRC5, and 39 mixtures. Consequently, mosaic variants and germline variants of MRC5 coexisted within set B with the replacement.
A similar post-filtration performed for set A was applied to set B. First, sequencing coverage filtration was equally applied. Second, the VAF in each germline variant site was assessed to filter out sites that violate beta-binomial distribution for heterozygote [74, 76 for α, β calculated from MRC5 heterozygous single-nucleotide polymorphisms (SNPs), two tailed, p < 0.01] and homozygote (VAF < 0.9) to consider over-dispersion and capture bias in WES. Lastly, variant coverage was calculated to remove germline variants that were missing in any of mixture samples (variant coverage < 1).

Data records
The raw WES FASTQ files of 6 cell lines and 39 mixtures are available from the Sequence Read Archive under the accession code [PRJNA758606] 45 . Thirty-nine pairs of set A and set B are also available in BAM file format to be readily applied for evaluation of methods. Positive and negative controls of mosaic reference standards are available in GitHub 46 . The expected VAFs and compositions of positive controls in each sample are presented in Table 2.

Technical Validation
Validation of normal cell line stability. We used six normal immortalized cell lines for stability and reproducibility, as they do not continuously acquire small and large variants during cell culture, unlike cancer cell lines. The distribution of heterozygous SNPs detected using Strelka2 annotated with gnomAD (v2.1.1) showed a singular peak at VAF 0.5 in all six cell lines, demonstrating the monoclonality of the materials (Fig. 2a). As positive controls were constructed by mixing independent cell lines, it was important to validate their diploid www.nature.com/scientificdata www.nature.com/scientificdata/ genotypes. Therefore, the overall regions of all six cell lines appeared to be copy number neutral, except the sex chromosomes and entire chromosome 5 of HBEC30-KT, as commonly observed 47 (Fig. 2b). The unique germline variants used for the positive control were selected from copy number neutral regions through CNV analysis (Methods).
Sequencing quality validation. We validated 45 WES data generated in this study, including the sequencing reads of 6 cell lines and 39 mixtures. We calculated the percentage of bases with phred-scaled base quality over 30, establishing an average value of 93.93% and a minimum of 91.82% among all data. The average GC content was 49.87%, with a maximum of 51.27%, thereby depicting a very low rate of bias during library preparation. FastQC and Qualimap were also applied to validate multiple quality of sequenced reads. Sequence quality of bases in read ends had steadily high base quality over 30. Data of both cell lines and mixtures showed high coverage, with more than 1,100× on average (Fig. 2c). We provided WES data with high coverage and quality for cell lines as well as set A to collect reliable germline variants and remove somatic variants with high VAF, which could serve as confounding factors when selected as positive controls. The mean mapping quality and base percentage with high-quality (BQ ≥ 30) of set A are shown in Fig. 2d. We also compared multiple features of reads from 39 set A and MRC5 data, which were merged when generating Set B. However, no significant differences were found, inferring that set B is less likely to have bias of two different sources (Fig. 2e).
Quality Validation of positive and negative control. First, to validate the quality of positive controls, we investigated the correlations between expected VAFs of the design and observed VAFs in set A. Both SNVs and INDELs in the entire range of VAFs had a high coefficient of Pearson correlation between expected VAFs and the median value of observed VAFs among all positions with the same expected VAF (r = 0.97, p < 2.2e-16 and r = 0.91, p < 2.2e-16, respectively, shown in log10 scale in Fig. 3a). In other words, secure collection of germline variants (utilized as positive controls) within high coverage data (1,100×) could eliminate the possible ambiguity in the reference data, which can be originated from sub-clonal mutations acquired during cell culture. Thereafter, we assessed the distribution of germline negative controls in set B. The distribution of heterozygous and homozygous SNPs and INDELs is shown in Fig. 3b. The length of INDELs in positive controls and germline negative controls demonstrated a similar distribution, indicating that they could be comparably adjusted to variant callers for performance evaluation (Fig. 3c). The count of INDELs displayed a resemblance between them and most had a length smaller than 5 base pairs. Finally, we identified the quantitative and qualitative aspects of non-variant negative controls in set A. The raw alternative alleles were counted using SAMtools mplileup. www.nature.com/scientificdata www.nature.com/scientificdata/ It was noteworthy that approximately one-third of the total target positions (10,202,428 in median of 39 reference data) were found to have more than one unexpected alternative allele in the non-variant positions (negative control of set A), in our ultra-high depth data (1,100×). In other words, abundant artifacts, unexpected alternative alleles produced during sequencing process, could have been generated owing to the advantage of multiple independent high coverage sequencing of the biological reference standards. Since detecting mosaic variants with low allele frequencies is extremely challenging, investigating those sites containing various read features would yield meaningful information for their accurate detection. For instance, in Fig. 3d, we demonstrated those sites within the chromosome 1 of the randomly selected sample (M2-5) with their base qualities and VAFs. They had a wide range of base quality, from 0 to 80, and artifacts were concentrated at VAF near 0.001, with a base quality of zero. However, a notable number of artifacts was found with high base quality, and the destructive effect of these artifacts is assumed to be greater in data with low-sequencing depth.

Usage Notes
Each pair of reference data, namely, set A and set B, can be applied to detection methods and the resultant variant calls and their properties can be assessed via a comparison to the list of positive and negative controls provided in GitHub 46 . Evaluation of the true positive calls as well as both types of false positives based on two-types of negative controls, artifacts from set A and germline variant from set B, is possible. We recommend exploiting abundant number of provided reference data for robust evaluation. Although remarkable amount of mosaic variants with varied VAFs (especially lower than 10%) could be provided by means of cell line mixture-based reference standards, each data contains variants in limited number of expected VAFs (e.g., M1-1 has mosaic variants in four expected allele frequencies, 1%, 2%, 4%, and 8%). Hence, data selection with unbiased VAF distribution for their application is essential. The variant compositions as well as allele frequencies of the complete set of samples are shown in Table 2.
The provided reference data can be utilized for versatile analyses for mosaicism detection. For example, down-sampling of ultra-deep WES data (1,100×) will unveil detection accuracy in the lower depth of interest, yielding the information of how the sequencing coverage affect the performance of given methods. Also, variants with diversified VAFs in the provided data would support to reveal the thresholds of sequencing coverage for detecting low VAF variants. Also, accuracy of shared and sample-specific mosaic variant detection can be assessed under varied inter-sample VAF relationships. The reference data provides chances to evaluate and develop new detection algorithms for shared and sample-specific variants. For instance, thirty-nine reference dataset provide chance to assess up to 741 combinations by selecting two samples. Likewise, shared variant analysis among more than three samples are possible in even larger number of cases. Confident set of controls supports robust evaluations, and consequently, the reference data provides valuable opportunities for analyzing various aspects that should be considered in mosaic variant calling.

code availability
The scripts used for constructing reference standards are available in a public repository GitHub 46 (https://github. com/Yonsei-TGIL/Mosaic-Reference-Standards.git) and are accompanied by markdowns for a step-by-step description.