Date: 2023-08-17

Workflow of Monopogen

Monopogen includes germline and putative somatic SNV calling from single-cell sequencing data. It starts from individual BAM files of single-cell sequencing data, produced by scRNA-seq, snRNA-seq, single-nucleus assay for transposase-accessible chromatin using sequencing (snATAC-seq), single-cell DNA-seq, etc. (Fig. 1a). Monopogen leverages LD patterns at the human population level to enhance germline SNV detection and LD patterns at the cell population level to enhance putative somatic SNV detection. Sequencing reads with high alignment mismatches (default threshold of four mismatches) are removed. Putative SNVs are detected from pooled (across cells) read alignments wherever an alternative allele is found in at least one read. For SNVs that are present in an external haplotype reference panel, such as the 1000 Genomes phase 3 (1KG3) panel, the input genotype likelihoods (GL) estimated by Samtools are further refined by leveraging LD from the reference panel to account for genotyping uncertainty in sparse sequencing data. The loci showing persistent discordance after LD refinement are used to estimate a sequencing error model for de novo SNV calling (Fig. 1b). For the remaining loci satisfying minimal total sequencing depth and alternative allele frequency cutoffs, a support vector machine (SVM) module is designed to distinguish SNVs from sequencing errors (Fig. 1c and Supplementary Fig. 1a, step 2). Briefly, the SVM module uses a series of variant calling metrics as features. The germline SNVs are set as the positive set, and consecutive de novo SNV chunks (>2 SNVs) are set as the negative set.

We extend the machinery of LD refinement from the human population level to the cell population level to detect somatic SNVs that are only present in subpopulations of cells. Briefly, for de novo SNVs passing the SVM filtering, we statistically phase the observed alleles with adjacent germline alleles to estimate the degree of LD in the cell population (Fig. 1d and Supplementary Fig. 1a, steps 3–4; Methods). We assume that only two alleles are present in the cell population and examine only gain-of-heterozygosity SNVs. We calculate a probabilistic LD refinement score that quantifies the degree of LD while accounting for the widespread sparseness and allelic dropout in single-cell sequencing data (Methods). The LD refinement score ranges from 0 to 0.5. It is close to 0 for a germline SNV because such an SNV is in strong LD with the adjacent germline SNVs, that is, it shares the same two haplotypes in all the cells (Supplementary Fig. 1b). The score is greater than 0 for a somatic SNV because the recently gained somatic allele cosegregates with germline alleles in only a subpopulation of cells (Fig. 1d, Supplementary Fig. 1a, step 4, and Supplementary Fig. 1b). SNVs with larger LD refinement scores are classified as putative somatic SNVs. Their genotypes at the single-cell or cluster level are further inferred using Monovar (Supplementary Fig. 1a, step 5)18. The germline SNVs from Fig. 1b can be used for global or local ancestry inference (Fig. 1e) or cellular quantitative trait mapping when the sample size is sufficient (Fig. 1f), and the putative somatic SNVs can be used for lineage tracing at cellular or clonal resolution (Fig. 1g).

Monopogen is implemented in Python. It automatically splits the genome into small chunks (of a user-defined size), performs the variant scan and LD refinement in parallel on the individual chunks and merges the results (Supplementary Note).
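The split, process and merge pattern described above can be illustrated with a minimal sketch. It is not Monopogen's actual code: the function names, the chunk representation and the placeholder `process_chunk` step are all assumptions made for illustration, and a real pipeline would run the variant scan and LD refinement where the placeholder sits.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(chrom_lengths, chunk_size):
    """Return (chrom, start, end) regions, 1-based and inclusive.

    chrom_lengths: dict mapping chromosome name -> length in bp.
    chunk_size: user-defined chunk length in bp (hypothetical parameter).
    """
    chunks = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            chunks.append((chrom, start, end))
    return chunks


def process_chunk(chunk):
    """Placeholder for the per-chunk work (variant scan, LD refinement)."""
    chrom, start, end = chunk
    return {"region": f"{chrom}:{start}-{end}", "n_bases": end - start + 1}


def run_all(chrom_lengths, chunk_size, workers=4):
    """Process every chunk in parallel and merge results in genome order."""
    chunks = split_into_chunks(chrom_lengths, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so merging is a simple list build.
        return list(pool.map(process_chunk, chunks))
```

A thread pool keeps the sketch short; CPU-bound work would more likely use separate processes or a cluster scheduler, with each chunk writing its own output file before the merge.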

Benchmarking of Monopogen performance on germline SNV calling

We used three single-cell sequencing datasets with matched WGS data (snRNA-seq from four retina tissue samples, sci-ATAC-seq from two colon tissue samples and scDNA-seq from one triple-negative breast cancer (TNBC) sample) to evaluate SNV calling performance. In all these samples, the overall accuracy (Methods) of the Monopogen calls was higher than 95% for the germline SNVs present in the 1KG3 panel, and higher than 97% for five of the seven samples (Fig. 2a and Supplementary Table 1). The high accuracy is largely due to the LD-based genotyping refinement. Bulk-based SNV callers that lack LD-based refinement, such as Samtools, GATK, FreeBayes and Strelka2, had an overall accuracy of less than 73% on snRNA-seq and sci-ATAC-seq (Supplementary Table 2). Further examination shows that over 85% of the genotyping errors from Monopogen misclassified 0/1 as 1/1 (Supplementary Table 1), due partly to allelic dropout artifacts in the single-cell data.
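To make the evaluation metrics concrete, the sketch below compares a call set against a matched gold standard. The exact definitions used in the paper are given in its Methods and are not reproduced here, so the definitions in this example are assumptions: recall is taken as the fraction of gold-standard variant sites that were called, and accuracy as the fraction of called sites whose genotype matches the gold standard. The function name and the dictionary layout are hypothetical.

```python
def evaluate_calls(calls, truth):
    """Compare called genotypes with a gold standard (assumed definitions).

    calls, truth: dict mapping (chrom, pos) -> genotype string such as
    '0/0', '0/1' or '1/1'.
    """
    truth_variants = {site for site, gt in truth.items() if gt != "0/0"}
    detected = truth_variants & set(calls)
    recall = len(detected) / len(truth_variants) if truth_variants else float("nan")

    # Accuracy is computed only over sites present in both sets.
    shared = [site for site in calls if site in truth]
    matches = sum(calls[site] == truth[site] for site in shared)
    accuracy = matches / len(shared) if shared else float("nan")

    # Share of genotyping errors that call a true heterozygote as 1/1,
    # the error type attributed to allelic dropout in the text.
    errors = [site for site in shared if calls[site] != truth[site]]
    het_as_hom = sum(truth[s] == "0/1" and calls[s] == "1/1" for s in errors)
    het_as_hom_fraction = het_as_hom / len(errors) if errors else float("nan")

    return {
        "recall": recall,
        "accuracy": accuracy,
        "het_as_hom_fraction": het_as_hom_fraction,
    }
```

Tracking the heterozygote-to-homozygote error share separately is useful because it points to dropout of one allele rather than to random sequencing error.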

Fig. 2: Benchmarking of Monopogen performance in various single-cell sequencing platforms. a, Overall accuracy and SNV detection sensitivity (recall) in representative snRNA-seq (n = 4), sci-ATAC-seq data (n = 2) and scDNA-seq data (n = 1) using matched WGS data as the gold standard, comparing Monopogen against Samtools, GATK, FreeBayes, Strelka2, cellSNP and scAllele. The x axis denotes the overall accuracy and the y axis denotes the detection sensitivity (recall). The closer a dot is to the top-right corner, the better the corresponding method has performed. Note that in a, only the SNVs present in the 1KG3 panel were considered for the Monopogen evaluation. b,c, Median sequencing depth of SNVs found from snRNA-seq data (b) and sci-ATAC-seq data (c) over gene annotations. The pie charts show the percentage of SNVs in each category. d, Number of SNVs versus the number of cells in the retina data via downsampling. The x and y axes are on a logarithmic scale. Pearson's correlations were applied to calculate the R and the P values. e, Overall accuracy versus cell number. f, Number of SNVs detected from seven single-cell sequencing datasets. The sequencing coverage was calculated as \(L \times n / (3.2 \times 10^{9})\), where L is the read length and n is the total number of reads in one sample. Each small dot corresponds to a sample, while each big dot is the mean value of a dataset. All the dots are colored by dataset. The top ellipse covers samples from scATAC-seq data and the bottom ellipse samples from scRNA-seq data.

In the retina snRNA-seq data, Monopogen detected 827–905 K germline SNVs, achieving a recall of 21% (Fig. 2a and Supplementary Table 1). GATK, Samtools and FreeBayes achieved a recall of 11–20% at the expense of lower accuracy (<73%). Although Strelka2 detected ~25% of the SNVs, its accuracy was lower than 25%. Most (70.4%) SNVs from Monopogen were detected in intronic regions, and fewer than 7% in exonic regions (Fig. 2b). As expected, sequencing depth was higher in genes than in intergenic regions. Off-target reads thus appear to be leveraged sufficiently to derive accurate genotypes through LD-based refinement.

In addition, Monopogen detected ~100 K new SNVs in the retina snRNA-seq data that are not present in the 1KG3 panel, after performing sequencing depth filtering (>100) and sequencing error model calibration. The overall accuracy of this set is 35%, rising to 86% for the subset detected in more than 90% of the transcriptomic clusters determined by Seurat19 (Supplementary Table 3).

In the colon sci-ATAC-seq data, Monopogen detected 752 K to 1.1 M germline SNVs, achieving a recall of 25%. In contrast, the recall for Samtools, GATK and FreeBayes was less than 12%. Strelka2 detected ~30% of the SNVs with an accuracy lower than 40%. Most (57.4%) of the SNVs from Monopogen were found in intergenic regions and 38.6% in gene regions (Fig. 2c). We also included two SNV callers, cellSNP and scAllele, that were designed for single-cell sequencing data. cellSNP had the lowest SNV detection rate (<5%), and scAllele had the lowest accuracy (<10%) across the three benchmarking datasets.

Given that single-cell sequencing is highly sparse, sequencing coverage is one of the key factors affecting SNV detection (Supplementary Fig. 2a–c). We evaluated Monopogen's performance on downsampled retina snRNA-seq data containing random subsets of 200–20,000 cells (~29.4 K reads per cell; Supplementary Table 1). We observed a linear relationship between the number of SNVs and the number of cells on a logarithmic scale (Fig. 2d; Pearson correlation coefficient is 0.9). Monopogen detected ~100 K SNVs from only 200 cells and 500 K SNVs from 1,000 cells (Fig. 2d). The overall accuracy of Monopogen remained robust to cell number under downsampling and was always higher than 94% (Fig. 2e). Downsampling the sequencing coverage showed a pattern similar to downsampling the cells (Supplementary Fig. 2d,e). The performance of Monopogen was also robust to sequencing depth and errors. The overall accuracy decreased only slightly when sequencing error rates were less than 2.5%. Even at an exceedingly high sequencing error rate of 5%, Monopogen still achieved ~85% genotyping accuracy (Supplementary Fig. 2e), demonstrating the effectiveness of LD-based genotyping refinement in challenging scenarios.
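As a worked illustration of the log-log relationship, the slope between the two operating points quoted above (~100 K SNVs at 200 cells and 500 K SNVs at 1,000 cells) can be computed directly. This is only a two-point sketch using the rounded figures from the text, not a refit of the data in Fig. 2d, and the function name is hypothetical.

```python
import math


def loglog_slope(cells_a, snvs_a, cells_b, snvs_b):
    """Slope of log10(SNV count) against log10(cell count) between two points."""
    rise = math.log10(snvs_b) - math.log10(snvs_a)
    run = math.log10(cells_b) - math.log10(cells_a)
    return rise / run


# Rounded values quoted in the text: a fivefold increase in cells is
# accompanied by a fivefold increase in detected SNVs.
slope = loglog_slope(200, 100_000, 1_000, 500_000)
```

A slope near 1 between these two points means that, in this range, the number of detected SNVs grows roughly in proportion to the number of cells. The text does not state whether the same slope holds across the full 200–20,000 cell range.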

We further evaluated Monopogen in four other cohorts: the human breast cell atlas (HBCA; 20 donor samples), peripheral blood mononuclear cells from the Asian Immune Diversity Atlas (AIDA; 20 donor samples), the genotype-tissue expression project (GTEx; seven donor samples) and the human heart left ventricle atlas (65 samples). These datasets vary in cell number, number of reads per cell and read length (Supplementary Table 4). To make a fair comparison across datasets, we investigated the relationship between sequencing coverage and number of SNVs. As expected, Monopogen detected more SNVs from single-cell epigenomics sequencing data than from single-cell transcriptomics sequencing data (Fig. 2f). Although these samples do not have matched WGS profiles, 54 human left ventricle samples have paired scRNA-seq and scATAC-seq data. The genotyping concordance between the two modalities was as high as 97% (Supplementary Table 5 and Supplementary Fig. 5a), further demonstrating the robustness of Monopogen SNV calling on various sequencing platforms.
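The sequencing coverage used for this cross-dataset comparison is defined in the Fig. 2 caption as the read length L times the total number of reads n, divided by a genome size of 3.2 × 10^9 bp. A minimal sketch of that calculation follows; the function name is hypothetical and the example inputs are illustrative values, not figures from any of the cohorts.

```python
HUMAN_GENOME_SIZE_BP = 3.2e9  # genome size used in the Fig. 2 caption


def sequencing_coverage(read_length, n_reads, genome_size=HUMAN_GENOME_SIZE_BP):
    """Average coverage: read length times read count over genome size."""
    return read_length * n_reads / genome_size


# Illustrative inputs only: 100 bp reads and 320 million reads per sample.
example = sequencing_coverage(read_length=100, n_reads=320_000_000)
```

Expressing each sample as a single coverage value puts datasets with different cell numbers, reads per cell and read lengths on a common axis.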

Accurate global and local ancestry inference from single-cell sequencing data

We performed genetic ancestry inference using genotypes called by Monopogen. We projected the Monopogen-called snRNA-seq genotypes and the matched WGS genotypes of the four retina samples separately onto a PC map consisting of source samples from East Asia, America, the Middle East, Europe, Oceania, Africa and Central/South Asia in the Human Genome Diversity Project (HGDP)20. We found that the PC coordinates were highly consistent between the WGS genotypes and the single-cell genotypes called by Monopogen (Fig. 3a,b). The mapping results were consistent with self-reported ethnicities for all the samples, including three Europeans and a self-reported Hispanic sample. We further performed local ancestry inference using RFMix21. For all the samples, the chromosomal painting results based on single-cell data (Fig. 3c–f and Supplementary Fig. 3) appeared highly consistent with self-reported ethnicities and with those obtained from the WGS data. For example, the source consistency across genomic bins was as high as 0.96 for one of the European samples (19D013; Fig. 3g) and 0.90 for the Hispanic sample (19D015; Fig. 3h). We did observe some genomic bins showing discrepant sources, due largely to the sparseness of single-cell-derived SNVs in those regions. The global ancestry inference results remained largely unchanged when downsampling the data to only 200 cells (~29.4 K reads per cell; Supplementary Fig. 4).

Fig. 3: Global and local ancestry inference using single-cell genotypes derived by Monopogen. a,b, Genetic ancestry of the four retina samples using Monopogen genotypes derived from snRNA-seq data (a) and genotypes from matched WGS data (b). Colored dots represent individuals in the HGDP reference panel, and black crosses represent the retina samples. The variance explained by PC1 and PC2 from the HGDP panel is labeled. c,d, Local ancestry inference of a European sample 19D013 using genotypes from the snRNA-seq (c) and the WGS (d) data. The 3,202 phased genotypes from 1KG3 were used as the reference for local ancestry inference. Colors in each chromosome denote the inferred source ancestry with a bin size of 1 centimorgan (cM). e,f, Local ancestry results from an admixed sample 19D015. g,h, Local ancestry inference accuracy for 19D013 (g, overall score: 0.96) and 19D015 (h, overall score: 0.90). Each dot denotes the ancestry accuracy for each segment (1 cM). i, PCA-projection analysis shows the ancestry of samples in the AIDA and the HBCA cohorts. j, UMAP of Korean and Japanese samples in the AIDA using genotypes called by Monopogen. The UMAP was constructed based on the top five PCs of Korean and Japanese genotypes (on 584,164 SNVs). k, Concordance between Illumina GSAv3 genotyping array data and Monopogen calls across the AIDA samples. Darker colors denote a higher level of concordance between the two data modalities. Calculation of the concordance scores is detailed in Methods.

We also performed projection analysis on another 40 samples in the HBCA and the AIDA cohorts that do not have matched WGS data. Again, the global ancestry inferred from single-cell sequencing was consistent with self-reported ethnicities except for one putative admixed sample in the AIDA cohort (Fig. 3i). In the AIDA cohort, it is difficult to separate Japanese and Korean samples by PCA-projecting them onto the HGDP panel. However, these two populations can be well separated by performing independent UMAP analysis using Monopogen-derived genotypes (Fig. 3j). Furthermore, Monopogen shows consistent performance in identifying donor-specific SNVs in the AIDA samples, based on the concordance of Monopogen-derived genotypes and Illumina GSAv3 genotypes (Fig. 3k), demonstrating the possibility of distinguishing individuals from the same ancestry. This indicates that the LD-based genotyping refinement from the commonly used 1KG3 panel did not over-correct genotypes on subpopulation or individual levels, despite sparse sequencing coverage.

Genome-wide association study of cellular quantitative traits

To demonstrate the use of Monopogen in establishing the link between genetic variants and cellular quantitative traits in a cell-type-specific or cell-state-specific manner, we characterized the genetic contribution to metabolic processes (such as ATP production) and epigenetic programs in healthy cardiomyocytes. These relationships are usually obscured in bulk-based data analysis.

As a demonstration, we collected snRNA-seq and snATAC-seq data of ~4 M cells generated from human heart left ventricle tissue samples of 65 donors, 54 of which have data from both modalities. Around 791 K SNVs in snRNA-seq and 2.59 M SNVs in snATAC-seq were identified by Monopogen (Supplementary Table 5 and Supplementary Fig. 5a). The variant calling consistency between the two modalities was as high as 97% at overlapping loci (Supplementary Fig. 5b,c). Variant calls were further merged for samples with paired modalities.

Ancestry admixture analysis using inferred genotypes shows that this cohort contains samples with diverse ancestry, which are as follows: European (71.1%), Asian (10.2%) and African (8.5%). Six samples appeared admixed (Supplementary Fig. 6a).

To explore the cardiac metabolism process, we extracted cardiomyocyte cells from each sample by annotating cells using the human heart Azimuth database (Fig. 4b and Supplementary Fig. 6b). Using pathway expression level as a proxy for ATP metabolism level, we derived the cardiac ATP metabolism level by aggregating the expression levels of 216 genes in the GO_ATP_METABOLIC pathway (Methods). We performed association analysis using the GCTA tool22, including the top five ancestry PCs as covariates. A P value threshold of \(10^{-5}\) was used to identify potential associations because of the small sample size. The inflation factor of the Quantile–Quantile plot was close to 1 (0.983; Supplementary Fig. 7a). A total of 250 variants were associated with the cardiac ATP metabolism score (\(P < 10^{-5}\)), which can be further binned into 42 gene regions (Supplementary Table 6), including five genes (each supported by at least two variants) with \(P < 10^{-6}\) (Fig. 4c). Among genes in the regions, IGFBP3 and FBXL22 are well known to affect adult cardiac progenitor cells23 or cardiac contractile function24. ADO functions as an oxygen sensor involved in N-degron pathways25. These associations further confirm the tight coupling of ATP production and myocardial contraction, which is essential for normal cardiac function26. AGAP1, indicated by its tag SNV (rs6714660; Fig. 4d), is involved in cardiac ATP production in the Krebs cycle27.
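The inflation factor reported above is commonly defined as the median of the association chi-square statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.455). The sketch below uses that standard definition; whether the paper computed it in exactly this way is an assumption, and the function names and variant identifiers in the usage note are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a 1-df chi-square: square of the standard normal 75th percentile.
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def inflation_factor(p_values):
    """Genomic inflation factor from two-sided P values in (0, 1]."""
    # A two-sided P value p corresponds to |z| = -inv_cdf(p / 2); squaring z
    # gives the 1-df chi-square statistic. Using the lower tail avoids
    # rounding 1 - p / 2 to exactly 1 for very small p.
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / _CHI2_1DF_MEDIAN


def variants_below_threshold(p_by_variant, threshold=1e-5):
    """Return variant IDs with P below the suggestive threshold, sorted."""
    return sorted(v for v, p in p_by_variant.items() if p < threshold)
```

For example, `variants_below_threshold({"var_a": 3e-6, "var_b": 2e-5})` keeps only `var_a`. A value of the inflation factor near 1, as reported here, suggests that the test statistics are not systematically inflated by confounders such as population structure.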

Fig. 4: Genetic association study of cardiomyocyte molecular traits using snRNA-seq and snATAC-seq data from heart left ventricle tissues. a, Analysis workflow. Details can be seen in Methods. b, A UMAP of snRNA-seq cells colored by cell types annotated using Azimuth heart database. c, Manhattan plot showing association of Monopogen SNVs with pathway scores of ATP_METABOLIC in cardiomyocytes. The gray line denotes the P value threshold of \(10^{-6}\). Genes closest to the top-scoring loci are labeled. d, Boxplot shows the difference of ATP_METABOLIC scores across the three genotypes of rs6714660 (one of the leading variants in AGAP1). e, Manhattan plot showing the association of SNVs with the GATA4 motif-based transcription factor activity level in cardiomyocytes. The gray line denotes the P value threshold of \(10^{-6}\). f, Boxplot shows the difference in GATA4 activity level across the three genotypes of rs17745507 (one of the leading variants in ADAM12). For each box in d,f, the centerline defines the median, the height of the box is given by the interquartile range (IQR) and the whiskers are given by 1.5× IQR. All samples (n = 54) are given as points.

We also derived transcription factor (TF) activity scores from the snATAC-seq data (Methods). We then scanned for genetic variants associated with the activity level of GATA4, one of the most important TFs highly activated in cardiomyocytes at various developmental stages. The inflation factor of the Quantile–Quantile plot was close to 1 (0.984; Supplementary Fig. 7b). A total of 257 variants were identified (\(P < 10^{-5}\)), which can be further binned into 42 gene regions (Supplementary Table 7), six of which (each supported by at least two variants) had \(P < 10^{-6}\) (Fig. 4e). Among the genes in the regions, TBX5–GATA4 and RUNX1–GATA2 complexes are well known for their interdependence in coordinating cardiogenesis28,29,30,31. ADAM12, indicated by its tag SNV (rs17745507; Fig. 4f), is known to have a key role in cardiac hypertrophy by blocking the shedding of heparin-binding epidermal growth factor32. These results indicate a potential association between GATA4 and cardiac hypertrophy through the mediation of ADAM12. We also identified some variants (\(P < 10^{-5}\)) located in zinc-finger family genes, such as ZNF595 and ZNF750, that act as cofactors with the zinc-finger TF GATA4 (Supplementary Table 7).

In summary, we were able to reveal potential genetic determinants of cardiac health via metabolic and epigenomic trait mapping of cardiomyocytes, despite the relatively small sample size. Associations identified in this fashion may lead to a better understanding of the pathogenicity of noncoding variants in a cell-type-aware manner.

Putative somatic SNV detection on single-cell sequencing

To evaluate the somatic SNV detection module of Monopogen, we examined 1,534 cells from a sample of one patient with TNBC sequenced using a single-cell DNA-seq platform33. From the matched normal and tumor bulk WGS data of around 87× coverage each, we identified a total of ~3.5 M germline SNVs and 19,766 somatic SNVs (Methods). We classified new SNVs detected by Monopogen into the following three categories: somatic, germline and unknown in the bulk sample (Methods).

To conduct effective somatic SNV detection, we first examined the rationale for applying two-locus and three-locus LD refinement models (Methods) using germline SNVs that had phased genotypes at the cell population level. The two-locus model showed low LD refinement scores (<0.01) when the distance between two adjacent loci was less than 100 bp, which indicates physical phasing within the length of the reads. Genotype correlation between two adjacent loci decreased substantially when the distance increased beyond 100 bp. Unlike the two-locus model, the three-locus model showed a gradual increase of the LD refinement score with increased haplotype length. Over 70% of alleles cosegregated when the haplotype length was less than 5 kb, providing rich information for phasing germline SNVs that do not exist in the 1KG3 panel. This pattern was consistent across all the chromosomes (Supplementary Figs. 8–10).

Initially, Monopogen identified 45,668 de novo SNVs, among which only 9.5% were classified as somatic, 56.0% as germline and the remainder as unknown. This highlights the challenge of detecting somatic SNVs from pooled single-cell profiles without using external information. The SVM module reduced the number of unknown SNVs by 90%, while keeping 67.3% of the somatic SNVs and 63.8% of the germline SNVs (Fig. 5a), demonstrating the efficacy of the SVM module in distinguishing SNVs from sequencing errors. This could also be confirmed by examining the feature distribution difference between the positive and the negative labels (Supplementary Fig. 8).

Fig. 5: Somatic SNV detection in single-cell sequencing. a,b, LD refinement scores on germline SNVs from the TNBC single-cell DNA data. It is shown with two-locus model in a and three-locus model in b. c, Evaluation of de novo SNVs from Monopogen by comparison with categories defined in matched bulk DNA sample (Methods). d, Distribution of LD refinement scores for de novo SNVs that are classified as germline and somatic SNVs from the bulk sample. e,f, Boxplot displaying the relationship between LD refinement score and BAF, with SNVs classified as somatic (e, n = 339) and germline SNVs (f, n = 2,425). The centerline defines the median, the height of the box is given by the interquartile range (IQR), the whiskers are given by 1.5× IQR and outliers are given as points beyond the minimum or maximum whisker. g,h, LD refinement scores on germline SNVs from the bone-marrow sample measured in single-cell RNA data. It is shown with two-locus model in g and three-locus model in h. In a, b, g and h, the length of haplotypes is grouped into 13 bins (Methods). The x axis is in logarithmic scale. The y axis shows the mean value of LD refinement score within each bin together with the 95% confidence interval. The total number of haplotypes used for evaluation is labeled at the right-bottom of each panel. i, Number of SNVs detected in each step from Monopogen. j, Heatmap displaying the detected percentage of putative somatic SNVs in each mtDNA clone (the sum of each row is 1). k, UMAPs displaying the cell types annotated in myeloid and erythroid lineages. l,m, UMAPs displaying the mutated cell distribution for mtDNA variant 2593G:A (l) and three selected putative somatic SNVs from scRNA-seq (m). n, Heatmap displaying the detected percentage of putative somatic SNVs in each TRB clone. o–s, UMAPs displaying the cell types annotated in T/NK cell lineages (o), the mutated cell distribution for TRB region CASAPNFGQELTYEQYF (p) and the putative somatic SNV chr20:2904623A:G (q), the mutated cell distribution for TRB region CASSQAGAANTEAFF (r) and the somatic SNV chr1:91689518A:G (s).

The LD refinement module further removed 91% of the germline SNVs, leading to a total of 1,847 somatic SNVs and 1,447 germline SNVs that are validated by bulk WGS, in addition to 2,234 unknowns in the final de novo SNV call set (Fig. 5c). As expected, the LD refinement score distribution for germline SNVs was skewed toward 0 (Fig. 5d). A fraction of somatic SNVs also showed scores close to 0, partly due to the confounding B-allele frequency (BAF) effect (Fig. 5e). Somatic and germline SNVs become inseparable when the BAF is close to 0.5. Among the putative somatic SNVs detected (Supplementary Table 8), there were 11 known oncogenes and 12 tumor suppressors. The unknown SNVs from Monopogen may contain low-abundance somatic SNVs that were missed by matched bulk sequencing.

We next evaluated the somatic SNV detection module on 9,346 cells obtained from a bone-marrow sample with clonal hematopoiesis34. The cells were profiled using 10x single-cell sequencing combined with mitochondrial transcriptome enrichment (that is, MAESTER technology), leading to joint profiling of gene expression and mtDNA mutations from the same cells. As before, we first examined the rationale for the two-locus and three-locus LD refinement models using the scRNA-seq profiles (Fig. 5g,h and Supplementary Figs. 12 and 13). Unlike in the single-cell DNA-seq data, the two-locus score remained low even when the distance between two adjacent loci was longer than 10 kb, which can be explained by allelic imprinting (or allelic expression) in the transcriptomes. The three-locus LD refinement score showed a similar gradual increase with increased distance, with around 90% of alleles cosegregating when the haplotype length is 10 kb. The germline LD refinement patterns examined in both single-cell RNA and single-cell DNA data demonstrated the feasibility of capturing both short-distance (within physical reads) and long-distance molecular linkage in single-cell populations even under sparse short-read sequencing. Similarly, feature distributions differed between the positive and the negative labels (Supplementary Fig. 11), enabling SVM classification.

Joint profiling of mtDNA and transcriptomics provided an opportunity to validate the somatic SNVs via comparison with the clonal architecture inferred orthogonally from mtDNA variants. We focused on 1,049 cells with both putative somatic SNVs and mtDNA variants detected. There were 391 putative somatic SNVs detected in at least two cells, and 69.6% (272/391) of them were significantly (P < 0.01, Wilcoxon test) enriched in at least one mtDNA clone (Fig. 5j), with around 12 somatic SNVs in each mtDNA clone. The average cellular concordance between the matched somatic SNV clones and mtDNA clones was 0.63 (Methods). These somatic SNVs allowed finer delineation of the clonal architecture. For example, the most variable mtDNA variant 2593G>A was observed in most of the cell types in both myeloid and erythroid lineages (Fig. 5k,l and Supplementary Fig. 14). However, somatic SNVs such as chr3:196,047,84A:T appeared predominantly in the erythroid lineage, while chr9:90891459T:C and chr6:32,587,781A:T appeared predominantly in the myeloid lineage (Fig. 5m and Supplementary Fig. 14).

Joint profiling of the T-cell antigen receptor (TCR) variable region and transcriptomics also provided an opportunity to validate somatic SNVs. We noted that 60.3% (284/471) of somatic SNVs were enriched in TRB regions and 52.7% (126/239) in TRA regions (Fig. 5n, Supplementary Fig. 15c), with average cellular concordance of 0.55 and 0.54 for the TRB and the TRA regions, respectively. In T cells and cytotoxic T lymphocytes, some somatic SNVs are localized in subregions of a cell type in the transcriptomic UMAPs (Fig. 5q,c). For example, the chr20:2904623A:G clone was detected at the bottom of the cytotoxic T-cell cluster (a pattern similar to that of TRB clone CASAPNFGQELTYEQYF in Fig. 5p and Supplementary Fig. 15b). Some mutations (for example, chr1:91689518A:G) spanned all the T cells (a pattern similar to that of TRB clone CASSQAGAANTEAFF in Fig. 5r and Supplementary Fig. 15b), indicating that these putative somatic SNVs may represent multiple T-cell clonotypes that have arisen from multipotent hematopoietic stem cells.