The selective pressures that shape clonal evolution in healthy individuals are largely unknown. Here we investigate 8,342 mosaic chromosomal alterations, from 50 kb to 249 Mb long, that we uncovered in blood-derived DNA from 151,202 UK Biobank participants using phase-based computational techniques (estimated false discovery rate, 6–9%). We found six loci at which inherited variants associated strongly with the acquisition of deletions or loss of heterozygosity in cis. At three such loci (MPL, TM2D3–TARSL2, and FRA10B), we identified a likely causal variant that acted with high penetrance (5–50%). Inherited alleles at one locus appeared to affect the probability of somatic mutation, and at three other loci to be objects of positive or negative clonal selection. Several specific mosaic chromosomal alterations were strongly associated with future haematological malignancies. Our results reveal a multitude of paths towards clonal expansions with a wide range of effects on human health.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank Y. Jakubek for assistance with follow-up on del(10q) events8 and G. Bhatia, A. Gusev, M. Lipson, X. Liu, L. O’Connor, N. Patterson, and B. van de Geijn for discussions. This research was conducted using the UK Biobank Resource under Application #19808. A.L.P. was supported by NIH grants R01 HG006399, R01 GM105857, R01 MH101244, and R21 HG009513. P.-R.L. was supported by NIH fellowship F32 HG007805, a Burroughs Wellcome Fund Career Award at the Scientific Interfaces, and the Next Generation Fund at the Broad Institute of MIT and Harvard. G.G., R.E.H., and S.A.M. were supported by NIH grant R01 HG006855 and the Stanley Center for Psychiatric Research. H.K.F. was supported by the Fannie and John Hertz Foundation. Y.A.R. was supported by NIH award T32 GM007753, a National Defense Science and Engineering Graduate Fellowship, and the Paul and Daisy Soros Foundation. S.F.B. and G.G. were supported by US Department of Defense Breast Cancer Research Breakthrough Awards W81XWH-16-1-0315 and W81XWH-16-1-0316 (project BC151244). S.F.B. was supported by the Elsa U. Pardee Foundation and NCI MSKCC Cancer Center Core Grant P30 CA008748. M.E.T. was supported, in part, by NIH grants UM1 HG008900 and R01 HD081256. Computational analyses were performed on the Orchestra High Performance Compute Cluster at Harvard Medical School, which is partially supported by grant NCRR 1S10RR028832-01, and on the Genetic Cluster Computer (http://www.geneticcluster.org) hosted by SURFsara and financially supported by the Netherlands Scientific Organization (NWO 480-05-003 PI: Posthuma) along with a supplement from the Dutch Brain Foundation and the VU University Amsterdam. This work was supported by a grant from the Simons Foundation (SFARI Awards #346042 and #385027, M.E.T.). We are grateful to all of the families at the participating Simons Simplex Collection (SSC) sites, as well as the principal investigators (A. Beaudet, R. Bernier, J. Constantino, E. Cook, E. Fombonne, D. Geschwind, R. Goin-Kochel, E. Hanson, D. Grice, A. Klin, D. Ledbetter, C. Lord, C. Martin, D. Martin, R. Maxim, J. Miles, O. Ousley, K. Pelphrey, B. Peterson, J. Piggot, C. Saulnier, M. State, W. Stone, J. Sutcliffe, C. Walsh, Z. Warren and E. Wijsman). We appreciate access to genetic and phenotypic data on SFARI Base.
Reviewer information
Nature thanks S. Chanock, D. Conrad, I. Hall and the other anonymous reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
a–c, UK Biobank mCA sample 2791 has a mosaic deletion of chr13 from approximately 31–53 Mb that cannot be confidently called from unphased BAF and LRR data (a, c). However, the existence of an event is evident in the phased BAF data (b), and the regional decrease in LRR indicates that this event is a deletion. In b, mean phased BAF is plotted for SNPs aggregated into bins spanning n = 25 heterozygous sites; the same bins are used for c. Error bars, s.e.m. d–f, Sample 1645 has a mosaic CNN-LOH on chr9p from the 9p telomere to about 26 Mb that cannot be confidently called from unphased BAF data (d) but is evident in phased BAF data (e). A phase switch error causes a sign flip in phased BAF at approximately 20 Mb. The lack of a shift in LRR in the region (f) indicates that this event is a CNN-LOH. In e, mean phased BAF is plotted for SNPs aggregated into bins spanning n = 50 heterozygous sites; the same bins are used for f. Error bars, s.e.m. g–i, Sample 2464 has a full-chromosome mosaic event on chr12 that cannot be confidently called from unphased BAF and LRR data (g, i) but is evident in phased BAF data (h). Several phase switch errors cause sign flips in phased BAF across chr12. The slight positive shift in mean LRR (i) indicates that this event is most likely to be a mosaic gain of chr12. In h, mean phased BAF is plotted for SNPs aggregated into bins spanning n = 50 heterozygous sites; the same bins are used for i. Error bars, s.e.m.
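As an illustrative sketch only (not the authors' pipeline), the binning used in panels b, e and h can be written in a few lines of Python. The function names, the signed-deviation convention (BAF minus 0.5, multiplied by a +1/−1 phase label) and the toy data are assumptions made for this example; dropping a trailing partial bin is a further simplification.

```python
from math import sqrt
from statistics import mean, stdev

def phase_baf(baf, phase):
    """Signed BAF deviation per heterozygous site (assumed convention:
    (BAF - 0.5) * phase, with phase = +1 or -1 from statistical phasing)."""
    return [(b - 0.5) * s for b, s in zip(baf, phase)]

def bin_mean_sem(values, sites_per_bin=25):
    """Mean and s.e.m. over consecutive bins of `sites_per_bin` sites.
    A trailing partial bin is dropped (a simplification)."""
    out = []
    for start in range(0, len(values) - sites_per_bin + 1, sites_per_bin):
        chunk = values[start:start + sites_per_bin]
        out.append((mean(chunk), stdev(chunk) / sqrt(len(chunk))))
    return out

# Toy data: a small allelic imbalance that favours one haplotype at every site.
phase = [1, -1] * 25                       # 50 heterozygous sites
baf = [0.5 + 0.02 * s for s in phase]      # deviation direction follows the phase
unphased = bin_mean_sem([b - 0.5 for b in baf])
phased = bin_mean_sem(phase_baf(baf, phase))
```

In this toy case the unphased bin means stay close to zero because deviations of opposite sign cancel, whereas the phased bin means sit near 0.02 in every bin. That is the effect the panels illustrate.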
We generated age distributions for (i) ‘high-confidence’ detected events passing a permutation-based FDR threshold of 0.01 (bright red); (ii) ‘medium-confidence’ events below the FDR threshold of 0.01 but passing an FDR threshold of 0.05 (darker red); and (iii) ‘low-confidence’ events below the FDR threshold of 0.05 but passing an FDR threshold of 0.10 (darkest red; not analysed but plotted for context). We compared these distributions to the overall age distribution of UK Biobank participants (grey). On the basis of the numbers of events in each category, approximately 20% of medium-confidence detected events are expected to be false positives. To estimate our true FDR, we regressed the medium-confidence age distribution on the high-confidence and overall age distributions, reasoning that the medium-confidence age distribution should be a mixture of correctly called events with age distribution similar to that of the high-confidence events, and spurious calls with age distribution similar to the overall cohort. We observed a regression weight of 0.31 for the component corresponding to spurious calls, in good agreement with expectation, and implying a true FDR of 7.5% (6.2–8.8%, 95% CI based on regression fit on n = 6 age bins).
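The mixture regression described above can be sketched as an intercept-free least-squares fit with two components. The snippet below is a minimal illustration under stated assumptions: the function name is hypothetical, the six-bin age distributions are synthetic (not the study's data), and the medium-confidence distribution is constructed with a spurious weight of 0.31 purely so that the fit can be checked. The conversion from this weight to the overall FDR depends on event counts not given in the caption and is not reproduced here.

```python
def fit_two_component_mixture(target, comp_a, comp_b):
    """Least-squares weights (w_a, w_b) with target ~ w_a*comp_a + w_b*comp_b.
    Solves the 2x2 normal equations directly; no intercept term."""
    saa = sum(a * a for a in comp_a)
    sbb = sum(b * b for b in comp_b)
    sab = sum(a * b for a, b in zip(comp_a, comp_b))
    sat = sum(a * t for a, t in zip(comp_a, target))
    sbt = sum(b * t for b, t in zip(comp_b, target))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("components are collinear; weights are not identifiable")
    w_a = (sat * sbb - sab * sbt) / det
    w_b = (saa * sbt - sab * sat) / det
    return w_a, w_b

# Synthetic age distributions over six age bins (fractions sum to 1).
high_conf = [0.02, 0.05, 0.10, 0.18, 0.28, 0.37]   # skewed towards older ages
overall = [0.10, 0.13, 0.17, 0.18, 0.21, 0.21]     # cohort-wide distribution
# Synthetic medium-confidence distribution: 69% true calls, 31% spurious calls.
medium_conf = [0.69 * h + 0.31 * o for h, o in zip(high_conf, overall)]

w_true, w_spurious = fit_two_component_mixture(medium_conf, high_conf, overall)
```

With noise-free synthetic input the fit recovers the weights used to build the mixture. With real data the weights would carry uncertainty, which is why the caption reports a confidence interval from the regression fit.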
Extended Data Fig. 3 Clonal cell fractions of co-occurring events generally suggest co-existence within the same cell population.
For each pair of significantly co-occurring events (Fig. 2b), we compared the clonal fractions of the two events within each individual that carried both events. Each point in the plots corresponds to an individual carrying the pair of events under consideration; individuals are colour-coded by the total number of events they carry. For nearly all pairs of events, the clonal fractions of the two events were very similar in most individuals carrying both events, suggesting that the events occurred in the same clonal cell population. A few exceptions do seem to exist; for example, 22q– versus 13q CNN-LOH cell fraction; here, the cell fractions suggest that 13q CNN-LOH events may be present in a subclone. This observation is consistent with acquired uniparental disomy of 13q providing a second hit within a del(13q14) clonal expansion, as we see in Extended Data Fig. 8. (We did not include del(13q14) vs. 13q CNN-LOH in this plot because inference of clonal fractions is complex for these overlapping events; see Extended Data Fig. 8.)
Extended Data Fig. 4 Replication of previous association between JAK2 46/1 haplotype and 9p CNN-LOH in cis due to clonal selection.
The common JAK2 46/1 haplotype has previously been shown to confer risk of somatic JAK2 V617F mutation such that subsequent 9p CNN-LOH produces a strong proliferative advantage15,16,17,18,20 (right). In our analysis, CNN-LOH on 9p is strongly associated with JAK2 46/1 (P = 1.6 × 10^−13, OR = 2.7 (2.1–3.5); Fisher’s exact test on n = 120,664 individuals) with the risk haplotype predominantly duplicated by CNN-LOH in heterozygotes (52 of n = 61 heterozygous cases; binomial P = 1.8 × 10^−8). Left, the genomic modification is illustrated in the top panel and association signals are plotted in the bottom. The lead associated variant is labelled, and variants are coloured according to linkage disequilibrium with the lead variant (scaled for readability).
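The binomial P value quoted for the 52 of 61 heterozygous cases can be checked with a short exact calculation. The sketch below assumes a two-sided test against a null probability of 0.5 (either haplotype equally likely to be duplicated); the caption does not state the sidedness, and the function name is hypothetical.

```python
from math import comb

def binomial_two_sided_p(successes, trials):
    """Exact two-sided binomial P value under a null probability of 0.5.
    The null is symmetric, so the P value is twice the smaller tail (capped at 1)."""
    upper = sum(comb(trials, k) for k in range(successes, trials + 1))
    lower = sum(comb(trials, k) for k in range(0, successes + 1))
    return min(1.0, 2 * min(upper, lower) / 2 ** trials)

# 52 of 61 heterozygous cases duplicated the risk haplotype.
p_value = binomial_two_sided_p(52, 61)
```

Under these assumptions the result is approximately 1.8 × 10^−8, in line with the value reported in the caption.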
Extended Data Fig. 5 Evidence of multiple causal variants for 10q25.2 breakage and 1p CNN-LOH associations.
a, Multiple expanded repeats at FRA10B drive breakage at 10q25.2. We identified 12 distinct primary repeat motifs at FRA10B in 26 whole-genome-sequenced individuals from 14 families (labelled VNTR-N-x, where N denotes length in base pairs); carriers of these repeats exhibit varying degrees of FRA10B repeat expansion (Supplementary Note 8). The repeat motifs are AT-rich and are similar to FRA10B repeats previously reported35. The alignment provided here includes the repeat motifs that were most frequently observed in FRA10B expanded alleles35 (E8, E13, E17, and E19) along with a few other closely related expanded repeat motifs (E10, E11, and E12). b, Carriers of the 10q terminal deletion in the UK Biobank share long haplotypes at 10q25.2 identical-by-descent. Square nodes in the IBD graph correspond to males and circles to females. Node size is proportional to cell fraction and edge weight increases with IBD length. Coloured nodes indicate imputed carriers of variable number tandem repeats (VNTRs) at FRA10B (Supplementary Table 7); colour intensity scales with imputed dosage. c, Identity-by-descent graph at MPL locus (chr1:43.8 Mb) on individuals with mCAs on chr1 extending to the p telomere. Coloured nodes indicate imputed carriers of SNPs independently associated with mosaic 1p CNN-LOH (Fig. 4a).
a, Read depth profile plot of WGS samples in the terminal 700 kb of chr15q. Three individuals in one family carry an approximately 70-kb deletion at 15q26.3, and a fourth carries the same deletion along with an approximately 290-kb duplication (probably on the same haplotype, based on population frequencies of these events; see Extended Data Fig. 7). These four individuals (highlighted in blue) segregate with the rs182643535:T allele in the WGS cohort. Inset: the parental carrier in the family, individual 10921, has detectable mosaicism in two distinct 15q CNN-LOH subclones (one starting at 41.64 Mb with 4.6% cell fraction, the other starting at 71.64 Mb with an additional 2.0% cell fraction). b, Expanded read depth profile plot, with deletion-only individuals highlighted in blue and the del + dup individual highlighted in green. Breakpoint analysis indicates that the deletion spans chr15:102151467–102222161 and contains a 1,139-bp mid-segment (chr15:102164897–102166035) that is retained in inverted orientation. The duplication spans chr15:102026997–102314016.
Using identified breakpoints of the germline 70-kb deletion and 290-kb duplication (Extended Data Fig. 6), we computed mean genotyping intensity (LRR) in UK Biobank samples within the 70-kb deletion region (24 probes) and within the flanking 220-kb region (97 probes). Individuals are plotted by flanking 220-kb mean LRR versus 70-kb mean LRR and coloured according to mosaic status for somatic 15q mCAs. UK Biobank samples carrying the 70-kb deletion, 290-kb duplication, and both (del+dup) are all easily identifiable in distinct clusters. The plot also appears to contain clusters with higher copy number. Of the three CNV-carrying alleles, the simple 70-kb deletion is the only one that predisposes to mCAs. Most mosaic events containing the 70-kb deletion are CNN-LOH events that make cells homozygous for the 70-kb deletion; two individuals have somatic loss of the homologous (normal) chromosome, making cells hemizygous for the 70-kb deletion.
All of the plots exhibit step functions of increasing |ΔBAF| towards a telomere, which is the hallmark of multiple clonal cell populations containing distinct CNN-LOH events that affect different spans of a chromosomal arm (all extending to the telomere). Distinct |ΔBAF| values (called using an HMM) are indicated with different colours. Flips in the sign of phased BAF usually correspond to phase switch errors. Two samples exhibit high switch error rates: 14q individual 3067 (explained by non-European ancestry), and 1p individual 23 (explained by very high |ΔBAF|; extreme shifts in genotyping intensities result in poor genotyping quality). All five individuals with multiple CNN-LOH events on chr13q appear to contain switch errors over 13q14, but these switches are actually explained by overlapping 13q14 deletions; see Supplementary Note 1 for detailed discussion.
Extended Data Fig. 9 CLL prediction accuracy: receiver operating characteristic curves and precision-recall curves.
CLL prediction benchmarks using tenfold stratified cross validation on: only individuals with lymphocyte counts in the normal range (1 × 10^9/L to 3.5 × 10^9/L), as in our primary analyses (n = 36 cases, 113,923 controls) (a, b); and individuals with any lymphocyte count (n = 78 cases, 118,481 controls) (c, d). a matches Fig. 5b, and b shows the precision-recall curve from the same analysis. c and d correspond to an analogous analysis in which we removed the restriction on lymphocyte count and also used additional mosaic event variables for prediction (11q–, 14q–, 22q–, and total number of autosomal events). In both benchmarks, individuals with previous cancer diagnoses or CLL diagnoses within 1 year of assessment were excluded; however, some individuals with very high lymphocyte counts pass this filter (and probably already had CLL at assessment despite being undiagnosed for more than 1 year), hence the difference in apparent prediction accuracy between the two benchmarks.
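Stratification matters here because cases are very rare relative to controls. The sketch below shows one simple way to assign stratified folds; it is an assumption about how such a split could be made, not a description of the authors' code, and the function name, seed and toy cohort size are hypothetical.

```python
import random
from collections import Counter

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample a fold index so that every class is spread
    as evenly as possible across folds (round-robin after shuffling)."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for rank, i in enumerate(members):
            folds[i] = rank % n_folds
    return folds

# Toy cohort: 36 cases (label 1) among many controls (label 0).
labels = [1] * 36 + [0] * 5000
folds = stratified_folds(labels)
cases_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
```

With 36 cases and ten folds, each fold receives three or four cases, so every held-out fold contains some positives. An unstratified random split of so few cases could leave some folds with none.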
Extended Data Fig. 10 Mosaic chromosomal alterations detected in CLL cases sorted by lymphocyte count.
Individuals are stratified by cancer status at DNA collection (no previous diagnosis versus any previous diagnosis), and mCAs (red, loss; green, CNN-LOH; blue, gain; grey, undetermined) are plotted per chromosome as coloured rectangles (with height increasing with BAF deviation).
This file contains Supplementary Notes 1-9, Supplementary References and Supplementary Tables 1-16.
This Excel file contains spreadsheets with individual-level mosaic event calls from this manuscript and previous studies of clonal hematopoiesis.
This zip archive contains BED-format UCSC Genome Browser tracks for event calls from this manuscript and previous studies of clonal hematopoiesis. The archive also contains a readme document describing how the files were generated.