## Main

Replication of the genome is required at each cell division. It is effected by DNA polymerases synthesizing a new DNA strand with a sequence dictated by a template strand. Low error rates are ensured by the fidelity of base incorporation, proofreading capabilities of the polymerases and surveillance by the DNA mismatch repair machinery. DNA replication in humans is largely performed by the polymerases Pol ε and Pol δ, which undertake leading and lagging strand synthesis, respectively1,2.

Uniquely among nuclear polymerases, both Pol ε and Pol δ have proofreading activities mediated by their exonuclease domains, which identify and remove mismatched bases1,3,4,5. Somatically acquired heterozygous missense mutations in the POLE or POLD1 exonuclease domains found in some human cancers cause defective proofreading and, consequently, high burdens of somatic mutations with distinctive mutational signatures6,7,8,9. Cancers with POLE exonuclease domain mutations show very high single-base substitution (SBS) mutation burdens whereas those with POLD1 exonuclease domain mutations show less elevated SBS burdens but are often associated with microsatellite instability8. Mutations generated by proofreading-defective POLE and POLD1 show marked replication strand bias consistent with their differential roles in leading and lagging strand synthesis1,2,8. Polymerases with these mutations also cause mutator phenotypes when engineered into yeast and mice10,11,12,13,14,15.

POLE and POLD1 exonuclease domain mutations can also be inherited through the germline, causing a rare autosomal dominant familial cancer predisposition syndrome known as polymerase proofreading-associated polyposis (PPAP), characterized primarily by early-onset colorectal and endometrial tumors16,17,18. It is plausible that an increased somatic mutation rate underlies this cancer predisposition, and high somatic mutation loads have been reported in the small number of neoplasms analyzed from such individuals16. However, whether the mutation rate is elevated in normal cells, or in neoplastic cells only, is not known. If elevated in normal cells, the magnitude of the increase, whether it is raised over the whole lifespan, the range of tissues and fraction of cells in each tissue it affects, and the impact of subsequent neoplastic change are important questions to address in elucidating the pathogenesis of neoplastic transformation.

Accrual of somatic mutations has been proposed as the primary biological mechanism underlying aging19,20,21,22,23,24. This hypothesis is based on the premises that (1) mutations accumulate throughout life and (2) higher mutation loads cause widespread malfunction of cell biology19,21,22,24. Recent reports have confirmed that the somatic mutation burden in normal cells does increase during life in a more-or-less linear manner25,26,27,28,29,30,31,32,33, compatible with a causal role for somatic mutations in aging. However, somatic mutations could, in principle, accumulate without significant biological consequences. Thus, study of individuals with inherited POLE or POLD1 exonuclease domain mutations could provide insight into the wider biological consequences of elevated mutation burdens and the pathogenesis of aging.

## Results

### Clinical information and samples

Fourteen individuals, aged between 17 and 72 years and each carrying one of four different germline exonuclease domain mutations in POLE or POLD1 (POLE L424V (n = 8), POLD1 S478N (n = 4), POLD1 L474P (n = 1) and POLD1 D316N (n = 1)), were studied. Eleven had a history of five or more colorectal adenomas, with age at first polyp diagnosis ranging from 15 to 58 years. Five were diagnosed with colorectal cancer, all before age 50 years, and all had a known family history of colorectal adenoma, colorectal cancer and/or other cancers. No other consistent phenotypic abnormalities were reported (Supplementary Table 1)18.

### Mutagenesis in normal intestinal stem cells

The epithelial cell population of an intestinal crypt is a clone derived from a single ancestral crypt stem cell that existed <10 years before sampling34,35,36,37. Somatic mutations found in the large majority of cells in a crypt, and thus with a high variant allele fraction (VAF), recapitulate the set of mutations present in that ancestral cell28. Thus, to investigate somatic mutation burdens and rates in normal cells from POLE and POLD1 germline mutation carriers, 109 normal intestinal crypts (colorectum, n = 85; ileum, n = 10; duodenum, n = 14) were individually isolated by laser-capture microdissection from biopsy and surgical resection samples of 13 individuals and whole-genome sequenced (WGS) (median 33.5-fold coverage) (Methods, Supplementary Note and Supplementary Table 2).

Somatic SBS burdens in the seven individuals with POLE L424V and four with POLD1 S478N correlated with age, indicating that mutation accumulation is likely to be continuous through life, linear and at similar rates in each individual carrying the same mutation (linear mixed-effects model R2 = 0.87; Supplementary Note). Crypts from individuals with POLE L424V showed an average SBS mutation rate of 331 per year (linear mixed-effects model 95% confidence interval (CI) 259–403, P = 10−12) (Fig. 1a,b and Supplementary Note). The POLD1 S478N germline mutations were associated with an SBS rate of 152 per year (linear mixed-effects model 95% CI 128–176, P = 10−17), and POLD1 D316N and L474P were associated with an SBS rate of 58 per year (linear mixed-effects model 95% CI 51–65, P = 10−22). By comparison, intestinal crypts from healthy individuals acquire 49 SBS per year28 (linear mixed-effects model 95% CI 46–52, P = 10−36). Therefore increased somatic SBS rates are present in all normal intestinal cells of individuals with POLE or POLD1 germline mutations (Fig. 1), although there are differences in mutation rates between POLE (~sevenfold higher than normal individuals) and POLD1 (up to threefold higher) germline mutations, and between different POLD1 mutations. Indeed, individuals with POLD1 D316N and L474P exhibited relatively modest elevations of SBS rates (~1.2-fold). There was also evidence of differences in SBS rates between the seven individuals with POLE L424V, suggesting the existence of genetic and/or environmental modifiers of mutation rate (Fig. 1b and Supplementary Note).

Small insertion and deletion (ID) mutation rates in normal intestinal crypts were also elevated in individuals with germline POLE/POLD1 mutations, with rates of 13 per year (POLE L424V), 44 per year (POLD1 S478N) and 12 per year (POLD1 D316N and POLD1 L474P) (linear mixed-effects model, 95% CI 10–16, 35–53, 9–16, P = 10−10, P = 10−13 and P = 10−9, respectively). These are all substantially above the expected rate of one per year in individuals without POLE or POLD1 mutations28 (Fig. 1d and Supplementary Note).

Thus, in the normal intestinal crypt, POLE germline mutations generally confer greater increases in SBS mutation rates than POLD1 mutations, whereas POLD1 mutations confer elevations in ID mutation rates larger than or similar to POLE (Fig. 1e). These relative differences in SBS and ID mutation rates are consistent with previous findings in cancer genomes and experimental systems6,8,11,38,39,40. It is noteworthy that the SBS and/or ID mutation burdens in normal intestinal crypts from middle-aged and older individuals with germline polymerase mutations are higher than those observed in many human cancers41.

With the exception of one individual who had been treated with oxaliplatin for colorectal cancer and one with an incidental finding of mosaic trisomy of the X chromosome (47 XXX) (Extended Data Figs. 1a and 2), copy number changes and rearrangements were rare, occurring at a prevalence similar to normal crypts from normal individuals (Supplementary Table 2). Reductions in telomere length with age also occurred at rates similar to those of normal crypts from healthy individuals (Supplementary Table 2 and Supplementary Note).

Elevated genome-wide mutation burdens are associated with increases in protein-coding mutations (Extended Data Fig. 3a). These include nonsense and frameshift changes that are likely to impair protein function (Extended Data Fig. 3a). There was a greater increase in nonsense mutations (~sevenfold) compared to missense or synonymous mutations (~four- and ~threefold, respectively), which is due to the specific mutational signatures present42. There was also an increase in potential cancer ‘driver’ mutation burden compared with normal crypts from normal individuals (20/109 versus 26/445, P = 0.00005, chi-squared test; Extended Data Fig. 3b and Supplementary Table 3). However, this was broadly compatible with the increase in overall load and the spectrum of coding mutations in the colon (Extended Data Fig. 3c).

Crypt-like structures from six colorectal adenomas and one carcinoma from individuals with germline POLE or POLD1 mutations were also microdissected and sequenced. SBS and ID burdens were considerably higher than in normal crypts from the same individuals sampled at the same time (Fig. 1e,f), albeit with substantial variation between lesions. Therefore, increases in SBS and ID mutation rates are associated with conversion from a normal crypt to an adenoma or cancer crypt in individuals with POLE or POLD1 germline mutations, a pattern similar to that observed in healthy individuals43,44.

### Mutational signatures

Eleven SBS mutational signatures were observed in normal intestinal crypts from individuals with POLE and POLD1 germline mutations (Fig. 2a–c and Extended Data Fig. 4). Nine have been previously reported: SBS1, SBS5, SBS10a, SBS10b, SBS17b, SBS28, SBS35, SBS88 and SBS89 (https://cancer.sanger.ac.uk/cosmic/signatures)9,28. SBS1 (characterized by C>T substitutions at NCG trinucleotides and probably due to deamination of 5-methylcytosine) and SBS5 (of unknown etiology) are found in all normal intestinal crypts from healthy individuals, where they accumulate in a more-or-less linear manner with age7,9,28,45. SBS88 and SBS89 are found in normal intestinal crypts from some healthy individuals and are predominantly acquired during childhood28,46. SBS88 is likely due to colibactin, a mutagenic product of a strain of Escherichia coli sometimes present in the colon microbiome47. SBS10a, SBS10b and SBS28 were previously found in the subsets of colorectal, endometrial and other cancer types with somatically acquired POLE mutations7,9 (Fig. 2a). Two therapy-associated signatures were identified: an SBS35-like signature associated with platinum-based chemotherapy9,48 in an individual treated with oxaliplatin (Fig. 2c and Extended Data Fig. 1b–e), and a signature characterized predominantly by T>G mutations (SBS17b-like) in an individual treated with capecitabine48,49(Fig. 2f). Two previously unreported mutational signatures (SBS10c and SBS10d) were observed in normal and neoplastic crypts from individuals with germline POLD1 mutations. Both were characterized predominantly by C>A substitutions: in SBS10c at ACC, CCA, CCT, TCA and TCT trinucleotides and in SBS10d at TCA and TCT trinucleotides (the mutated bases are underlined) (Fig. 2d,e, Extended Data Figs. 57 and Methods).

The increases in SBS burdens in normal intestinal crypts from POLE germline mutation carriers compared to healthy individuals were almost completely attributable to SBS10a, SBS10b and SBS28 mutations, and in POLD1 mutation carriers to SBS10c mutations. By contrast, the estimated burdens of SBS1, SBS5, SBS88 and SBS89 found in normal intestinal crypts from POLE/POLD1 germline mutation carriers were similar to those expected in normal individuals of the same age. Thus, defective POLE/POLD1 proofreading appears not to substantially affect the rates of the mutational processes underlying SBS1, SBS5, SBS88 and SBS89. POLE and POLD1 are responsible for leading and lagging strand DNA synthesis, respectively1,2. Consistent with these roles, there was marked replication strand bias of SBS10a and SBS10b somatic mutations in POLE germline mutation carriers, with the opposite bias of SBS10c and SBS10d in POLD1 mutation carriers (Extended Data Fig. 7b).

The elevated mutation loads in adenoma and carcinoma crypts from POLE/POLD1 germline mutation carriers compared to normal intestinal crypts from each individual were also predominantly due to increased burdens of SBS10a and 10b (in POLE-mutant cases) and SBS10c and SBS10d (in POLD1-mutant cases) (Fig. 2f,g). However, crypts from POLE-mutant polyps showed greater relative increases in SBS10a than SBS10b compared to histologically normal POLE crypts (Fig. 2f,g). The mechanisms underlying these marked accelerations in mutation rates during neoplastic change, and why they apply differentially to the processes underlying the different signatures, are unknown.

Insertions and deletions in normal intestinal crypts from both POLE and POLD1 germline mutation carriers were dominated by single T insertions at T homopolymer tracts, characteristic of signature ID1. ID1 mutations were also further increased in neoplastic crypts compared to normal crypts from each individual (Fig. 1c,f, Extended Data Fig. 8 and Supplementary Table 2).

Cancer driver mutations found in crypts from normal intestine, and colorectal neoplasms from individuals with POLE/POLD1 germline mutations, showed SBS and ID mutational spectra similar to genome-wide spectra from normal intestinal crypts from these individuals (Extended Data Fig. 9a–c and Supplementary Table 3).

### Mutagenesis in other tissues

Cancers of the colorectum and endometrium are the predominant types associated with germline and somatic POLE/POLD1 mutations16,17,18,50. Similar to intestinal crypts, endometrial glands are clones derived from a single, recent ancestral stem cell, and whole-genome sequencing of a gland reveals the mutations present in that ancestral cell31,51. Eleven endometrial glands dissected from a 60-year-old individual carrying a POLE L424V germline mutation showed elevated rates of SBS (148 versus 29 per year in healthy individuals) (Fig. 3b) and ID (6 versus <1 per year in normal individuals) (Supplementary Table 2). SBS10a and SBS10b were responsible for the increase in SBS rate compared to healthy individuals (Fig. 3a,b). Somatic driver mutations in cancer genes are common in normal human endometrium31,32,52 and were found in a similar repertoire of genes in all but one gland from this individual (Extended Data Fig. 9e and Supplementary Table 3).

To further investigate mutagenesis in other cell types and embryonic germ layers, we microdissected and sequenced fragments of various tissues from a 46-year-old individual with a germline POLE L424V mutation (PD44594). In skin epithelium, aging- and UV-related signatures were accompanied by substantial contributions from SBS10a and SBS10b. The VAFs of mutations generated from the remaining tissues indicated that they contained many cell clones, and thus mutation burdens and mutational signatures were difficult to assess (Supplementary Table 2). For these, we used a modified duplex sequencing protocol technique, termed NanoSeq53, to investigate the mutational processes present. By sequencing single DNA molecules at low error rates, this method allows quantification of mutation burdens and signatures from tissues in which cells derived from many progenitors are intimately mixed, and clonal units for sequencing cannot be dissected (Methods). In all tissues we subjected to this method (smooth muscle, skeletal muscle, arterial tunica and cerebral cortex), aging-related signatures were accompanied by SBS10a and SBS10b. Therefore, in all evaluable tissues from this individual, SBS10a and SBS10b were clearly observed, indicating that these tissues carry elevated mutation burdens (Fig. 3a).

Next, we assessed the mutational processes in blood and in sperm from four individuals—two with POLE L424V and two with POLD1 mutations. All samples showed elevated total SBS rates compared with normal controls (Fig. 3a,b). Estimated yearly mutation rates due to SBS1 and SBS5 in blood and sperm were, however, consistent with normal controls53 and previous estimates54,55, and the excess mutation burdens were, for the most part, due to SBS10a and SBS10b (POLE-mutant individuals) and SBS10c (POLD1-mutant individuals) (Fig. 3a,b).

Age-related clonal hematopoiesis (ARCH) is a common condition caused by a characteristic set of somatically acquired driver mutations56,57. To investigate the effect on ARCH of the elevated mutation rates caused by germline POLE and POLD1 mutations, 22 blood samples from 14 individuals were sequenced to ~10,000-fold coverage for ARCH-associated driver mutations58. No evidence of these was observed at the standard 2% VAF threshold. The results are consistent with previous observations that hematological malignancies are not part of the clinical spectrum observed in these individuals16,17,18 and support the broader clinical findings that, despite the elevated genome-wide mutation rate, individuals with POLE and POLD1 mutations do not show an increased frequency of age-related phenotypes.

In summary, mutational signatures associated with POLE/POLD1 exonuclease domain mutations were found in all cells from all tissues examined and across the three germ layers. The elevation of burden was variable among tissues, being higher in intestinal crypts and endometrial glands than in other tissues. The results from sperm indicate that the elevated mutation rate extends beyond somatic tissues into the germline.

### Mutagenesis during early embryogenesis

Somatic mutations accumulate throughout development, from the first cell division onwards59,60,61,62. A mutation arising in an early embryonic cell may be present in a substantial proportion of adult cells and in multiple different tissues59. Early embryonic mutations can be detected in whole-genome sequences of highly polyclonal adult tissue samples as mutations with relatively high VAF59,60,62. Using this approach, putative early embryonic mutations were identified from whole-genome sequences of whole-blood samples. The embryonic mutational spectra of some POLE-mutant individuals exhibited significantly larger exposures to SBS10a, SBS10b and SBS28 (P < 0.05, Wilcoxon rank-sum test), whereas others were dominated by SBS1 and SBS5, the signatures normally responsible for early embryonic mutations62 (Fig. 3c). Similarly, in POLD1-mutant cases, the number of early embryonic single-base pair (bp) insertions was highly elevated in some, but not all, individuals (Fig. 3d). This heterogeneity reflects the inheritance pattern of the germline mutation and is probably a consequence of the maternal to zygotic transition of gene expression63. When a germline POLE/POLD1 mutation is paternally inherited, any effect on mutagenesis is delayed until zygotic genome activation, thus sparing the early embryo for the first few cell divisions. When maternally inherited, however, the defective proofreading polymerase is present in the cytoplasm of the ovum and POLE/POLD1 mutagenesis therefore occurs immediately after fertilization, leading to a high prevalence of such mutations in early embryogenesis. These results indicate that mutagenesis due to defective POLE/POLD1 proofreading is present at the earliest stages of life.

### Differential mutation burdens across the genome

We compared the distribution of somatic mutations across the genome to the mutation burden in the protein-coding exome in individuals with germline POLE/POLD1 mutations (Fig. 4a,b). Mutation burdens due to the various forms of SBS10 were heavily biased towards intronic, intergenic and late-replicating regions (Extended Data Fig. 7d,e), proportionately sparing protein-coding exons. This relatively constrained increase in mutation burdens in protein-coding exons may conceivably mitigate the biological consequences of elevated somatic mutation rates due to POLE/POLD1 germline mutations. Nevertheless, differential burdens between tissues were maintained and mutation rates in coding regions were increased in colon and endometrium more than in skin (Fig. 4b).

## Discussion

This study shows that multiple normal cell types from POLE/POLD1 exonuclease domain germline mutation carriers demonstrate the mutational signatures and elevated somatic SBS and ID mutation rates characteristic of defective proofreading by these polymerases. The results are consistent with the presence of elevated mutation rates in all cells of all types throughout life.

The extent of the elevation in mutation rate appears greater in intestinal and endometrial epithelium than in the other cell types analyzed. The basis for this variation is not understood, but may reflect different stem cell division rates. It may also, at least in part, explain the predilection for colorectal and endometrial cancer observed in these individuals.

The somatic mutation theory of aging proposes that the increasing somatic mutation burdens in normal cells, continuously accrued over a lifetime, have increasingly detrimental effects on cell function and thus engender the set of phenotypic features collectively termed aging19,20,21,22,23,24. The mutation burdens observed in cells from POLE/POLD1 mutation carriers are higher than those in normal individuals of the same ages. Therefore, POLE/POLD1 mutation carriers have elevated ‘mutational ages’ (Fig. 4c,d). The biological consequences of this generalized elevated mutational age appear, however, to be relatively limited. Other than the increase in incidence of colorectal, endometrial and other neoplasms, phenotype information from more than 100 POLE/POLD1 mutation carriers does not obviously reveal features of premature aging or early onset of age-related, non-neoplastic disease, and many survive into the late decades of the standard human lifespan16,17,18 (Supplementary Table 1). Therefore, the rare natural experiment of germline POLE/POLD1 exonuclease domain mutations leading to elevated mutation rates does not support a simple somatic mutation theory of aging. The results are, moreover, similar to those obtained in mice with engineered germline POLE and POLD1 exonuclease domain mutations10,11.

Important cautions, however, should temper this conclusion. First, more comprehensive measurement of somatic mutation burdens across cell types in POLE/POLD1 germline mutation carriers is indicated. The varying degrees of mutation rate elevation between cell types potentially leaves some—which could be particularly influential in generating the aging phenotype—relatively protected. Second, POLE/POLD1 exonuclease domain germline mutation carriers show burdens of somatic copy number changes, rearrangements and telomere erosion similar to normal individuals (Supplementary Table 2 and Supplementary Note). If aging depends on these mutation classes, it would not be accelerated in POLE/POLD1 mutation carriers. Finally, additional factors may mitigate the impact of elevated mutation burdens in POLE/POLD1 exonuclease domain carriers. For example, a disproportionately small fraction of the mutation burdens due to SBS10a,b,c,d falls in coding regions of the genome, potentially reducing its biological impact64.

Nevertheless, the results indicate that many normal human cell types throughout life tolerate high SBS and ID mutation rates and therefore that the direct effects of somatic mutation accumulation may not underlie all components of the progressive biological dysfunction termed aging.

## Methods

### Ethical approval and study participants

This research complies with all relevant ethical regulations. Patients were recruited as part of the CORGI 2 study, United Kingdom Research Ethics Committee (REC) no. 17/SC/0079. Additional sample collection was undertaken under approval from the following committees: London, Westminster; North East, Newcastle and North Tyneside 1; and NRES Committee East of England, Cambridge South (REC nos. EC04/015, 16/NE/003 and 07-MRE05-44, respectively). Informed consent was obtained from all participants and no monetary compensation was offered for their participation. A complete list of study participants and tissue samples is summarized in Supplementary Tables 1 and 2.

### DNA extraction from bulk samples

Frozen whole blood underwent DNA extraction using the Gentra Puregene Blood Kit (Qiagen). Briefly, 1–2 ml of frozen blood was thawed, lysed in RBC lysis solution and centrifuged. Cell pellet was resuspended in cell lysis solution and incubated at 37 °C for 2 h. RNA and protein were degraded using RNase A solution and protein precipitation solution, respectively. DNA was precipitated with isopropanol, and was extracted from semen samples using β-mercaptoethanol followed by phenol chloroform extraction65.

### Tissue preparation

Tissues were embedded in Optimal Cutting Temperature compound, and frozen histological sections were cut at 30 µm, mounted on polyethylene naphthalate (PEN) slides and fixed in 70% ethanol for 5 min, followed by two washes with PBS for 1 min each. Slides were manually stained in hematoxylin and eosin (H&E) using a conventional staining protocol. A subset of samples (PD44594c–h and PD44589f) were fixed in PAXgene Tissue FIX (Qiagen) according to the manufacturer’s instructions. Fixed tissue samples were embedded in paraffin using a Tissue-Tek tissue-processing machine (Sakura). No formalin was used in the preparation, storage, fixation or processing of samples. Processed tissue blocks were embedded in paraffin wax, sectioned to 10-µm thickness and mounted onto PEN slides (Leica). Tissue slides were stained using a standard H&E protocol. Slides were temporarily coverslipped and scanned on a NanoZoomer S60 Slide Scanner (Hamamatsu); images were viewed using NDP.View2 software (Hamamatsu).

### Laser-capture microdissection

Laser-capture microdissection was undertaken using a LMD7000 microscope (Leica) into a skirted 96-well PCR plate. Cell lysis was undertaken using 20 µl of proteinase-K PicoPure DNA Extraction kit (Arcturus), and samples were incubated at 65 °C for 3 h followed by proteinase denaturation at 75 °C for 30 min. Thereafter samples were stored at −20 °C before DNA library preparation.

### Intestinal crypt isolation

Crypts from one tissue block (PD44593e) were isolated using EDTA chelation. In brief, dissected mucosa was incubated in an EDTA solution and gently agitated, resulting in dissociation of intestinal crypts from the underlying components of the intestinal epithelium. Crypts were then separated under a light microscope and placed in ATL buffer (Qiagen) containing 10% (v/v) proteinase K and digested overnight at 56 °C. DNA extraction was performed using the QiaAMP DNA micro kit (Qiagen) as per the manufacturer’s instructions; DNA was then stored at −20 °C.

### Low-input DNA library preparation and sequencing

DNA library preparation of microdissected tissue samples was undertaken as previously described, using a bespoke low-input enzymatic-fragmentation-based library preparation method28,29,31,66. This method was employed as it allows for high-quality DNA library preparation from a very low starting quantity of material (100–500 cells). DNA library concentration was assessed after library preparation and used to guide the choice of samples to take forward to DNA sequencing. Minimum library concentration was 5 ng µl–1, and libraries with >15 ng µl–1 were preferentially chosen; 150-bp, paired-end Illumina reads were prepared with Unique Dual Index barcodes (Illumina).

DNA sequencing was undertaken on a NovaSeq 6000 platform using the XP kit (Illumina). Samples were multiplexed in pools of 6–24, then sequenced to achieve a coverage of ≥30×.

### Mutation calling and postprocessing filters

Sequencing reads were aligned to NCBI human genome GRCh37 and aligned using Burrows–Wheeler alignment (BWA-MEM). SBS were called using the algorithm cancer variants through expectation maximization67. Mutations were called using an unmatched normal synthetic bam file to retain early embryonic and somatic mutations. Postprocessing filters were applied to remove low-input, library-preparation-specific artifacts and germline mutations using a previously described method30,31,62,66. Filters applied: (1) common single-nucleotide polymorphisms (SNPs) were removed by filtering against a panel of 75 unmatched normal samples68; (2) to remove mapping artifacts, mutations were required to have a minimum median read alignment score of mutant reads ≥140, and fewer than half of the reads supporting the mutation should be clipped (clipped median = 0); (3) a filter was utilized to remove overlapping reads resulting from relatively short insert size, which could lead to double counting of variant reads; and (4) a filter was used to remove cruciform DNA structures that can arise during the low-input library preparation method.

Next, we applied multiple filters to remove germline variants and potential artifacts whilst retaining bona fide embryonic and somatic variants. This approach has been detailed in previous publications, and the code for these filters can be found at https://github.com/TimCoorens/Unmatched_NormSeq. Mutations were aggregated per patient, and a read pile-up was performed using an in-house algorithm (cgpVAF) to tabulate the read count of mutant and reference reads per sample for each mutation locus. Germline mutations were filtered out using an exact binomial test, which is used to distinguish germline from somatic variants and utilizes aggregate read counts from all samples of the same patient30,62. In brief, the read depth across all samples from that individual was calculated (median in this study, ~340×). This high coverage yields a very precise estimate of the true VAF of each mutation. While the VAF estimates of the earliest embryonic single-nucleotide variants (SNVs) and germline variants from samples sequenced at 30× might overlap, VAFs from the aggregate coverage from that individual (~340×) are distinguishable using statistical testing. To achieve this, the beta-binomial test was applied using the overdispersion parameter (ρ) threshold for genuine variants of ρ > 0.1.

Phylogenetic trees were created using MPBoot (v.1.1.0 bootstrapped, 1,000) and mutations were mapped to branches using maximum-likelihood assignment.

Indels were called using Pindel69 utilizing the same synthetic unmatched normal sample employed in SBS mutation calling. ID calls were filtered to remove those with a quality score of <300 (‘Qual’; sum of mapping qualities of the supporting reads) and a read depth of <15. Thereafter, ID filtering was performed in a similar manner to that of SBS, to remove germline variants and library preparation/sequencing artifacts.

### Copy number alteration calling

Somatic copy number variants (CNVs) were called using the algorithm allele‐specific copy number analysis of tumors (ASCAT)70 (https://github.com/Crick-CancerGenomics/ascat) in the ascatNGS package71. Bulk (blood or, in one case, tissue) samples were used as matched normals. ASCAT was initially run with default parameters. To reduce the number of false-positive calls that arise in normal tissue samples, a segmentation penalty was applied in the ASCAT ‘aspcf’ step. Optimum performance was observed with a penalty value of 100, which was subsequently applied to all samples. Copy number calls were further filtered to remove artifacts. Copy number (CN) calls <2 MB were excluded. Samples with a goodness-of-fit of <95% were excluded. CN calls at specific recurrent breakpoints were removed. Sharing of CNVs between samples from different tissue blocks and across individuals that violated phylogenetic structures implied from SBS and ID phylogenetic trees was treated as artifactual and removed from analysis. Similarly, any recurrent copy number calls with identical breakpoints that were observed across different individuals were also removed. CNV calls were manually verified by visualization of reads in JBrowse72.

### Structural variant calling

Whole-genome sequences were analyzed for somatic structural variants (SVs) using the algorithm breakpoints via assembly (BRASS)73; paired blood samples were used as controls. If no blood sample was available, a tissue sample was used that was phylogenetically distant to the sample under analysis. SV calls were filtered using an in-house algorithm in a multistage process using bespoke software (https://github.com/MathijsSanders/AnnotateBRASS). Finally, all SV calls were manually inspected to confirm somatic variants. SV calls in L1 transposon donor regions and fragile sites were excluded from the final SV analysis.

### Mutational signature analysis

The R package HDP (https://github.com/nicolaroberts/hdp), based on the hierarchical Dirichlet process74, was used to extract mutational signatures. Analysis of mutational signatures using this package has previously been applied to normal tissues28,29,30,31. In brief, this nonparametric Bayesian method models categorical count data using the hierarchical Dirichlet process. A hierarchical structure is established using patients as the first tier (parent nodes) and individual samples as the second tier (dependent nodes). Uniform Dirichlet priors were applied across all samples. The algorithm creates a mutation catalog for each sample and infers the distribution of signatures in any one sample using a Gibbs sampler. We performed mutational signatures analysis per branch, counting each branch of the phylogenetic tree as a distinct sample to avoid double counting of mutations. Since the Markov chain Monte Carlo process scales linearly with the number of counts, we randomly subsampled each branch to a maximum of 2,500 total substitutions. Branches with <100 mutations were excluded from the mutational signature extraction. No reference signatures were included as priors.

To assess the contribution of each mutational process, mutational signatures were refitted to all mutation counts of branches of phylogenies using the R package sigfit (https://github.com/kgori/sigfit)75. To avoid overfitting, a limited subset of reference mutational signatures were included per patient corresponding to the HDP signatures identified in that individual. In the case of SBS10d, it was fitted only to branches in which an exposure had originally been reported.

Further details of mutational signature extraction, validation and mutational signature assignment are included in the Supplementary Note.

### Cancer driver mutations

Cancer driver mutations were identified using two methods aiming to identify genes and mutations in this cohort that are subject to positive selection. Firstly, to identify mutations in cancer genes under positive selection in an unbiased manner, we ran a modified dNdS method76. To avoid double counting of mutations, only unique mutations (SBS and ID) mapped to branches of the phylogenetic trees were analyzed. dndscv was run using the following parameters: max_coding_muts_per_sample=5000 and max_muts_per_gene_per_sample=20. The mutational processes associated with defective DNA polymerases have a well-reported extended sequence context bias8,76 that alters the expected probability of observing a mutation in specific trinucleotide nucleotide contexts. To account for this bias, a modified dNdS method was applied. Global dNdS values for the expected number of each mutation type were replaced with corrected values taking into account the observed mutation subtype (synonymous, missense, nonsynonymous and splice site) totals. A generalized negative binomial linear model was applied to each mutation subtype accounting for the biased distribution observed. P values were combined using Fisher’s method, and multiple testing correction was performed with the Benjamini–Hochberg method. Genes with qval < 0.05 were considered to be under positive selection.

A second phase of cancer gene mutation analysis was undertaken, identifying mutations in this cohort that are codified in cancer mutation databases and exhibit characteristic traits of cancer driver mutations, an approach previously employed in the study of normal tissues30,31. In this phase of the analysis we sought to identify the spectrum and frequency of cancer driver mutations in this cohort. Somatic mutations (SBS and ID) were collated per sample from all tissues. Analysis was restricted to protein-coding regions, and mutations were filtered using lists of known cancer genes; mutations in samples from intestinal epithelium were filtered using a list of 90 genes associated with colorectal cancer28; samples from all other tissues, including blood, were filtered using a pan-cancer list of 369 driver genes76. Genes were then characterized according to their predominant molecular behavior: dominant, recessive or intermediate (those demonstrating aspects of both types of behavior) using the COSMIC Cancer Gene Census77. All candidate mutations were annotated using the cBioportal MutationMapper database (https://www.cbioportal.org/mutation_mapper). Mutations meeting the following criteria were considered to be driver mutations: truncating mutations (those causing a shortened RNA transcript, nonsense, essential splice site, splice region and frameshift ID) in recessively acting genes; known activating hotspot mutations in dominant (and recessive) genes; and, lastly, mutations in neither of the above categories but characterized by the MutationMapper database as being ‘likely oncogenic’ were also included in the final driver mutation catalog. We also sought to compare the frequency of driver mutations in histologically normal crypts with POLE and POLD1 mutations to those from individuals not carrying DNA polymerase mutations. Somatic mutations from 445 normal intestinal crypts28 were annotated and filtered using the above criteria. Comparison was made with normal intestinal crypts from this cohort of individuals with POLE and POLD1 germline mutations (Extended Data Fig. 3).

### Embryonic variant calling

Whole-genome sequencing of bulk blood samples was used to identify early embryonic SBS and ID mutations. Since bulk blood represents a very polyclonal tissue, variants found in blood reflect those generated in the first few cell divisions of life62. Variant counts from blood samples were included in the germline and artifact filtering, as described above. For SBS, a minimum VAF of 0.15 was required to be included in the embryonic set. Of the remaining SBS, 205 out of a total 385 (53%) were shared with intestinal samples, confirming they must have arisen before gastrulation. For ID, we set the minimum to 0.1 to reflect the higher levels of noise accompanying indel calling and variant read counting. For indels, this amounted to 28 out of 30 (93%) mutations.

To investigate the role of POLE mutagenesis in the early embryo, we used the mutational signature contribution to the observed SBS counts, given the highly elevated SBS mutation rate. We fitted SBS1, SBS5, SBS10a, SBS10b and SBS28 to patient-specific embryonic counts using SigFit. SBS1 and SBS5 reflect the normal background mutagenesis already present in the embryo60,62, while the other signatures are caused by defective POLE.

For POLD1 mutagenesis, we quantified the number of insertions of T at homopolymers of T, the characteristic peak in ID1 and the one dominating the indel landscape in patients with POLD1. We used insertions rather than SBS because of the relatively modest increase in SBS mutation rate, but a much higher increase in the rate of insertion acquisition.

### Modified duplex sequencing

DNA from bulk blood and sperm samples from four individuals with germline POLE and POLD1 mutations was extracted as outlined above. Samples from normal healthy controls were obtained and processed using the following method. Whole blood was diluted with PBS, and mononuclear cells (MNC) were isolated using lymphoprep (STEMCELL Technologies) density gradient centrifugation. The red blood cell and granulocyte fraction of the blood was then removed. The MNC fraction was depleted of red blood cells by lytic steps involving three incubations at room temperature for 20, 10 and 10 min, respectively, with RBC lysis buffer (BioLegend). DNA was extracted from sperm samples from a 21-year-old donor.

Our modified duplex sequencing method, called NanoSeq, relies on blunt-end restriction enzymes to fragment the genome to avoid errors associated with the filling of 5' overhangs and the extension of internal nicks during end repair after sonication. Our modified method has error rates <1 × 10–8 (ref. 53).

Sperm NanoSeq libraries from POLE/POLD1 mutants (median insert sizes of 289 bp) were sequenced as 150-bp paired-end reads on NovaSeq, resulting in 2.06–2.32 × 10–8 paired-end reads (20–23× coverage) and 2.5–3.2 × 109 duplex calls (~1× effective coverage). Multiple replicates of sperm (n = 7) and blood/granulocyte (n = 6) NanoSeq libraries from healthy donors were sequenced to higher depth, resulting in 1.5–1.6 × 109 paired-end reads (150–160× coverage) and 1.5–2.5 × 1010 bp calls (5–8× effective coverage).

Given the uneven frequencies of trinucleotides in the digested genome, the strong filtering of common SNP sites (typically occurring at CpG) and the strong dependence of mutation rates on trinucleotide contexts, our estimates of mutation burdens are normalized and projected onto genomic trinucleotide frequencies.

Let t denote the count of a given trinucleotide of type i = 1,…,32. The frequency of each trinucleotide is calculated separately for the genome $$f_i^g$$and for the NanoSeq experiment $$f_i^e$$ where:

$$f_i = \frac{{t_i}}{{\mathop {\sum }\nolimits_{i = 1}^{32} t_i}}$$

The ratio of genomic to experimental frequencies for a given trinucleotide is:

$$r_i = \frac{{f_i^g}}{{f_i^e}}$$

There are j = 1,…,classes of substitution where the mutated base is a pyrimidine. Let sij denote the count of substitution j in trinucleotide context i, giving a total of 96 substitution classes. Each substitution count is corrected as follows:

$$s_{ij}^\prime = s_{ij}r_i$$

The corrected substitution counts provide a substitution profile projected onto the human genome, and are also used to calculate the corrected mutation burden:

$$\beta ^\prime = \frac{{\mathop {\sum }\nolimits_{i = 1}^{32} \mathop {\sum }\nolimits_{j = 1}^6 s_{ij}^\prime }}{{\mathop {\sum }\nolimits_{i = 1}^{32} t_i}}$$

### Sequencing for ARCH-related variants in blood

Twenty-two blood samples were subjected to deep targeted sequencing (median coverage ~10,000×) using a gene panel of known drivers of clonal hematopoiesis58. Samples were sequenced on Illumina Hiseq4000 lanes using 75-bp paired-end reads. Sequencing reads were aligned to the human reference genome (GRCh37d5) using the BWA aligner. ShearwaterML27 was used to call somatic SNVs. This algorithm was developed to detect subclonal mutations in deep-sequencing data, by modeling the error rate at each site using information from the panel of normal unrelated samples. The normal panel we employ comprises data from 310 previously sequenced normal individuals (with no identifiable ARCH mutations) aged 42–89 years. Postprocessing filtering was performed as previously described, with a requirement for variants to have at least two supporting reads in both directions27. Germline variants were excluded by removal of those with VAF > 0.42. Probable false positives were removed by exclusion of variants with VAF < 0.005. Further filtering restricted analysis to variants causing nonsynonymous protein-coding changes or introducing a stop codon.

### Mutation burden in coding regions and mutational age calculations

The mutational ages shown in Fig. 4c,d were calculated by assessing the per-block mutation rate for normal intestinal crypts across the whole-genome and coding regions; average mutation rates were then calculated per individual for the whole-genome and coding regions. These calculations were repeated for individuals from a cohort of wild-type crypts28, generating an expected mutation rate for the whole-genome and coding regions. The relative increase of the observed mutation rate in individuals with DNA polymerase mutations versus wild-type crypts was used to generate mutational age.

### Telomere length estimation

Telomere attrition is a hallmark of cellular aging and is accelerated in certain disease processes. To assess the length of telomeres in the tissue samples in this cohort, we undertook estimation using two established methods from next-generation sequencing data.

Telomerecat is a ploidy-agnostic method of telomere length estimation (to bp resolution) from next-generation sequencing data that have been benchmarked across human and animal studies in normal tissues and cancers78. This method has been employed in previous studies of somatic mutations in normal mutations28,29,30. We generated telomere length estimates for all samples using 100 simulator runs (parameter -N100). Results for most, but not all, samples were plausible and showed a positive correlation with those from a second telomere length content algorithm (TelomereHunter). Approximately 30% of samples returned zero values for telomere length; similar observations have been made in other datasets sequenced on the Illumina NovaSeq platform. Results of the algorithm based on sequencing data generated by Illumina X10 and other sequencing platforms do not demonstrate this pattern, and can be relied upon. For this analysis we favored TelomereHunter, which is a well-established method used in tumor sequencing analyses79, shows good concordance with other methods of telomere length estimation80 and is reliable across all tested samples sequenced on the Illumina NovaSeq platform.

Telomere content measurements were generated by running TelomereHunter using default parameters across all histologically normal crypts in this cohort (n = 109) and normal crypts from a previous study that do not have DNA proofreading polymerase germline mutations (n = 445)28. To assess age-related telomere attrition in normal tissues, we fitted a linear mixed-effects model to assess the effect of age and also to test whether telomere attrition is greater in crypts with a DNA proofreading polymerase mutation. Age was fitted as a fixed effect and patient as a random effect; an additional dichotomous genotype variable was added as a fixed effect. We assume a similar length at birth, and thus fitted a fixed intercept and assessed the difference in slope between samples from this cohort and nonpredisposed crypts. We compared the model fit using analysis of variance and the difference between models using a chi-squared test. P-value thresholds of >0.05 were used (Supplementary Note)

### Research involving human gametes

Samples containing human gametes (sperm samples), collected under informed research consent from individuals enrolled in the CORGI v.2.0 study, were included in this manuscript. The CORGI v.2.0 study is approved by the United Kingdom NHS Research Ethics Committee (no. 17/SC/0079). Sperm samples were studied using genome sequencing. All experiments were conducted in accordance with the relevant international and national standards. No modification or cloning of gametes was performed.

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.