Analysis of somatic mutations in whole blood from 200,618 individuals identifies pervasive positive selection and novel drivers of clonal hematopoiesis

Human aging is marked by the emergence of a tapestry of clonal expansions in dividing tissues, particularly evident in blood as clonal hematopoiesis (CH). CH, linked to cancer risk and aging-related phenotypes, often stems from somatic mutations in a set of established genes. However, the majority of clones lack known drivers. Here we infer gene-level positive selection in whole blood exomes from 200,618 individuals in UK Biobank. We identify 17 additional genes, ZBTB33, ZNF318, ZNF234, SPRED2, SH2B3, SRCAP, SIK3, SRSF1, CHEK2, CCDC115, CCL22, BAX, YLPM1, MYD88, MTA2, MAGEC3 and IGLL5, under positive selection at a population level, and validate this selection pattern in 10,837 whole genomes from single-cell-derived hematopoietic colonies. Clones with mutations in these genes grow in frequency and size with age, comparable to classical CH drivers. They correlate with heightened risk of infection, death and hematological malignancy, highlighting the significance of these additional genes in the aging process.


Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested
A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code

Data collection
UK Biobank exome sequencing CRAM files were obtained from UK Biobank resources 23143 and 23144. Variant calling files for single-cell-derived haematopoietic colonies were available internally at the Sanger Institute and, for published works, are also publicly available as detailed in the following papers (Mitchell et al, Nature 2022, Williams et al, Nature 2022, Spencer Chapman et al, Nature 2021, Fabre et al, Nature 2022, Machado et al, Nature 2022, Spencer Chapman et al, Blood 2022). Diagnoses of individuals with haematopoietic colony sequencing were as provided by previous publications or as collected following informed consent under NHS Research Ethics Committee approval 18/EE/0199 and 07/MRE/44.

Data analysis
Mutect2 (broadinstitute/gatk:4.1.3.0) and Shearwater (v3_11, Gerstung et al, Bioinformatics 2014) were used for variant identification. 1000 Genomes, ExAC and gnomAD were used to remove sequencing artefacts and common germline SNPs. Variants were called within target capture regions (UKBB resource 3801) and 100 bp either side, and annotated using SnpEff (v4_3) and dbSNP build GRCh38.86. Variants with features commonly associated with false positives, such as alleles only supported by the end of the read, or reads with excessive edit distance, were excluded using FINGs v1.7.1. The R package dNdScv (https://github.com/im3sanger/dndscv) was used to detect gene and global level positive selection using default settings except for the following arguments: max_muts_per_gene_per_sample = Inf, use_indel_sites=T, max_coding_muts_per_sample = Inf. COSMIC v94 (Tate et al Nucleic Acids 2019) was used to create a prior for Shearwater (v3_11). Gene set enrichment and gene expression were analysed using data from Corces et al (GSE74912).

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.

nature portfolio | reporting summary
April 2023
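Two of the filtering steps described under Data analysis, restricting calls to the target capture regions plus 100 bp either side and removing common germline SNPs, can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used Mutect2, Shearwater and FINGs. The `Variant` class, the `pop_af` field, the example coordinates and the `MAX_POP_AF` cutoff are all hypothetical and chosen for illustration only; the source does not state an allele-frequency threshold.

```python
from dataclasses import dataclass

PAD_BP = 100        # from the text: 100 bp either side of each target region
MAX_POP_AF = 0.001  # ASSUMPTION: illustrative cutoff, not stated in the source


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int        # 1-based position
    ref: str
    alt: str
    pop_af: float   # hypothetical field: highest allele frequency in population databases


def in_padded_targets(variant, targets, pad=PAD_BP):
    """Return True if the variant lies within any target interval extended by `pad` bp.

    `targets` maps chromosome name to a list of (start, end) pairs, 1-based inclusive.
    """
    return any(
        start - pad <= variant.pos <= end + pad
        for start, end in targets.get(variant.chrom, [])
    )


def keep_candidates(variants, targets, max_pop_af=MAX_POP_AF):
    """Keep variants inside padded targets that are not common in the population."""
    return [
        v for v in variants
        if in_padded_targets(v, targets) and v.pop_af <= max_pop_af
    ]


# Made-up example data: one target interval and four variants.
EXAMPLE_TARGETS = {"chr2": [(1000, 1200)]}
EXAMPLE_VARIANTS = [
    Variant("chr2", 950, "C", "T", 0.0),   # inside the 100 bp padding: kept
    Variant("chr2", 850, "G", "A", 0.0),   # outside the padded interval: dropped
    Variant("chr2", 1100, "A", "G", 0.2),  # common in the population: dropped
    Variant("chr4", 1100, "T", "C", 0.0),  # chromosome has no targets: dropped
]

if __name__ == "__main__":
    for v in keep_candidates(EXAMPLE_VARIANTS, EXAMPLE_TARGETS):
        print(v.chrom, v.pos, v.ref, v.alt)
```

In practice the population frequencies would come from resources such as 1000 Genomes, ExAC and gnomAD, and the intervals from the capture design file. The sketch only shows the shape of the two checks, not how the original analysis implemented them.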