Substantial somatic genomic variation and selection for BCOR mutations in human induced pluripotent stem cells

We explored human induced pluripotent stem cells (hiPSCs) derived from different tissues to gain insights into genomic integrity at single-nucleotide resolution. We used genome sequencing data from two large hiPSC repositories involving 696 hiPSCs and daughter subclones. We find ultraviolet light (UV)-related damage in ~72% of skin fibroblast-derived hiPSCs (F-hiPSCs), occasionally resulting in substantial mutagenesis (up to 15 mutations per megabase). We demonstrate remarkable genomic heterogeneity between independent F-hiPSC clones derived during the same round of reprogramming due to oligoclonal fibroblast populations. In contrast, blood-derived hiPSCs (B-hiPSCs) had fewer mutations and no UV damage but a high prevalence of acquired BCOR mutations (26.9% of lines). We reveal strong selection pressure for BCOR mutations in F-hiPSCs and B-hiPSCs and provide evidence that they arise in vitro. Directed differentiation of hiPSCs and RNA sequencing showed that BCOR mutations have functional consequences. Our work strongly suggests that detailed nucleotide-resolution characterization is essential before using hiPSCs.

I n regenerative medicine, human induced pluripotent stem cells (hiPSCs) and latterly organoids have become attractive model systems because they can be propagated and differentiated into many cell types. Specifically, hiPSCs have been adopted as a cellular model of choice for in vitro disease modeling as well as being considered for cell-based therapies [1][2][3] .
The genomic integrity and tumorigenic potential of human pluripotent stem cells have been explored previously, but systematic large-scale, whole-genome assessments of mutagenesis at single-nucleotide resolution have been limited [4][5][6][7][8] . Human embryonic stem cells (hESCs) cultured in vitro have been reported to harbor TP53 mutations and recurrent chromosomal-scale genomic abnormalities ascribed to selection pressure [9][10][11][12][13][14] . However, in contrast, a recent study showed a low mutation burden in clinical-grade hESCs, and no cancer driver mutations were detected 15 .
The mutational burden in any given hiPSC comprises mutations that were preexisting in the parental somatic cells from which it was derived and mutations that have accumulated over the course of reprogramming, cell culture and passaging 7,[16][17][18][19][20][21][22] . Several small-scale genomic studies have shown that in some cell lines, preexisting somatic mutations make up a substantial proportion of the total burden [22][23][24][25][26][27][28] . With the advent of clinical trials using hiPSCs (e.g., NCT04339764) comes the need to gain in-depth understanding of the mutational landscape and potential risks of using these cells 29,30 . Here we contrast skin-derived (F-hiPSCs) and blood-derived (B-hiPSCs) from one individual. We then comprehensively assess hiPSCs from one of the world's largest stem cell banks, HipSci, and an alternative cohort called Insignia. All lines had been karyotypically prescreened and deemed as chromosomally stable. We utilized combinations of whole-genome sequencing (WGS) and whole-exome sequencing (WES) of 555 hiPSC samples and 141 B-hiPSC-derived subclones (Supplementary Table 1) to understand the extent and origin of genomic damage and the possible implications.
(S2_SF2_P2) had 1,873 single substitutions, 17 double substitutions and 71 indels (Fig. 1b). Mutational signature analysis demonstrated striking predominance of UV-associated substitutions (Reference Signature/COSMIC) Signature 7 (ref. 31 ) in the F-hiPSCs, characterized by C>T transitions at TCA, TCC and TCT ( Fig. 1b and Extended Data Fig. 1). This finding is consistent with previously published work that attributed UV signatures in hiPSCs to preexisting damage in parental skin fibroblasts 8,32 . In contrast, EPC-derived B-hiPSCs did not show any evidence of UV damage but showed patterns consistent with possible oxidative damage (signature 18 Table 2 provides source information. UV-specific features, such as elevated CC>TT double substitutions and UV mutational signatures, were enriched in F-hiPSCs. characterized by C>A mutations at TCT, GCA and ACA; Fig. 1 and Extended Data Fig. 1). Consistent with in vitro studies 33,34 , double substitutions were enriched in UV-damaged F-hiPSCs (Fig. 1b). In all, we concluded that F-hiPSCs carry UV-related genomic damage as a result of sunlight exposure in vivo that does not manifest in EPC-derived B-hiPSCs. Importantly, screening for copy-number aberrations underestimated the substantial substitution/indel-based variation that exists in hiPSCs.
High prevalence of UV-associated DNA damage in F-hiPSCs. We asked whether these findings were applicable across F-hiPSC lines generally. Therefore, we interrogated all lines in the HipSci stem cell bank, comprising 452 F-hiPSCs generated from 288 healthy individuals ( Fig. 2a and Supplementary Table 3). These F-hiPSC lines were generated using Sendai virus, cultured on irradiated mouse embryonic fibroblast feeder cells, which have been reported to help reduce genetic instability 20 and were expanded before sequencing (range, 7-46 passages; median, 18 passages; Extended Data Fig. 2) 35 . WGS data were available for 324 F-hiPSC and matched fibroblast lines (Fig. 2a). We sought somatic mutations and identi-  Table 4).
Mutational signature analysis 31 revealed that 72% of F-hiPSCs carried detectable substitution signatures of UV damage (Fig. 2c). hiPSCs with greater burden of UV-associated substitution signatures showed strong positive correlations with UV-associated CC>TT double substitutions 36,37 (Fig. 2d and Extended Data Fig. 3) and demonstrated clear transcriptional strand bias with an excess of C>T and CC>TT on the nontranscribed strand, enriched more in early replication timing domains 38-41 than in late ones (Fig. 2e,f). These findings are consistent with previous reports of UV-related mutagenesis observed in fibroblasts [36][37][38][39][40][41] . Of note, similar to findings of UV damage in skin 42 , there was no correlation between total mutation burden of F-hiPSCs and donor age or gender (Fig. 2g,h).

Substantial genomic heterogeneity between F-hiPSCs clones.
F-hiPSCs comprise the majority of hiPSCs in stem cell banks globally and are a prime candidate for use in disease modeling and cell-based therapies. Yet, we and others observed substantial heterogeneity between hiPSC sister lines generated from one reprogramming experiment (Fig. 1b, subject S2) 5,6,23,32 . It has been postulated that this heterogeneity may result from the presence of genetically diverse clones within the fibroblasts. To explore this further, we compared mutational profiles of 118 pairs of F-hiPSCs present in HipSci, each pair having resulted from the same reprogramming experiment. In all, 54 pairs (46%) of hiPSCs shared more than ten mutations and had similar substitution numbers and profiles (cosine similarity >0.9; Fig. 3a). The remaining 64 hiPSC pairs (54%) shared ten or fewer substitutions and were dissimilar in burden and profile (cosine similarity <0.9; Fig. 3a). We found some striking differences; for example, the F-hiPSC line HPSI0314i-bubh_1, derived from donor HPSI0314i-bubh, had 900 substitutions with no UV signature, whereas HPSI0314i-bubh_3 had 11,000 substitutions with representation mostly from UV-associated damage (>90%). Analysis of the parental fibroblast line HPSI0314i-bubh showed some, albeit reduced, evidence of UV-associated mutagenesis. Hence, we postulate that F-hiPSCs from the same reprogramming experiment could show considerable variation in mutation burden because of different levels of sunlight exposure to each parental skin fibroblast.
To investigate further, we analyzed bulk-sequenced skin fibroblasts, which revealed high burdens of substitutions, CC>TT double substitutions and indels (Extended Data Fig. 4a) consistent with UV exposure in the majority of fibroblasts (166/204) (Extended Data Fig. 4b). Mutation burdens of F-hiPSCs were positively correlated with their matched fibroblasts (Fig. 3b), directly implicating the fibroblast population as the root cause of F-hiPSC mutation burden and heterogeneity. Investigating variant allele frequency (VAF) distributions of somatic mutations in F-hiPSCs and fibroblasts demonstrated that most F-hiPSC populations were clonal (VAF = 0.5), whereas most fibroblast populations showed oligoclonality (VAF < 0.5) (Fig. 3c and Extended Data Fig. 5). At least two peaks were observed in VAF distributions of fibroblasts: a peak close to VAF = 0 (representing the neutral tail due to accumulation of mutations, which follows the power law distribution) and a peak close to VAF = 0.25 (representing the subclones in the fibroblast population). From the mean VAF of the cluster, we can estimate the relative size of the subclones (e.g., VAF = 0.25, indicating that the subclone occupies half of the fibroblast population 43 ). According to this principle, we found that ~68% of fibroblasts contain oligoclonal populations with a VAF < 0.25. The mutational burden of F-hiPSCs is thus dependent on which specific cells they were derived from in that oligoclonal fibroblast population. F-hiPSCs derived from the same subclone will be more similar to each other, whereas hiPSCs derived from different subclones within the fibroblast population could have hugely different mutation burdens (Fig. 3d).
There are important implications that arise from subclonal heterogeneity observed in fibroblasts. First, when detecting somatic mutations, it is preferable to compare the F-hiPSC genome to a matched germline sample if possible; otherwise, F-hiPSC mutations that are also present in a prominent fibroblast subclone will be dismissed as germline variants, giving a false sense of low DNA damage in F-hiPSCs ( Supplementary Fig. 1). Indeed, ~95% of HipSci F-hiPSCs had some shared mutations with matched fibroblasts (Extended Data Fig. 6) that demonstrated a strong UV signature (Extended Data Fig. 7). Second, some F-hiPSC mutations may be present in the parental fibroblast population but not detected through lack of sequencing depth. In comparing WES data of the originating fibroblasts, at standard and at high coverage (hcWES), we found that an increased sequencing depth uncovered additional coding mutations that had been acquired in vivo; WES data showed 47% of coding mutations detected in hiPSCs were shared with matched fibroblasts, compared to 64% using hcWES (Extended Data Fig. 8). The additional 17% of mutations identified only in hcWES exhibited a strong UV substitution signature (Extended Data Fig. 8). suggesting that they may have been acquired in vivo and have been present within the parental fibroblast population but undetected at standard sequence coverage. Given recent sequencing studies that have demonstrated a high level of cancer-associated mutations in normal cells 37,[44][45][46][47][48] , it is therefore probable that some mutations identified in hiPSCs but not detected in corresponding fibroblasts were still acquired in vivo and not during cell culture. Third, this work highlights the need for careful clone selection and comprehensive genomic characterization, as reprogramming can produce hiPSC clones with vastly different genetic landscapes.
Strong selection for BCOR mutations in hiPSCs. When we mapped hiPSC coding mutations to the COSMIC Cancer Gene Census (a proxy for genes that are selected for), we found a total of 272 mutations in 145 of these cancer genes, across 177 lines (177/452, 39%) from 137 donors. However, it is possible that some of these mutations are passenger mutations, given the heavy mutagenesis from UV damage in some hiPSCs. Hence, to determine potential selective advantage of coding mutations to F-hiPSCs, we conducted an agnostic analysis of selective pressure based on the ratio of divergence of non-synonymous to synonymous substitutions or dN/dS and hotspot analysis across the HipSci F-hiPSCs cohort by examining all genes in the genome 49 . Although several cancer genes were hit across multiple F-hiPSC lines (Supplementary Table 5), only BCOR was found to demonstrate statistically significant positive selection (q = 3.64 × 10 −8 ; Fig. 4a and Supplementary Table 6) using dNdScv 49 . To increase the statistical power for detection of selection, a restricted hypothesis test of known cancer genes 49 still revealed only BCOR mutations as significant. Interestingly, all the BCOR mutations found in 11 F-hiPSCs were truncating mutations and predicted to be pathogenic (Fig. 4b). No reads containing these BCOR mutations were seen in any founding fibroblast, even when  some fibroblasts were sequenced to high coverage (>150×, Fig. 4b), indicating that BCOR mutations observed in F-hiPSCs were more likely to have arisen in vitro.
We extended our analysis to B-hiPSCs in the HipSci cohort. These B-hiPSCs were generated from erythroblasts derived from peripheral blood. Three of 17 sequenced B-hiPSCs carried mutations in Hollow circles indicate that two hiPSCs from the same donor share fewer than ten mutations, whereas filled circles indicate that they share ten or more mutations. Colors denote cosine similarity values between mutation profiles of two hiPSCs. A very high score (purple hues) indicates a strong likeness. n = 164 fibroblasts with two derived hiPSCs. b, Correlations between number of mutations in hiPSCs and their matched fibroblasts, n = 324 WGS of hiPSCs. c, Summary of subclonal clusters in fibroblasts (n = 204 WGS of fibroblasts) and hiPSCs (n = 324 WGS of hiPSCs). Kernel density estimation was used to smooth the distribution. Local maximums and minimums were calculated to identify subclonal clusters. Each dot represents a cluster that has at least 10% of total mutations in the sample. Most fibroblasts are polyclonal with a cluster VAF near 0.25, whereas hiPSCs are mostly clonal, with VAF near 0.5. d, Schematic illustration of genomic heterogeneity in F-hiPSCs. Through bulk sequencing of fibroblasts, all of these cells will carry all the mutations that were present in the gray cell, their most recent common ancestor. The individual cells will also carry their unique mutations depending on the DNA damage received by each cell. Each hiPSC clone is derived from a single cell. Subclone 1 and subclone 2 cells are more closely related and could share a lot of mutations in common, because they share a more recent common ancestor. However, they are distinct cells and will create separate hiPSC clones. Subclone 6 (orange cells) and subclone 12 (red cells) are not closely related to the green cells and have received more DNA damage from UV, making them genomically divergent from the green cells. They could still share some mutations in common, because they shared a common ancestor at some early point, but will have many of their own unique mutations (largely due to UV damage). Table 7), a higher proportion than F-hiPSCs in HipSci (P = 0.0076, binomial test). However, many B-hiPSCs in the HipSci cohort did not have matched germline samples to perform subtraction of germline variation, rendering it possible (even if unlikely) for the BCOR mutations to be germline in origin. Therefore, we sought alternative cohorts of hiPSCs. Blood derivation methods for generating hiPSCs have been gaining popularity due to the ease of sample collection but are much less common than F-hiPSCs, and large cohorts of genomically characterized B-hiPSCs do not exist. Nevertheless, we accessed erythroblast-derived B-hiPSCs created from 78 individuals who were part of the Insignia project 50 , comprising 53 patients with inherited DNA repair defects, 5 patients with exposure to environmental agents and 20 healthy controls (Supplementary Table 8  Source of recurrent BCOR mutations. BCOR encodes for the BCL6-corepressor protein and is a member of the ankyrin repeat domain containing gene family. The corepressor expressed by BCOR binds to BCL6, a DNA-binding protein that acts as a transcription repressor for genes involved in regulation of B cells. Somatic mutations in BCOR have been reported in hematological malignancies, including acute myeloid leukemia and myelodysplastic syndromes [51][52][53] , and has also been reported at a low prevalence in other cancers, including lung, endometrial, breast and colon cancers 54 . The high prevalence of BCOR mutations in B-hiPSCs, but not F-hiPSCs, led us to ask whether these could have been derived from hematopoietic stem cell clones that were present in the donors. Clonal hematopoiesis (CH) has been reported in older individuals 55,56 . Some of the most common genes that are mutated in CH include DNMT3A, TET2, ASXL1, JAK2 and TP53 [57][58][59] . BCOR is not one of the frequently mutated CH genes but has been reported in rare cases of aplastic anemia 53 . However, DNMT3A, reported once in our cohort, could represent a B-hiPSC derived from a CH clone, as the mutation p.G543V has also been reported as a CH variant 60 . Another possibility is that BCOR dysregulation is selected for in the culture process particularly in erythroblast-derived hiPSCs.   To examine these possibilities, we asked whether the BCOR variants could be detected earlier in the B-hiPSC derivation process, either at the erythroblast population stage or in the germline DNA sample. We did not observe any of these BCOR variants in any of the sequencing reads in either erythroblast or germline DNA samples. It is nevertheless possible that CH clones escaped detection through lack of very deep sequencing depth.

BCOR (Supplementary
Notably, 6 of 21 BCOR mutations were present at lower VAFs (VAF < 0.3) in the B-hiPSCs and could represent subclonal BCOR variants within the parental B-hiPSC clone. This would suggest that BCOR mutations were arising because of ongoing selection pressure in culture following erythroblast derivation or after reprogramming. To investigate further, we cultured parental B-hiPSCs for 12-15 days. A minimum of two (and up to four) subclones were derived from the parental lines ( Fig. 4c and Table 12). In all parental B-hiPSC clones where BCOR mutations were identified and where daughter subclones were successfully generated, the BCOR variants were recapitulated in related daughter subclones, serving as an internal validation of those BCOR variants ( Fig. 4c and Supplementary Table 10). Interestingly, 7 additional B-hiPSCs that did not have BCOR mutations in the parental B-hiPSC population developed new BCOR mutations in daughter subclones; MSH71 had p.P1229fs*5 BCOR mutations in both subclones but not the parent, MSH68 had p.D1118fs*22 BCOR mutations in one of two subclones, MSH13 had p.P1115fs*45 BCOR mutation in one of four subclones, MSH29 had a p.S1122fs*37 BCOR mutation in one of two subclones, MSH90 had a p.D1118fs*44 in BCOR in two subclones, MSH3 had a p.S158fs*28 BCOR mutation in all three subclones and MSH41 had p.R1398fs*4 in BCOR in two subclones ( Fig. 4c and Supplementary Table 10). Given that these BCOR mutations are present in some, but not all, subclones and are not present at a detectable frequency in the parental population, this finding suggests that they have arisen late in culture of the parental B-hiPSC. All BCOR mutations in B-hiPSCs were predicted to be truncating variants distributed throughout the gene (Fig. 4d). No correlation was found between mutation burden in parental line and BCOR status in either F-hiPSCs (P = 0.77, Mann-Whitney test) or B-hiPSCs (P = 0.55, Mann-Whitney test), indicating that BCOR mutations were not simply mutations in hypermutated samples (Fig. 4e).
Therefore, in this analysis, we found a high prevalence of recurrent BCOR mutations in two independent cohorts of erythroblast-derived B-hiPSCs (18% of HipSci and 27% of Insignia (~25% overall, across all B-hiPSCs)). We did not find BCOR mutations in originating erythroblasts or in germline DNA samples. Instead, single-cell-derived subclones variably carried new BCOR mutations, suggesting that there may be selection for BCOR dysfunction post-reprogramming in B-hiPSCs. We cannot definitively exclude CH as the source of BCOR mutations, as extremely high sequencing depths may be required to detect CH clones. However, BCOR is not frequently mutated in CH. The majority of donors were young (<45 yr), and some were healthy controls. All arguments against BCOR being due to CH, and more likely due to selection pressure in B-hiPSCs in culture. That BCOR mutations are seen at a lower frequency in F-hiPSCs and never in fibroblasts hints at a culture-related selection pressure (Fig. 4b). Why BCOR mutations are more enriched in B-hiPSCs remains unclear but may be related to the process of transforming peripheral blood mononuclear cells (PMBCs) toward the myeloid lineage.

Global transcriptional changes in BCOR-mutated B-hiPSCs.
To understand the functional impact of recurrent BCOR variants in B-hiPSCs, we performed RNA sequencing on B-hiPSCs from all 78 Insignia donors. Global transcriptomic analysis revealed two principal components (PCs) driving variance in the dataset (Fig. 5a). The first PC distinguished two groups, with almost all the BCOR-mutated B-hiPSCs (VAF > 0, highlighted in yellow or orange in Fig. 5a) restricted to one group, herewith termed BCOR-mut. The other group comprised B-hiPSCs with no BCOR mutations, BCOR-wt (VAF = 0, highlighted in gray or blue in Fig. 5a). The second PC distinguished the donors by gender; BCOR is on the X chromosome. This result suggests that BCOR mutations are associated with important global transcriptional changes in B-hiPSCs. Differential gene expression analysis showed that 10,486 genes were differently regulated in the BCOR-mut lines (Fig. 5b and Supplementary Table 13). Furthermore, we found that UTF1 (implicated in maintenance of pluripotency through chromatin regulation), VENTX, IRX4, PITX2 and MIXL1 (all homeobox-related proteins with various roles in embryonic patterning) and FOXC1 (important in the development of organs derived from the mesodermal-lineage) were strongly upregulated in BCOR-mut lines, whereas RAX (involved in development of the hypothalamus and retina) was strongly downregulated (Fig. 5b).

BCOR-mutant hiPSCs have impaired differentiation capacity.
BCOR is a member of the polycomb group of proteins (PRC1.1) that regulates self-renewal and differentiation of stem cells and is critical for maintaining pluripotency 61 . BCOR depletion has been associated with decreased polycomb repressive activity, initiating differentiation toward the endodermal and mesodermal lineages 61 . Thus, we next asked whether the differentiation capacity was compromised in BCOR-mut lines, particularly for ectodermal lineages. We performed directed differentiation toward a neuronal lineage, contrasting two independent BCOR-wt B-hiPSC lines, MSH34i2 and MSH30i3, and two independent BCOR-mut B-hiPSC lines, MSH40i2 and MSH93i6 (Fig. 5c,d), each with three biological replicates. To capture dynamic shifts through directed differentiation, we took samples during a time course, characterizing these lines morphologically and transcriptionally on day 0 as hiPSCs, day 6 and day 12 representative of early and late neural stem cell (NSC) induction stages respectively and on day 27 as neurons.
There were no overt morphological disparities between BCOR-mut and BCOR-wt colonies at the hiPSC stage, with near-identical brightfield images and immunofluorescence characterization of pluripotent markers SSEA4 and OCT4 (day 0; Fig. 5d, Extended Data Figs. 9 and 10a). However, upon NSC induction (days 6 and 12), BCOR-mut replicates showed inefficiency in differentiation confirmed by patchy expression of NSC marker PAX6 ( Fig. 5d and Extended Data Fig. 10b). At day 27, neuronal generation in BCOR-mut lines was markedly affected, as seen in depletion of neuronal marker TUBB3 ( Fig. 5d and Extended Data Fig. 10c).
In keeping with the morphological characterization, global transcriptomics of each hiPSC line at each stage of differentiation revealed differences between BCOR-mut and BCOR-wt B-hiPSCs ( Supplementary Fig. 2). At the pluripotent stage, BCOR-mut hiPSCs exhibited a modest reduction in BCOR expression but elevated levels of NANOG, KLF4 and NODAL compared to BCOR-wt (Fig. 5c), similar to reports in other pluripotent models with BCOR-PRC1.1 defects 61 .
At the NSC induction stage (days 6 and 12), transcriptional dynamics evolved to show that mesodermal markers such as PAX7, TBX1 and PAX3 were substantially elevated in the BCOR-mut compared to BCOR-wt B-hiPSCs, implicating a drive toward mesodermal lineages in BCOR compromised lines (Fig. 5c). This difference was maintained at late neuronal differentiation (day 27), where neuronal markers such as TUBB3, DCX and FOXG1 were upregulated in BCOR-wt but not BCOR-mut B-hiPSCs, underscoring the failure of neuronal differentiation in the latter (Fig. 5c).
In all, these results demonstrate that B-hiPSC lines that acquire BCOR mutations may have compromised differentiation potential. BCOR-mutated lines seem to be less efficient at differentiation toward neuronal lineages and transcriptionally appear primed to differentiate toward mesodermal lineages.
Long-term culture increases 8-oxo-dG DNA damage. Finally, we examined the genomic effects of in vitro hiPSC culture. Mutational signature analyses revealed that all hiPSCs, regardless of primary cell of origin, have imprints of substitution signature 18, previously hypothesized to be due to oxidative damage in culture 22 (Figs. 1b  and 2c). Recently, knockouts of OGG1, a gene encoding a glycosylase in base excision repair that specifically removes 8-oxoguanine (8-oxo-dG), have been shown to result in mutational signatures characterized by C>A at ACA>AAA, GCA>GAA, GCT>GAT, identical to signature 18 (ref. 62 ). Of note, signature 18 has also been reported in other cell culture systems such as in ES cells 15 , near haploid cell lines 63 and human tissue organoids in which the contribution to the overall mutation burden was reported to increase with in vitro culture 64,65 .
We examined mutations shared between hiPSCs and their matched fibroblasts, representing in vivo and/or early in vitro mutations, and private mutations that are only present in hiPSCs, most (but not all) of which are likely representative of mutations acquired in vitro. We observed that signature 18 is enriched among private mutations of nearly all hiPSCs (313 or 97%). By contrast, the Transcriptomics changes are correlated with BCOR mutation VAF values. 0* denotes that no BCOR mutations were found in hiPSC parental lines but were seen in their subclones. b, Volcano plot of differential gene expression analysis in hiPSCs with BCOR mutations (BCOR-mut) and no BCOR mutations (BCOR-wt) lines. FC, fold change. c, Heatmap of relative expression levels of members of noncanonical polycomb repressive complex 1.1 (PRC1.1), hiPSCs pluripotency and the three germ layers markers for two BCOR-wt (blue: MSH34i2, MSH30i3) and two BCOR-mut (orange: MSH40i2, MSH93i6) hiPSC lines. d, Immunofluorescence characterization of hiPSCs, neural stem cells (NSCs) and neurons at day 0, day 12 and day 27, respectively. Both BCOR-mut and BCOR-wt hiPSCs have similar undifferentiated morphology and express pluripotency markers OCT4/POU5F1 (red) and SSEA4 (green). BCOR-mut lines undergo inefficient neural induction, as highlighted by a reduced number of NSCs expressing PAX6 (green) at day 12 and a reduced number of neurons marked by TUBB3 (green) at day 27. majority of hiPSCs (262 or 80%) showed no evidence of signature 18 among shared variants (Fig. 6a-c and Supplementary Fig. 3). We then investigated the relationship between signature 18 and passage number in the HipSci cohort. We found that there was a positive correlation between signature 18 and passage number (correlation = 0.327; P = 5.013 × 10 −9 ; Fig. 6d), reinforcing the notion that prolonged time in culture is likely to be associated with increased acquisition of somatic mutations through elevated levels of DNA damage from 8-oxo-dG (Fig. 6e).

Discussion
hiPSCs are on the verge of entering clinical practice. Therefore, there is a need to better understand the breadth and source of mutations in hiPSCs to minimize the risk of harm. Crucially, our work shows that the choice of a starting material for hiPSC derivation is complex. We demonstrate copious UV-associated genomic damage and substantial variation in genomic integrity between different clones from the same reprogramming experiment. In all, 39% of F-hiPSCs carried at least one mutation in a cancer driver gene. However, only BCOR mutations showed evidence of statistically significant positive selection, both in F-hiPSCs and in B-hiPSCs (prevalence of 25%), despite lower rates of genome-wide mutagenesis in the latter. The lack of BCOR mutations in the founder somatic cell populations (fibroblasts and erythroblasts) and the finding of new BCOR mutations arising in subclones during propagation experiments provides support for strong selection for BCOR variants in hiPSC culture. Furthermore, our work demonstrates that these mutations are associated with transcriptomic changes and influence differentiation of hiPSCs. Finally, both F-hiPSCs and B-hiPSCs exhibit oxidative damage that is increased by long-term culture. Critically, all of these lines had been previously screened for large-scale aberrations, and therefore, our work shows the value of WGS to fully capture variation at the resolution of nucleotides in hiPSCs. Further work will be necessary to investigate the functional significance of these mutations; for example, do they predispose cells toward malignant transformation or alter the fate of differentiated progeny? Our work highlights best practice points to consider when establishing hiPSC cellular models: an originating somatic cell type with low levels of preexisting genomic damage and minimizing duration of cell culture. Ultimately, however, comprehensive genomic characterization is indispensable to fully understand the magnitude and significance of mutagenesis in all cellular models.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41588-022-01147-3.

Methods
Samples. The research project complied with all relevant ethical regulations, and the protocols were approved by research ethics committees (details below). All participants were recruited voluntarily; provided written, informed consent; and were not financially compensated. The procedures used to derive fibroblasts and EPCs from two donors (S2 and S7) were previously described 22 . Two F-hiPSCs were obtained from S2. Four and two EPC B-hiPSCs were obtained from S7 and S2, respectively. A total of 452 F-hiPSCs were obtained from 288 donors and 17 erythroblast B-hiPSCs were obtained from 9 donors from the HipSci project. A total of 78 erythroblast B-hiPSCs and 141 subclones were obtained from 78 donors from the Insignia project. The details of HipSci and Insignia iPSC lines are described below.
HipSci hiPSC line generation and growth. All lines were incubated at 37 °C and 5% CO 2 (ref. 35  For erythroblast derivation, two blood samples were obtained for each donor. All samples were scanned and checked for ethical approval. Peripheral blood mononuclear cell (PBMC) isolation, erythroblast expansion and hiPSC derivation were done by the Cellular Generation and Phenotyping facility at the Wellcome Sanger Institute, Hinxton. Briefly, whole-blood samples collected from consented patients were diluted with PBS, and PBMCs were separated using standard Ficoll Paque density gradient centrifugation method. Following the PBMC separation, cells were cultured in expansion media containing StemSpan H3000, stem cell factor, interleukin-3, erythropoietin IGF-1 and dexamethasone for a total of 9 days 66 .
Fibroblasts and erythroblasts were transduced using nonintegrating Sendai viral vectors expressing human OCT3/4, SOX2, KLF4 and MYC51 (CytoTune, Life Technologies, A1377801) according to the manufacturer's instructions and cultured on irradiated mouse embryonic fibroblasts (MEFs; CF1). The EPCs were transduced using four Moloney murine leukemia retroviruses containing the coding sequences of human OCT4, SOX2, KLF4 and C-MYC and also cultured on irradiated MEFs.
Following all reprogramming experiments, the medium was changed to hiPSC culture medium 35 containing advanced DMEM (Life Technologies), 10% knockout serum replacement (Life Technologies), 2 mM l-glutamine (Life Technologies), 0.007% 2-mercaptoethanol (Sigma-Aldrich) and 4 ng ml −1 recombinant zebrafish fibroblast growth factor 2 (CSCR, University of Cambridge) and 1% penicillin/ streptomycin (Life Technologies). Cells with an iPSC morphology first appeared approximately 14 to 28 days after transduction, and undifferentiated colonies (six per donor) were picked between days 28 and 40, transferred onto 12-well MEF-CF1 feeder plates and cultured in hiPSC medium with daily medium changes until ready to passage.
Successful reprogramming was confirmed via genotyping array and expression array 35 . Pluripotency quality control (QC) was performed based on the HipSci QC steps, including the PluriTest using expression microarray data from the Illumina HT12v4 platform and copy number variation and loss of heterozygosity (CNV/ LOH) detection using the HumanExome BeadChip Kit platform.
Pluripotent hiPSC lines were transferred onto feeder-free culture conditions, using 10 µg ml −1 Vitronectin XF (Stemcell Technologies)-coated plates and Essential 8 (E8) medium (DMEM/F12 (HAM), E8 supplement (50×) and 1% penicillin/ streptomycin; Life Technologies) 35 . The media was changed daily, and cells were passaged every 5-7 days, depending on the confluence and morphology of the cells, at a maximum 1:3 split ratio until established, usually at passage five or six. The passaging method involved washing the confluent plate with PBS and incubating with PBS-EDTA (0.5 mM) for 5-8 min. After removing the PBS-EDTA, cells were resuspended in E8 media and replated onto Vitronectin-coated plates 35 . Once the hiPSCs were established in culture, lines were selected based on morphological qualities (undifferentiated, roundness and compactness of colonies) and expanded for banking and characterization. DNA from fibroblasts and hiPSCs was extracted using Qiagen Chemistry on a QIAcube automated extraction platform.
HipSci hiPSC line whole-exome genome library preparation and sequencing. A 96-well plate containing 500 ng genomic DNA in 120 µl was cherry-picked and an Agilent Bravo robot used to transfer the gDNA into a Covaris plate with glass wells and adaptive focused acoustics (AFA) fibers. This plate was then loaded into the LE220 for the shearing process. The sheared DNA was then transferred out of this plate and into an Eppendorf TwinTec 96 plate using the Agilent Bravo robot. Samples were then purified ready for library prep. In this step, the Agilent NGS Workstation transferred AMPure-XP beads and the sheared DNA to a Nunc deep-well plate, then collected, and washed the bead-bound DNA. The DNA was eluted and transferred along with the AMPure-XP beads to a fresh Eppendorf TwinTec plate. Library construction comprised end repair, A-tailing and adapter ligation reactions, performed by a liquid handling robot.
In this step, the Agilent NGS Workstation transferred PEG/NaCl solution and the adapter ligated libraries containing AMPure-XP beads to a Nunc deep-well plate and size-selected the bead-bound DNA. The DNA was eluted and transferred to a fresh Eppendorf TwinTec plate. Agilent Bravo and Labtech Mosquito robotics were used to set up a 384-well quantitative polymerase chain reaction (qPCR) plate, which was then ready to be assayed on the Roche Lightcycler. The Bravo was used to create a qPCR assay plate. This 384-well qPCR plate was then placed in the Roche Lightcycler. A Beckman NX08-02 was used to create an equimolar pool of the indexed adapter ligated libraries. The final pool was then assayed against a known set of standards on the ABI StepOne Plus. The data from the qPCR assay was used to determine the concentration of the equimolar pool. The pool was normalized using Beckman NX08-02. All paired-end sequencing was performed using a range of Illumina HiSeq platforms as the lines were generated over many years (HiSeq 2000 onwards). The sequencing coverage of WGS, WES and hcWES in hiPSC lines are 41×, 72× and 271×, respectively.
HipSci hiPSC sequence alignment, QC and variant calling. Reads were aligned to the human genome assembly GRCh37d5 using bwa version 0.5.10 (ref. 68 ) ('bwa aln -q 15' and 'bwa sampe') followed by quality score recalibration and indel realignment using GATK version 1.5-9 (ref. 69 ) and duplicate marking using biobambam2 version 0.0.147. VerifyBamID version 1.1.3 was used to check for possible contamination of the cell lines, and all but one passed ( Supplementary Fig. 4).
Variable sites were called jointly in each fibroblast and hiPSC sample using BCFtools/mpileup and BCFtools/call version 1.4.25. The initial call set was then prefiltered to exclude germline variants that were above 0.1% minor allele frequency in 1000 Genomes phase 3 (ref. 70 ) or ExAC 0.3.1 (ref. 71 ). For efficiency we also excluded low coverage sites that cannot reach statistical significance and for subsequent analyses considered only sites that had a minimum sequencing depth of 20 or more reads in both the fibroblast and hiPSC and at least 3 reads with a nonreference allele in either the fibroblast or hiPSC sample. At each variable site a Fisher's exact test was performed on a two-by-two contingency table, with rows representing the number of reference and alternate reads and the columns the fibroblast or hiPSC sample. This approach for mutation calling is implemented through BCFtools/ad-bias, and we have adopted it preferentially instead of existing tumor-normal somatic-variant calling tools because, by definition, tools developed for the analysis of tumor-normal data assume that mutations of interest are absent from the normal tissue. However, in our experiment, many mutations were present, albeit at low frequency, in the source tissue fibroblasts.
More information on bcftools ad-bias can be found on the online bcftools-man page at http://samtools.github.io/bcftools/bcftools-man.html. The ad-bias protocol is distributed as a plugin in the main bcftools package, which can be downloaded from http://www.htslib.org/download/. Bcftools ad-bias implements a Fisher test on a 2 × 2 contingency table that contains read counts of reference/alternate alleles found in either the iPSC or fibroblast sample. We ran bcftools ad-bias with default settings as follows: where 'exome.bcf ' was the BCF file created by our variant calling pipeline, described in Methods, and sample.pairs.txt was a file that contained matched pairs of the iPSC and corresponding fibroblast sample, one per line, as follows: We corrected for the total number of tests (84.8 M) using the Benjamini-Hochberg procedure at a false discovery rate of 5%, equivalent to a P value threshold of 9.9 × 10 −4 , to call a mutation as a significant change in allele frequency between the fibroblast and iPSC samples. Furthermore, we annotated sites from regions of low mappability and sites that overlapped with copy-number alterations previously called from array genotypes 35 and removed sites that had greater than 0.6 alternate allele frequency in either the fibroblast or hiPSC, as these sites are likely to be enriched for false positives. Dinucleotide mutations were called by sorting mutations occurring in the same iPSC line by genomic position and marking mutations that were immediately adjacent as dinucleotides.
Mutational signature analysis. Mutational signature analysis was performed on S7 EPC-hiPSCs, S2 F-hiPSCs, S2 EPC-hiPSCs and the HipSci F-hiPSC WGS dataset. All dinucleotide mutations were excluded from this analysis. We generated 96-channel single substitution profiles for 324 hiPSCs and 204 fibroblasts. We fitted previously discovered skin-specific substitutions to each sample using an R package (signature.tools.lib) 31 . Function SignatureFit_withBootstrap() was used with default parameters. In downstream analysis, the exposure of two UV-caused signatures Skin_D and Skin_J were summed up to represent the total signature exposure caused by UV (signature 7). A de novo signature extraction was performed on 324 WGS HipSci F-hiPSCs to confirm that the UV-associated skin signatures (Skin_D and Skin_J, signature 7) and culture-associated one (Skin_A, signature 18) are also the most prominent signatures identified in de novo signature extraction ( Supplementary Fig. 5).

Analysis of C>T/CC»TT transcriptional strand bias in replication timing regions. Reference information of replication timing regions were obtained from
Repli-seq data of the ENCODE project (https://www.encodeproject.org/) 73 . The transcriptional strand coordinates were inferred from the known footprints and transcriptional direction of protein coding genes. In our dataset, we first orientated all G>A and GG>AA to C>T and CC>TT (using pyrimidine as the mutated base). Then, we mapped C>T and CC>TT to the genomic coordinates of all gene footprints and replication timing regions. Lastly, we counted the number of C>T/ CC>TT mutations on transcribed and nontranscribed gene regions in different replication timing regions.

Identification of fibroblast-shared mutations and private mutations in HipSci F-hiPSCs.
We classified mutations (substitutions and indels) in HipSci F-hiPSCs into fibroblast-shared mutations and private mutations. Fibroblast-shared mutations in hiPSCs are the ones that have at least one read from the mutant allele found in the corresponding fibroblast. Private mutations are the ones that have no reads from the mutant allele in the fibroblast. Mutational signature fitting was performed separately for fibroblast-shared substitutions and private substitutions in hiPSCs. For indels, only the percentage of different indel types was compared between fibroblast-shared indels and private indels.

Clonality of samples.
We inspected the distribution of VAFs of substitutions in HipSci fibroblasts and HipSci F-hiPSCs. Almost all hiPSCs had VAFs distributed around 50%, indicating that they were clonal. In contrast, all fibroblasts had lower VAFs, which distributed around 25% or lower, indicating that they were oligoclonal. We computed kernel density estimates for VAF distributions of each sample. Based on the kernel density estimation, the number of clusters in a VAF distribution was determined by identification of the local maximum. Accordingly, the size of each cluster was estimated by summing up mutations having VAF between two local minimums.
Variant consequence annotation. Variant consequences were calculated using the Variant Effect Predictor 74 and BCFtools/csq 75 . For dinucleotide mutations, we recorded only the most impactful consequence of either of the two members of the dinucleotide, where the scale from least to most impactful was intergenic, intronic, synonymous, 3′ untranslated region, 5′ untranslated region, splice region, missense, splice donor, splice acceptor, start lost, stop lost and stop gained. We identified overlaps with putative cancer driver mutations using the COSMIC ' All Mutations in Census Genes' mutation list (CosmicMutantExportCensus.tsv.gz) version 92, 27 August 2020. dNdScv analysis. To detect genes under positive selection, we used dN/dS ratios as implemented in the dNdScv R package (https://github.com/im3sanger/dndscv) 49 . dNdScv uses maximum likelihood models to calculate the ratio of nonsynonymous to synonymous mutations per gene, normalized by sequence composition, trinucleotide substitution rates and the local mutability of each gene based on epigenetic covariates.
Three analyses were run for 452 F-hiPSCs and 78 B-hiPSCs sequencing data: (1) default dNdScv (exome-wide, looking at all genes in the genome or exome for selection); (2) restricted hypothesis testing of known cancer genes (to increase the statistical power on known drivers, using the gene list from Martincorena et al. 49 ); and (3) detection of mutational hotspots (using the sitednds function in dNdScv on hotspots detected in The Cancer Genome Atlas Before neural induction, three independent replicates of hiPSC from each donor line were generated and cultured for 1 week as described above. Healthy, nondifferentiating hiPSCs colonies were dissociated into single-cell suspension using TrypLE Express Enzyme (ThermoFisher Scientific, 12605010) and plated on Vitronectin FX-coated plates at 50,000 cells/cm 2 density in the presence of RevitalCell Supplement (ThermoFisher Scientific, A26445-01, lot 2170092). The cells were cultured for another 2 days until they reached 60-75% confluence. On day 13, the medium was changed to NDM without Y-27632, and the cells were allowed to differentiate for another 14 days. Two thirds of the medium was changed three times a week.
During the cell differentiation process, cell pellets from all culture replicates were harvested at days 0, 6, 12 and 27 (endpoint) for an RNA-sequencing serial time study. Immunostaining characterization was performed at days 0, 12 and 27 of differentiation to assess the differentiation efficiency.
Total RNA was extracted using PureLink RNA Mini Kit (ThermoFisher Scientific, 12183018 A) following the manufacturer's recommendations. The RNA was quality controlled; the cDNA libraries were prepared and sequenced using Illumina NovaSeq 6000 technology. Each sequenced sample had ≥20 million read pairs of 150-bp paired-end reads.
Processing RNA-sequencing data. Splice-aware STAR v2.5.0a 76 was used to map RNA-sequencing data to the reference genome. For the human decoy reference genome hs37d5.fa.gz, a genome index was first generated. Then, using the splice junction information from Gencode GTF annotation file v19, fastq files were mapped. The fragments of reads linked with the gene features were then counted using featureCounts v2.0.1 (ref. 77 ). The samples' raw counts matrices were then analyzed in R version 4.0.4. Differential gene expression was performed using the DESeq2 R package.
NSCs at day 12 and neurons at day 27 of differentiation were stained as already described before with minor modifications 78 . Briefly, the medium was discarded from the plates, and the cells were rinsed gently with DPBS. A 4% solution of paraformaldehyde was used to fix the cells for 20 min at room temperature. The cells were rinsed twice with DPBS and permeabilized for 20 min with 0.1% Triton X-100 (Sigma-Aldrich, T8787-50ML). Nonspecific epitopes were blocked with 0.5% BSA solution for 1 h at room temperature. Cells were incubated overnight at 4 °C with the primary antibodies as follows: on day 12, cells were incubated with an anti-PAX6 antibody (ThermoFisher Scientific, 14-9914-82, dilution 1:100); on day 27, cells were incubated with an anti-Tubulin beta III (TUBB3) (Millipore, MAB1637, dilution 1:400). The cells were then rinsed three times with 1× DPBS and incubated with the secondary antibody Alexa Fluor 488 donkey anti-mouse (ThermoFisher Scientific, A21202, dilution 1:500). The cells were rinsed three times with DPBS and the nuclei counterstained with NucBlue Fixed Cell Stain ReadyProbes (ThermoFisher Scientific, R37606). The images were acquired within 48 h using EVOS FL Auto 2 microscope (ThermoFisher Scientific, AMAFD2000), and the figures were made using the FigureJ plugin in ImageJ software.
Statistics and reproducibility. All statistical analyses were performed in R 79 . The effects of age and sex on mutation burden of F-hiPSCs were estimated using Mann-Whitney test, 'wilcox.test()' in R. Tests for correlation in the study were performed using 'cor.test()' in R.
For cancer driver mutations identified in HipSci F-hiPSCs, a two-sided Fisher test was used to call a mutation as a significant change in allele frequency between the fibroblast and iPSC samples (Supplementary Table 5). A Benjamini-Hochberg procedure for multiple hypothesis testing was used.
For the differentiation and immunostaining experiments, each BCOR-mut and BCOR-wt cell line had three independent biological replicates differentiated, and for each of these replicates, three wells were stained and imaged using immunofluorescence. At every stage of the neural differentiation (day 0, day 12 and day 27), a total of 36 images were analyzed for both BCOR-mut and BCOR-wt cell lines.
Differential gene expression analysis of Insignia B-hiPSCs was performed using DESeq2, which fits each gene's negative binomial generalized linear model. The default DESeq2 Wald test was used for significance testing, and a threshold of <0.05 of the adjusted P value was applied (Supplementary Table 13).
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All datasets generated as part of this study are included in this data availability statement with no omissions. Links to raw sequencing data produced in this study are available from the HipSci project website (WGS data: https://www.hipsci.org/ lines/#/assays/wgs; WES data https://www.hipsci.org/lines/#/assays/exomeseq

Extended Data Fig. 8 | Comparison of WeS (72X) and high-coverage WeS (hcWeS, 271X) of fibroblasts.
More mutations in hiPSCs were discovered in fibroblasts (shared mutations) through high-coverage Whole Exome Sequencing (hcWES) than through WES, resulting in the percentage of shared mutations increased in hcWES data. Interestingly, the mutational profile of these increased shared mutations is very similar to the UV signature, indicating that increasing sequencing depth enables more UV-caused somatic mutations that were found in hiPSCs to also be detected in the corresponding fibroblasts.