Need for high-resolution Genetic Analysis in iPSC: Results and Lessons from the ForIPS Consortium

Genetic integrity of induced pluripotent stem cells (iPSCs) is essential for their validity as disease models and for potential therapeutic use. We describe the comprehensive analysis in the ForIPS consortium: an iPSC collection from donors with neurological diseases and healthy controls. Characterization included pluripotency confirmation, fingerprinting, conventional and molecular karyotyping in all lines. In the majority, somatic copy number variants (CNVs) were identified. A subset with available matched donor DNA was selected for comparative exome sequencing. We identified single nucleotide variants (SNVs) at different allelic frequencies in each clone with high variability in mutational load. Low frequencies of variants in parental fibroblasts highlight the importance of germline samples. Somatic variant number was independent from reprogramming, cell type and passage. Comparison with disease genes and prediction scores suggest biological relevance for some variants. We show that high-throughput sequencing has value beyond SNV detection and that each clone needs to be evaluated individually.

Genetic variants influence cellular mechanisms, thus leading to specific phenotypic presentations in the organism, both in rare and common disease. Neurological disorders like Parkinson's disease (PD) typically comprise both rare and common genetic risk variants with large and small effect sizes, respectively. Studying the pathomechanism in patient cells is often limited because the disease relevant tissues are not accessible. Human embryonic stem cells (ESC) can be differentiated into cells from all three germ layers (endoderm, mesoderm, ectoderm) but pose legal and ethical issues. In contrast, induced pluripotent stem cells (iPSCs) can be derived from adult tissues using exogenous expression of four transcription factors (POU5F1, SOX2, KLF4, MYC) and can be differentiated into somatic cells in vitro [1][2][3] . Human iPSCs promise not only easy access to cells for scientists interested in disease modelling but also personalized medicine for patients affected by rare diseases.
While different protocols (non-/integrating viral, non-integrating non-/viral) for the generation of iPSC lines have been established, quality control (QC) during reprogramming, differentiation and culturing steps remains an area of active development 4 . Loss of genetic integrity as a source of variability in iPSCs 5 and in therefrom derived cells is a possible confounder compromising their validity as disease models. Certain genetic variants could be associated with increased risk of cancer or dysfunction when using these cells for regenerative therapeutic interventions. Indeed, tumorigenicity has been reported in transplanted stem cells 6 , and a recently published clinical trial using autologous iPSC derived retinal cells 7 was temporarily halted due to concerns of tumorigenic

Characteristics of individuals included and iPSCs generated in the ForIPS biobank resource.
The ForIPS study (Fig. 1A) included 23 individuals (11 females and 12 males) of which 9 individuals (5 females, 4 males) were healthy controls without any neurologic disease (CT), 14 were patients affected (AP) by one of three neurological diseases: PD (1 female, 8 males), hereditary spastic paraplegia (HSP, SPG11 gene, OMIM #604360 and *610844; 3 females), monogenic intellectual disability (ID; 2 females). The age at donation of fibroblasts ranged from 22 to 73 years (y) with a median of 45y. In CTs the age range was 23 to 70y with a median of 45y, and in APs the age range was 22 to 73y with a median of 45.5y. The oldest subgroup included individuals with PD (age range 36 to 73y, median 54y). Nine individuals were members of 4 families: "J2C" and "1JF" are father and son, "88H", "O3H" and "82A" are siblings, "PT1" and "CT1" are siblings and "55O" and "G7G" also are siblings ( Fig. 1B; see also Fig. S1 and File S1).
In the iPSC lines derived from fibroblasts, pluripotency was confirmed by positive staining for POU5F1 and NANOG for all iPSC lines, and fluorescence-activated cell scanning (FACS) analysis for TRA-1-60 was positive for >90% of the cells in each line (Fig. S1). All fibroblasts and iPSC lines generated in the ForIPS consortium that passed these pluripotency criteria were sent to genetic QC. A cell suspension from each culture was subjected to an initial integrity screening (Fig. 1A: "step 1") using conventional karyotyping to detect aneuploidies and larger chromosomal aberrations. In this first QC step ~15% of iPSC cultures were discarded due to significant chromosomal aberrations (Fig. S1 and File S2). In addition, DNA-based fingerprinting (PowerPlex assay) was employed to verify sample identity in most samples or was replaced by CMA-based fingerprinting (Fig. S1; File S2). Three iPSC lines did not match DNA from donor fibroblasts and were excluded from further analysis. For the remaining lines, fingerprinting matched with the respective fibroblast and with the reported donor sex. Samples which passed the first QC step were included in our subsequent studies (Fig. 1B; File S1). This group included 72 primary iPSC lines with a median number of 3 iPSC lines per individual (range 2 to 6) and a median passage number of 14 at time of analysis (range 2 to 39). Forty-nine of these iPSC lines were generated using integrating retroviral reprogramming (RiPSC) and 23 lines using non-integrating Sendai reprogramming (SiPSC) with the Yamanaka transcription factors 2,16 . RiPSCs had a higher median passage number of 15 (range 2 to 39) at analysis compared to 5 for SiPSCs (range 3 to 15). Four RiPSC lines from two individuals ("AY6", "82A") were differentiated into midbrain neuronal progenitor cells 1 (NPCs) and had a median passage number of 7.5 (range 5 to 13).
To investigate the relationship between passage number and somatic variants, four RiPSC lines from the same individuals were cultured to higher passages of 30 and 40, respectively.

Detection of somatic CNVs by high density SNP-based CMA. In a second analysis step all study samples passing step 1 (23 fibroblast cultures, 49 RiPSCs, 4 NPCs derived from these, 4 RiPSCs at passages 30 and 40, and 23 SiPSCs) were screened for CNVs with a high-resolution, single-nucleotide polymorphism (SNP)-based chromosomal microarray (CMA). We used the Affymetrix CytoScan HD array as it is an established and reliable tool in routine germline diagnostics at our Center for Rare Diseases 17,18 . Array QC measures passed manufacturer recommended thresholds in 97.2% of analyzed samples (105/108). The CMA data for the other three samples were only marginally below these thresholds and after manual review considered to be of sufficiently good quality (File S2; Fig. S3). The CMA for each analyzed culture was visually screened by a trained expert (M.K.) for aberrations ≥100 kilobases (kb) and absent from donor fibroblasts (Supplementary information). We identified a total of 93 sub-chromosomal CNVs with sizes ranging from 100 kb to 6.4 Mb (megabases) including 48 deletions and 45 duplications (Fig. 2). Most aberrations (91/93) were smaller than the lower detection limit of 5 to 10 Mb typically assumed for G-banded karyotyping 19 . In addition, we observed trisomy of chromosome 12 in three RiPSC cultures ("i1JF-R1-002", "i1E4-R1-012", "i1E4-R1-016"), twice only present in a sub-population of cells. In the SiPSC line "CT1-S1-010" we detected a copy number gain affecting all terminal markers on chromosome 17q. Despite its size of 5.9 Mb this CNV was not detectable by conventional karyotyping.
The chromosomal position indicated the possibility of an unbalanced translocation, which was confirmed by fluorescence in situ hybridization (FISH) analysis as a 14p/17q unbalanced translocation, probably of somatic origin (Fig. 3A,B). While karyotyping and CNV analysis based on intensity data of chromosome 9 showed unremarkable results in SiPSC line "i82A-S1-004", SNP allele peak distribution uncovered a copy neutral allelic imbalance on the long arm of chromosome 9 indicating a ~30% sub-clonal cell population carrying a partial uniparental isodisomy (Figs 3C and S2).

Next, we compared RiPSCs and SiPSCs to reveal method-specific differences: 58 somatic CNVs were detected in 34 of 49 (69.4%) RiPSCs, and 35 somatic CNVs in 17 of 23 (73.9%) SiPSCs. Only 15 of the RiPSCs (30.6%) and six SiPSCs (26.1%) showed no somatic CNVs. CNV size varied between 106 kb and 6.4 Mb in RiPSC, and between 100 kb and 5.9 Mb in SiPSC lines. The number of affected genes based on Genbank annotation varied between 0 and 139 with a higher variability in RiPSC lines. Three aberrations in RiPSC contained no genes, whereas all aberrations in SiPSC included genes. Our data showed no significant differences regarding number, size and gene content of somatic CNVs between RiPSC and SiPSC clones, indicating a comparable genetic cell quality (Fig. 2A-C). Also, there was no significant difference by sex, relatedness or affected status that could confound the analyses (Fig. S3).

Figure 1. Step 1: genetic fingerprinting and conventional karyotyping. Step 3: exome sequencing. (B) Graph showing the age distribution (x-axis) and phenotype of all donors. Fibroblast cultures are plotted as symbols (white = unrelated individuals; blue, red, green = related individuals) on the grey timeline (male = square; female = circle). The three-letter codes in these symbols represent each individual's donor IDs (see also Fig. S1C). The passage of the derived RiPSC (above) and SiPSC (below) cultures are plotted as circles connected to the respective fibroblast (y-axis; scattered for visualization). Derived NPCs are connected to the RiPSC they originated from. Red bars below the fibroblast symbols mark individuals with PBLs available selected for exome sequencing. See also File S1 for additional information. (C) Standardized nomenclature for variants/aberrations depending on the cell they arose in. The scheme compares the evolutionary history of a cancer cell (box "selection") which is subject to a strong selective pressure with that of a cultured cell (box "genetic drift") which is mainly subject to random genetic drift.

In four RiPSC clones cultured to higher passages we could not observe any CNV differences during passaging (File S3), and the average somatic CNV number aggregated per individual showed no correlation with passage number (Fig. 2D). Additionally, the somatic CNV count did not correlate with the probands' age at the time of biopsy (Fig. 2E).
NPCs showed the same CNVs detected in the corresponding RiPSC clones indicating genetic stability during differentiation ( Fig. 2A-C; File S3). In the NPC culture derived from the RiPSC "i82A-R1-001" we observed two previously fixed CNVs which had lower intensities in the NPC compatible with a ~50% sub-population: A somatic deletion affecting the DLG2 gene and a deletion affecting the genes VCX and PNPLA4 (Fig. S2). This observation shows that the RiPSC culture was initially oligoclonal, and points to either selective pressure of culture conditions or random genetic drift introduced by manual picking as the cause of the allelic shift in this NPC culture.
Although the identified somatic CNVs were scattered throughout the genome (Fig. 2F), we detected three regions representing possible, specific hotspots. First, two overlapping deletions affecting the CTNNA3 gene in 10q21.3 were identified in a RiPSC clone of individuals "88H" and "O3H", respectively (Fig. 3D, File S3). Second, three aberrations within the DLG2 gene were detected: two overlapping deletions in the iPSC clones "i82A-R1-001" and "i82A-R1-002" of "82A" as well as a duplication in the SiPSC clone "iK22-S1-001" of "K22" (Fig. S2). Many smaller and overlapping aberrations in both regions were observed in healthy control individuals (Database of Genomic Variants 20 ). Furthermore, a mosaic gain in 20q11.21 including the BCL2L1 gene was revealed in two different RiPSC clones of "PX7" and one clone of "1JF".

Exome sequencing comparing iPSC and germline donor material to detect SNVs/indels. We selected a subset of samples for comparative exome sequencing with the following inclusion criteria: (1) Availability of a germline DNA sample of the donor (blood) which was not a direct progenitor of the cultured cells (fibroblasts). (2) Availability of SiPSC, RiPSC and differentiated NPC lines of the same donor. (3) Access to higher passage samples of the lines. (4) Different affected status, age and sex. As the individuals "AY6", "PX7", "88H" and "82A" met these criteria, we selected a total of 34 samples (4 blood, 4 fibroblast, 8 RiPSC, 4 RiPSC passage 30, 4 RiPSC passage 40, 6 SiPSC, 4 NPC). Exome sequencing on an Illumina HiSeq2500 machine and standard preprocessing resulted in aligned BAM files (Supplementary information) with a median on-target coverage of 163x (range 117x to 264x) and ≥95% of the exome target being covered by at least 20 reads (File S2). Based on an initial feasibility test run with six exomes (Files S1 and S4; Fig. S4; Supplementary information) and previous experience from pooled 21 and somatic variant calling 22 , we used the freebayes software 23 , which simultaneously calls all classes of small nucleotide variants (SNVs = single nucleotide variants, MNPs = multiple nucleotide polymorphisms, indels = small insertions/deletions; when not specifically stated we use the term SNV/indel for all classes of small variants). All 34 exome samples were called together with 53 in-house controls from the same machine runs with freebayes and resulting variants were annotated with SnpEff 24 . From here on we describe somatic variants obtained after applying hard filters to exclude variants with read evidence in the blood samples (Supplementary information; File S4). We considered resulting variants with alternate allele fractions (AF) ≥30% as fixed somatic and variants with AF <30% as low frequency somatic variants (File S4 and Fig. S4). We identified a median of 38 fixed (minimum 17, maximum 256) and 1651 low frequency (minimum 739, maximum 3988) somatic SNVs/indels per sample in the coding target regions. We only report the results for the fixed variants and did not perform orthogonal validation (e.g. deep amplicon sequencing or digital PCR) for the low frequency somatic variants (see Fig. S4) as previously analyzed by others 13 .
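The AF-based split into fixed and low frequency somatic variants described above can be illustrated with a minimal Python sketch. Only the 30% threshold and the exclusion of variants with read evidence in blood come from the text; the `VariantCall` record, the function name and the rule that a single alternate read in blood excludes a variant are assumptions made for illustration, since the actual hard filters are defined in the Supplementary information.

```python
from dataclasses import dataclass

# Threshold stated in the text: AF >= 30% is treated as "fixed somatic".
FIXED_AF_THRESHOLD = 0.30


@dataclass
class VariantCall:
    """Hypothetical per-variant read counts (not the authors' data model)."""
    alt_reads_sample: int    # reads supporting the alternate allele in the culture
    total_reads_sample: int  # total reads at the position in the culture
    alt_reads_blood: int     # alternate-allele reads in the matched blood sample


def classify_variant(call: VariantCall) -> str:
    """Label a call as 'excluded', 'no_coverage', 'fixed' or 'low_frequency'."""
    # Assumption: any alternate read in blood counts as germline evidence.
    if call.alt_reads_blood > 0:
        return "excluded"
    if call.total_reads_sample == 0:
        return "no_coverage"
    allele_fraction = call.alt_reads_sample / call.total_reads_sample
    if allele_fraction >= FIXED_AF_THRESHOLD:
        return "fixed"
    return "low_frequency"
```

For example, `classify_variant(VariantCall(45, 100, 0))` yields `"fixed"`, whereas the same call with one supporting read in blood would be excluded under the assumed rule.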
In analogy to the CNV analysis, we investigated SiPSC, RiPSC and NPC exome data for reprogramming or differentiation specific effects. No significant differences were detected for somatic SNV/indel numbers between RiPSC and SiPSC clones or between RiPSC and their derived NPCs (Fig. 4A,B). Notably, the variance was higher for RiPSC (Fig. 4A), an effect resulting from specific cultures (compare Fig. 5A,B) with a much higher variant load.
As for CNVs, we found no correlation between somatic SNV/indel variant load and passage number (Fig. 4C). In contrast to the CNV analysis, the somatic SNV/indel count aggregated per individual showed a strong positive correlation with the probands' age at the time of biopsy (Fig. 4D). However, this observation is influenced by the above-mentioned iPSC cultures from older donors (Fig. 5).

Figure 4. (A) Box- and scatterplot comparing the total number of fixed somatic SNVs/indels in independently reprogrammed SiPSC (n = 6) and RiPSCs (n = 8) from four donors ("82A" = grey, "88H" = orange, "AY6" = blue, "PX7" = green). (B) Box- and scatterplot comparing the total number of fixed somatic variants in RiPSC and derived NPCs from donors "82A" (grey) and "AY6" (blue). No significant differences were detected either for somatic SNV/indel numbers between RiPSC and SiPSC clones or between RiPSC and their derived NPCs (two-sided Wilcoxon signed-rank test). Certain cultures have a much higher variant load ("82A" = grey, "88H" = orange). NPCs have the same variant profile as their progenitor cells. (C) Number of variants in four RiPSC lines ("i82A-R1-002" = grey, "i82A-R1-001" = yellow, "iAY6-R1-003" = blue, "iAY6-R1-004" = red) from donors "82A" and "AY6" cultured to higher passages vs. passage number. Diamonds mark the respective average SNV/indel count grouped by cell culture passage number (low passage numbers between 7 and 15 are considered as one group) intersected by a standard error bar.

Next, we analyzed specific properties of the identified somatic SNVs/indels. Variants predicted to have a moderate impact on gene function (mainly missense variants) represent the largest proportion of identified somatic variants (range 35% to 69%) per sample (Fig. 5A). In most iPSC samples, somatic variants were mainly SNVs, with only a small portion of indels and MNPs identified. However, four samples showed an unusually high proportion of MNPs (Fig. 5B). A closer examination of these samples (File S4) showed that the MNPs are mainly CC > TT dinucleotide mutations at dipyrimidines and that they additionally had an increase in C > T/G > A transitions (Fig. 5C), both mutational signatures typical for ultraviolet light (UV) irradiation damage 25 . Missense variants represented a large part of the identified somatic SNVs in the iPSC cultures. Compared to truncating variants their functional interpretation is difficult. We used different computational prediction scores to assess their potential pathogenicity. Interestingly, the scores obtained for a large portion (CADD: 44.1%, M-CAP: 35.0%, REVEL: 12.6%) of these somatic missense SNVs are above the respective recommended pathogenicity thresholds (Fig. 5D) [26][27][28] .
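The UV-associated substitution classes mentioned above (CC > TT dinucleotide changes and C > T/G > A transitions) can be tallied per sample with a short Python sketch. This is an illustrative simplification and not the authors' analysis: the function names are hypothetical, variants are assumed to arrive as plain (reference, alternate) allele pairs, and GG > AA is counted as the reverse-strand equivalent of CC > TT.

```python
from collections import Counter

# Substitutions typically attributed to UV damage, listed for both strands.
UV_DINUCLEOTIDE = {("CC", "TT"), ("GG", "AA")}
UV_TRANSITION = {("C", "T"), ("G", "A")}


def classify_substitution(ref: str, alt: str) -> str:
    """Assign a (ref, alt) allele pair to 'CC>TT', 'C>T' or 'other'."""
    pair = (ref.upper(), alt.upper())
    if pair in UV_DINUCLEOTIDE:
        return "CC>TT"
    if pair in UV_TRANSITION:
        return "C>T"
    return "other"


def signature_fractions(variants):
    """Return the fraction of variants per class for one sample."""
    counts = Counter(classify_substitution(ref, alt) for ref, alt in variants)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in ("CC>TT", "C>T", "other")}
```

A sample with an unusually large "CC>TT" fraction relative to the other cultures would be a candidate for the closer inspection described in the text.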
Our exome study design with concurrent sequencing and analysis of blood germline and parental fibroblast culture samples enabled us to search for evidence of low frequency somatic variants in fibroblasts due to polyclonality ("somatic mosaicism"). While low frequency variants in bulk sequencing data are inherently noisy when analyzed alone, prior knowledge of a fixed variant in a descendent culture sample increases the locus specific probability of low frequency reads being bona fide somatic variants 13,29 . Accordingly, the allele fraction (AF) for fixed variants in the analyzed iPSC cell cultures followed an expected normal distribution of around 0.5, while most of the variants with read evidence in the fibroblasts had a lower AF. In addition, variants at the lower coverage tails had a larger variance in AF influenced by random sampling (Figs 5E and S4). We found a correlation between read coverage at somatic variant positions in the iPSC cultures and AF in the corresponding fibroblast culture, indicating that somatic variants at low AF can only be found in the fibroblast if sufficient read coverage is available. Using a simple binomial draw model, we demonstrate that most variants potentially identifiable as being present in the fibroblasts (somatic) indeed do have reads supporting them (Fig. 5F). It is likely that the remaining somatic variants are still somatic but only present at a very low AF in the original fibroblast culture and that they were just not detectable by bulk exome sequencing 13 .
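The idea behind such a binomial draw model can be sketched as follows: for a variant present at a given allele fraction in the fibroblast culture, the number of supporting reads at a position with a given depth is treated as a binomial random variable, and the chance of seeing at least a minimum number of supporting reads follows directly. The parameterization below (function name, `min_alt_reads` default) is an assumption for illustration; the exact model used by the authors is not specified in this section.

```python
from math import comb


def prob_detect(depth: int, allele_fraction: float, min_alt_reads: int = 1) -> float:
    """Probability of observing at least `min_alt_reads` alternate reads.

    Assumes each of the `depth` reads independently carries the alternate
    allele with probability `allele_fraction` (simple binomial sampling).
    """
    # Probability mass of observing fewer than the required alternate reads.
    p_too_few = sum(
        comb(depth, k) * allele_fraction**k * (1.0 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few
```

Under these assumptions, a variant at 1% AF has roughly a 63% chance of yielding at least one supporting read at 100x depth, which illustrates why coverage limits the detection of low-AF variants in bulk exome data.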
Multiple secondary analyses revealed additional iPSC culture characteristics. While the mitochondrial genome ("chrM") is not targeted in most commercial exome designs, exome data still contain considerable mitochondrial coverage due to their high copy number in each cell. We calculated the average coverage of chrM (median 263x, minimum 66x, maximum 765x) and normalized it to the coverage of chromosome 1 (File S5). Fibroblast and R/SiPSC cell cultures showed a significantly higher mitochondrial genome dosage than NPC cultures and peripheral blood lymphocytes (PBLs) (Fig. 6A). Likewise, telomeric genomic regions are not targeted in exome designs but have a high relative coverage in the genome. We used two recently described software algorithms (telomerecat 30 , telomerehunter 31 ) to compute the relative telomere content from exome data and to correlate it with the passage numbers. While the estimates from both algorithms showed a trend towards less telomere content in higher passages, these results were not significant (Fig. 6B). It should be noted that the telomeric content of the 53 in-house exome controls used, when correlated with age, also showed a non-significant trend (Fig. S5).
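The normalization of mitochondrial coverage described above amounts to a ratio of mean depths. A minimal Python sketch follows, assuming per-position depth values for chrM and chromosome 1 have already been extracted from the BAM files by a separate tool; the function name and input format are hypothetical.

```python
from statistics import mean


def relative_mito_dosage(chrm_depths, chr1_depths) -> float:
    """Mean chrM coverage divided by mean chromosome 1 coverage.

    Both arguments are sequences of per-position read depths (assumed input).
    """
    if not chrm_depths or not chr1_depths:
        raise ValueError("depth lists must not be empty")
    chr1_mean = mean(chr1_depths)
    if chr1_mean == 0:
        raise ValueError("chromosome 1 has no coverage; cannot normalize")
    return mean(chrm_depths) / chr1_mean
```

Normalizing to an autosome in this way makes the values comparable between samples sequenced to different depths, which is needed before comparing cell types.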
In our initial exome variant calling test in RiPSCs we identified variants in the POU5F1 gene locus absent from the parental fibroblast. These were confirmed to be single nucleotide variants from the integrated viral vector (Fig. S6). We therefore excluded the genomic regions of all transcription factors used for reprogramming from variant calling (Supplementary information). When examining these regions, we noticed the coverage profile of the RiPSCs having sudden breaks at the exon-intron boundaries like the profile seen in RNAseq. In contrast, fibroblasts and SiPSCs show bell-like shapes over the capture probes, which is typical for capture-based enrichment (Fig. 6C). Our observation indicated multiple genomic integrations (Fig. S6) of the plasmid with intron-free transcription factor inserts used for reprogramming of the RiPSC lines.
We wondered whether algorithms for CNV detection from exome data could replace or supplement the widely accepted CMA analysis. The CNVkit algorithm 32 uses intergenic reads to achieve a more uniform marker coverage across the genome. While several CNVs detected previously by CMA were also called from exome data using this software, several others were missed (Figs 6D and S6; File S3).
Off-target reads can also be used to check sequencing data for DNA of microorganisms like mycoplasma or cross-individual contamination. We used the MinHash based BBSketch algorithm (https://jgi.doe.gov/data-and-tools/bbtools/) to screen our exome files for cell culture contamination but did not find any evidence for high-grade contamination (Fig. S5; File S5). Similarly, we could exclude significant cross-individual contamination, a known problem in iPSC cultures 33 , using the ContEst 34 software (Fig. S5; File S5).

Discussion
Since the discovery of reprogramming methods for somatic cells into pluripotency, the stem cell field has rapidly progressed 2,3 . Precise disease modelling and personalized treatment are some of the promises the iPSC technology is beginning to fulfill 7 . Though advances are increasingly encouraging, there is still considerable heterogeneity in research practices 4,35 . This is especially evident in genetic QC, which in recent years only has received systematic attention in large cohorts 5,9 . Despite a wealth of available experience from pioneering genetic fields regarding rare diseases or cancer genetics, the community has not yet agreed upon common minimal standards for an iPSC line to be acceptable as a model and to be safe for therapeutic use. Here, we describe the application of diagnostic grade technologies to ensure genetic integrity for a collection of iPSCs and differentiated progeny cells from the ForIPS consortium. Of the 72 primary iPSC lines presented here 61 were generated for the core ForIPS project (Parkinson's disease) and 30 of these (49.2%) have been distributed to subprojects for functional analyses at the time of the final project report.
We confirm the minimal standard of conventional karyotyping and genetic fingerprinting. G-banded karyotyping led to the exclusion of an appreciable proportion of cell lines with numerical chromosomal anomalies, at a comparable frequency with other reports 36 but also large structural chromosomal rearrangements, which are quite frequent in iPSCs (Figs 3A,B and S1; File S2). While this technique is considered relatively cheap, it requires a lot of hands-on work and does not produce results in a computable electronic form. CMA analysis for copy number aberrations can also identify aneuploidies. However, chromosomal rearrangements in a balanced state would be missed (Figs 3C and S1). Some groups perform optical mapping as an alternative screening method 14 . Despite its currently higher costs and the need for specific DNA extraction methods, its higher resolution and computational accessibility might make optical mapping a method of choice for structural aberrations. Also, genetic fingerprinting proved to be a valuable first line QC step which allowed us to resolve sample mix-ups. While short tandem repeat (STR)-based methods, like the one we used, are widely employed for identity testing, these do not allow sample tracking in a complete genetic pipeline. A single nucleotide polymorphism based profiling panel for sample tracking 37 would likely be more valuable for biobanks.
Our results using high density CMA showed that about 70% of iPSC lines have a detectable somatic CNV ≥100 kb, independent of the reprogramming method used ( Fig. 2A-C). This fraction is higher than in previous large reports 5,9,38 , which can be attributed to variable CMA resolution and differences in filtering and analysis between the studies. Indeed, a smaller study using the same CMA platform we chose, did also find CNVs in a  analyzed, independent of the reprogramming method used (Fig. 4A). Interestingly, every primary iPSC line had at least one fixed somatic high impact (truncating) SNV/indel and several somatic missense variants of which a large portion was predicted as damaging to the protein function by different computational scores (Fig. 5A,B,D). Several of the identified somatic variants affect genes implicated in cancer or monogenic diseases as well as genes with elevated expression in the brain (Table 1). These findings are well in line with previous reports 12 . Our results suggest a functional impact of certain somatic variants in the iPSC lines. Together with the high variability in somatic variant load observed for all variant classes (Figs 2A and 4A), even in isogenic lines, these observations signify that each line must be individually assessed before use in downstream experiments or therapeutic applications. In addition, we found no significant differences between integrating and non-integrating reprogramming methods regarding somatic CNVs ( Fig. 2A,B) and SNV/indel (Fig. 4A) counts, thus supporting a recent publication for SNVs/indels 14 . This information is of special value to researchers working with established RiPSC lines. The relationship between culture passaging and somatic variants count has been controversially discussed in the literature. While early analyses have described a negative correlation between CNV count and passaging 42 , recent studies using low resolution CMA 5,12 or whole genome sequencing 13 could not confirm this. 
Furthermore, an older study showed an increase in coding SNV counts from 7 to 13 for a single analyzed iPSC line between  43 . Our results do not support a strong effect of passaging on either CNV or SNV/indel counts (Figs 2E and 4C). The four NPC lines differentiated from RiPSCs in our study showed no additional CNVs (Fig. 2A-C) and have not significantly acquired SNVs/indels during differentiation (Fig. 4B). Together, these data argue against a strong effect of passage number on somatic variant count. Based on increasing numbers of somatic CNVs in aging individuals as demonstrated in cancer studies [44][45][46] , one would expect to find higher frequencies of this mutation type in iPSCs derived from older donors. Our results, however, demonstrated no significant correlation between donor age and somatic CNV count, confirming similar recent reports 5,12 . In contrast to CNVs, somatic SNV/indel load in exome regions has been shown to linearly increase with donor age in iPSCs derived from peripheral blood mononuclear cells 12 . We also confirm this observation in our iPSC sample collection derived from skin fibroblasts (Fig. 4D).

Table 1. Fixed variants with predicted loss-of-function effect in known cancer associated genes according to the COSMIC database (CGC), known disease genes (OMIM) or genes highly expressed in the brain according to the Human Protein Atlas (HPA). Inh., inheritance mode ("AD": autosomal dominant, "AR": autosomal recessive, "XLR": X-linked recessive); HGVS, Human Genome Variation Society nomenclature ("c.": coding DNA change, "p.": protein change; "p.?": consequence of the variant at protein level cannot be predicted without further functional assays); OMIM-G, OMIM (https://omim.org/) gene number; OMIM-P, OMIM phenotype number; CGC, COSMIC cancer gene census 56 gene list; HPA, human protein atlas 57 brain elevated gene set (File S6); pLI, probability of loss-of-function intolerance 58 ; "na": not available.
Altogether, our findings and the descriptions in the literature point to differences in the mutational mechanisms and cellular processes involved in the formation of somatic CNVs and SNVs/indels. Our results point to UV irradiation damage related somatic sub-clonality in the parental fibroblasts as a source for SNVs/MNPs and inter-culture variability (Figs 4A,D and 5A-C). Recent studies suggest that most variants identified in iPSC, but absent from the donor germline, are already present in a subpopulation of the cells of origin 12,13,15 . We also show extensive somatic mosaicism in the parental fibroblast cultures as a source for fixed somatic variants in iPSCs (Fig. 5F). Considering the data regarding passaging, we propose that random genetic drift induced by colony picking from poly-/oligoclonal cell cultures and not positive selection is a major cause of somatic variation in iPSC clones (Fig. 1C). This model is very different from the typical situation in cancer, where few "driver" mutations pose a strong advantage 47 in an environment of selective pressure, while most "passenger" variants are neutral (Fig. 1C). The goal in iPSC research is not to find detrimental driver mutations but to produce intact cells resembling the donor, thus successful strategies in cancer and iPSC fields will differ.
Mitochondria are crucial for cellular senescence and pluripotency in iPSCs 48 . Differences in mitochondrial morphology, count 49 and mitochondrial DNA (mtDNA) content 50,51 during pluripotent stem cell reprogramming and differentiation have been reported 52 . Our analysis of the mitochondrial genome content showed significant differences between PBLs, iPSCs and differentiated NPCs, but not between fibroblasts and iPSCs (Fig. 6A). A similar method for relative quantification of mtDNA from exome data has recently been compared to gold standard methods 53 . These data highlight the added value of high-throughput sequencing reads for complementary analyses with potential use in iPSC characterization. The application of our method in large studies will likely expand our current knowledge of mitochondrial function in iPSCs and their progeny. Our exemplary attempts at telomere content analysis, viral integration and CNV analysis from exome data show that these analyses are in principle possible but need further evaluation and calibration (Fig. 6B-D). Albeit applicable to exome data, most of the described techniques will likely lead to better results using whole genome sequencing data.
In conclusion, we applied high-resolution diagnostic methods in a systematic pipeline to ensure genetic stability of iPSCs generated in the ForIPS consortium and confirmed several previous associations in an iPSC collection from diverse donors. Most importantly, we showed that different clones have a high variability regarding somatic variant load. Based on our findings, 46/72 (63.9%) primary iPSC lines from the ForIPS study could be recommended for research distribution considering karyotype and CMA. This highlights that the genetic evaluation of each individual iPSC clone is fundamental prior to its use as model or for therapeutic purposes. A combination of karyotyping by optical mapping, CMA and exome sequencing will likely provide the best combination regarding cost and efficiency in the next years. From the primary iPSC lines with additional exome sequencing presented here, 6/14 (42.9%) could be recommended considering the exclusion of cell lines with a high impact (truncating) and fixed variant in genes involved in monogenic diseases, cancer or highly expressed in the brain (Table 1 and File S1). As even the smallest variant classes can have detrimental effects on important genes, we recommend an inspection of all iPSCs based on three pillars: karyotyping for balanced aberrations, CMA for CNV detection, and NGS to search for SNVs/indels. Starting with three iPSC lines and considering only karyotyping and CMA one would have a chance of ≥90% (binomial probability: $1-(1-0.639)^{3} \approx 0.953$) to end up with at least one iPSC line passing these two QC steps. However, when also considering exome sequencing one would already need eight starting iPSC lines for a chance of ≥90% (binomial probability: $1-(1-0.639 \times 0.429)^{8} \approx 0.923$) to have at least one iPSC line passing all three QC steps.
Ideally, these analyses should be performed on the initial iPSC cultures in comparison to an independent germline sample to find the best iPSC line before experimental use, and again on later derivatives to ensure validity of functional results before publication. Future work will have to determine an optimal cost-benefit ratio in large biobanks.
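The estimate of how many starting iPSC lines are needed can be reproduced with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes, as the probabilities quoted above implicitly do, that each line passes QC independently of the others and that the pass rates of the QC steps can be multiplied.

```python
def prob_at_least_one(pass_rate: float, n_lines: int) -> float:
    """Probability that at least one of n_lines independent iPSC lines passes QC."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if n_lines < 0:
        raise ValueError("n_lines must be non-negative")
    # Complement of the event "all n_lines fail QC".
    return 1.0 - (1.0 - pass_rate) ** n_lines


def min_lines_needed(pass_rate: float, target: float = 0.90) -> int:
    """Smallest number of starting lines giving at least `target` chance of one pass."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    n = 1
    while prob_at_least_one(pass_rate, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Pass rates reported in the text (assumed independent between QC steps).
    karyotype_cma = 0.639   # 46/72 lines passing karyotype and CMA
    exome = 0.429           # 6/14 lines passing the exome-based criteria

    print(prob_at_least_one(karyotype_cma, 3))
    print(prob_at_least_one(karyotype_cma * exome, 8))
    print(min_lines_needed(karyotype_cma))
    print(min_lines_needed(karyotype_cma * exome))
```

Under these assumptions, three lines are the minimum for the two-step QC and eight for the three-step QC; with two or seven lines, respectively, the probability stays below 90%. The pass rates are specific to this collection and would need to be re-estimated for other reprogramming protocols or biobanks.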

Methods
Inclusion of subjects in the ForIPS resource. The ForIPS research consortium (http://forips.med.fau.de/) has established an institutional iPSC biobank resource to explore diseases of the brain, particularly PD. All reported iPSC lines with adequate consent have been registered in hPSCreg [54]. Requests to exchange selected lines for research purposes will be considered individually by the scientific board of the UKER biobank.
Twenty-three individuals were recruited at the Department of Molecular Neurology (Universitätsklinikum Erlangen). All individuals were phenotypically examined by a clinician experienced with neurological diseases. PD patients were diagnosed by board-examined movement disorder specialists according to consensus criteria of the German Society of Neurology, which are similar to the UK PD Society Brain Bank criteria for diagnosis of PD [55]. Age at tissue donation, gender, ethnicity and family history were assessed. All participants gave written informed consent to the study prior to donating a skin biopsy from a typically sun-unexposed area of the inner upper arm. From this biopsy, a fibroblast stock culture was created. Four individuals additionally donated PBLs for an independent germline DNA sample (Fig. 1A). Symptomatic individuals had targeted genetic testing to exclude or confirm monogenic forms of PD, HSP and ID (see Supplementary information). Study approval including all iPSC procedures was granted by the local ethics committees (No. 4485 and 4120, FAU Erlangen-Nuernberg, Germany; and No StV I 1/09 Canton of Zurich) and all participants or their legal guardians gave written informed consent prior to inclusion into the study. All related experiments and methods were performed in accordance with relevant guidelines and regulations.
Reprogramming, differentiation, culture conditions and genetic QC. Detailed methods for the generation of iPSCs, differentiation of NPCs, cell culture conditions and the genetic QC analyses are described in the Supplementary information.

Data Availability
The consent and ethics approval for the ForIPS study do not cover the deposition of identifiable germline genomic data of study participants into public repositories. We follow the DFG (German Research Foundation) recommendations for safeguarding good scientific practice and thus internally archive all data for this study. We provide file checksums for all primary array and sequencing data (File S2). These data shall be accessible upon any legitimate request to the corresponding author (A.Re.). With future consent updates, we plan to submit these genetic data to public repositories.