The Molecular Karyotype of 25 Clinical-Grade Human Embryonic Stem Cell Lines

The application of human embryonic stem cell (hESC) derivatives to regenerative medicine is now becoming a reality. Although the vast majority of hESC lines have been derived for research purposes only, about 50 lines have been established under Good Manufacturing Practice (GMP) conditions. Cell types differentiated from these designated lines may be used as a cell therapy to treat macular degeneration, Parkinson’s, Huntington’s, diabetes, osteoarthritis and other degenerative conditions. It is essential to know the genetic stability of the hESC lines before progressing to clinical trials. We evaluated the molecular karyotype of 25 clinical-grade hESC lines by whole-genome single nucleotide polymorphism (SNP) array analysis. A total of 15 unique copy number variations (CNVs) greater than 100 kb were detected, most of which were found to be naturally occurring in the human population and none were associated with culture adaptation. In addition, three copy-neutral loss of heterozygosity (CN-LOH) regions greater than 1 Mb were observed and all were relatively small and interstitial suggesting they did not arise in culture. The large number of available clinical-grade hESC lines with defined molecular karyotypes provides a substantial starting platform from which the development of pre-clinical and clinical trials in regenerative medicine can be realised.

Table 1. List of 38 EUTCD-compliant (clinical-grade) hESC lines. The 25 hESC lines analysed by SNP analysis are shown in bold. 1 All eight (8) of the KCL hESC lines are listed on the NIH Stem Cell Registry (escr.nih.gov), making them available for NIH-funded projects in the USA.
Scientific Reports | 5:17258 | DOI: 10.1038/srep17258
preimplantation genetic diagnosis 40 . However, the vast majority of blastocysts, including all within the UK, are donated under conditions that prohibit access to parental DNA. An alternative method to determine if a particular CNV observed in a hESC cell line might be parent-of-origin is to compare it to known CNVs present on the Database of Genomic Variants (DGV) 41 . The DGV has catalogued a collection of published CNVs from over 14,300 healthy individuals. Although not exhaustive, the collection is highly curated and covers a significant number of CNVs and other genomic structural variants known to exist in the general population. However, if a hESC CNV is found to be present on this database it does not exclude the possibility that it is a de novo genomic alteration that arose during development of the blastocyst or during establishment and maintenance of the cell line.
As well as CNVs, copy-neutral loss of heterozygosity (CN-LOH) represents another form of genomic structural variation, characterised by a stretch of homozygosity along part of a chromosome 42 . If the affected alleles contain recessive mutations or lie within regions of the genome subject to imprinting, there can be either a net loss or a net gain of gene function and expression 42 . CN-LOH regions can also be due to the presence of persistent ancestral recombination 'cold spots' or be the consequence of recent consanguinity 43,44 . While these changes would be considered parent-of-origin if found in hESC lines, there are examples of somatic or acquired CN-LOH regions found during the progression of many cancers, particularly those of haematopoietic origin 45 . Whole-genome SNP arrays are useful for detecting CN-LOH regions, and regions greater than 1 Mb in length warrant further investigation 42 .
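The idea of a CN-LOH region as a long homozygous stretch without a change in copy number can be illustrated with the B-allele frequency (BAF) and Log R ratio (LRR) signals that SNP arrays produce. The following is a minimal sketch only: the function names, the BAF margin and the LRR tolerance are hypothetical choices made for illustration, the input values are synthetic, and the code is not the algorithm used by any array vendor's software.

```python
def _qualifies(run, min_span_bp, max_abs_lrr):
    """A run qualifies if it is long enough and its mean LRR is close to zero."""
    if len(run) < 2:
        return False
    span = run[-1][0] - run[0][0]
    mean_lrr = sum(lrr for _, _, lrr in run) / len(run)
    return span > min_span_bp and abs(mean_lrr) <= max_abs_lrr


def cn_loh_candidates(snps, min_span_bp=1_000_000, hom_margin=0.15, max_abs_lrr=0.15):
    """Return (start, end) of homozygous, copy-neutral runs on one chromosome.

    snps: list of (position_bp, baf, lrr) tuples sorted by position.
    hom_margin and max_abs_lrr are illustrative assumptions, not published values.
    """
    found, run = [], []
    for snp in list(snps) + [None]:  # None is a sentinel that closes the final run
        is_hom = snp is not None and (snp[1] <= hom_margin or snp[1] >= 1 - hom_margin)
        if is_hom:
            run.append(snp)
            continue
        if _qualifies(run, min_span_bp, max_abs_lrr):
            found.append((run[0][0], run[-1][0]))
        run = []
    return found


def synthetic_chromosome(lrr_in_block):
    """Synthetic data: one SNP every 100 kb, homozygous block from 1.0 to 3.5 Mb."""
    snps = []
    for i in range(51):
        pos = i * 100_000
        if 10 <= i <= 35:
            snps.append((pos, 0.0 if i % 2 else 1.0, lrr_in_block))
        else:
            # Outside the block every third SNP is heterozygous (BAF 0.5).
            snps.append((pos, 0.5 if i % 3 == 0 else 1.0, 0.0))
    return snps


copy_neutral = cn_loh_candidates(synthetic_chromosome(lrr_in_block=0.0))
deletion_like = cn_loh_candidates(synthetic_chromosome(lrr_in_block=-0.5))
print(copy_neutral)
print(deletion_like)
```

The second call shows why the LRR check matters in this sketch: a heterozygous deletion also yields a run of apparently homozygous SNPs, but its reduced LRR means it is not copy-neutral.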
In this study, we sought to examine the genetic integrity of 25 clinical-grade hESC lines utilising whole-genome SNP genotyping analysis. While karyology is sufficient for establishing genetic normalcy within current regulatory standards, advances in technology and an increasing cytogenetic knowledge base demand higher resolution investigation of cell lines and cell products designed for clinical use.

Results and Discussion
Human ES cell lines were cultured in feeder-free conditions prior to the isolation of genomic DNA (Fig. 1). The DNA from 25 hESC lines (Table 1) was assayed for single nucleotide polymorphisms (SNPs) using the Illumina HumanCytoSNP-12 array, and the data were analysed for large CNVs with GenomeStudio and KaryoStudio software. Our SNP analysis of 25 clinical-grade hESC lines found 15 unique CNVs greater than 100 kb and 3 CN-LOH regions greater than 1 Mb in size among 16 hESC lines, with results summarised in Table 2. Nine clinical-grade hESC lines did not harbour any structural genomic variants of this size. The percentage of cell lines we found to have CNVs greater than 100 kb (56%) is in agreement with the percentage of healthy individuals (~70%) in the population found to harbour CNVs of at least this size 39 . Additionally, the percentage of cell lines with CN-LOH events between 2.5 and 5 Mb (12%) is similar to the percentage of individuals in outbred populations with CN-LOH of this size range 43 . Since we have restricted our search to large structural changes, we are reporting a considerably lower number of CNVs and CN-LOH regions than were identified in other studies that have examined hESC or hiPSC lines 33,36,46 . Approximately 5%-10% of the normal human genome contains CNVs averaging a few kilobases in length 47 , and high-resolution arrays can produce large data sets dominated by such naturally occurring small events 48,49 . Thus, we chose to use the HumanCytoSNP-12 array and KaryoStudio software, tailored to identify CNVs greater than 75 kb and CN-LOH regions greater than 1 Mb, a resolution adequate for common molecular cytogenetic interpretation and applicable in a clinically relevant setting 42,49-51 .
Of the 15 unique CNVs detected, 12 were heterozygous duplications, 2 were heterozygous deletions, and 1 was a homozygous deletion (Table 2). We asked whether these structural genomic variants were likely to be parent-of-origin CNVs, that is, naturally occurring, or if they could have arisen during the hESC derivation process or during expansion in culture. We first checked each CNV on the DGV (http://dgv.tcag.ca) to determine whether the CNVs have been previously observed in healthy individuals 41,52 . Amongst the 15 large hESC CNVs, we found 10 had clear evidence of being present in healthy individuals. For example, a duplication of 267 kb on chromosome 6q27 observed in MasterShef3, containing 3 protein-encoding genes (MLLT4, KIF25 and FRMD1), was represented on the DGV and has been reported in the healthy population at a frequency of over 1 in 50 individuals (Fig. 2A) 48,51,53 . RC17 hESCs harboured a single 144 kb duplication on chromosome 12p13.31 encompassing the SLC2A14 and SLC2A3 genes (Fig. 2B). Although this is close to the NANOG locus, we do not believe it confers a growth advantage since this CNV is commonly found (1 in 25) in healthy individuals 39,51,54 . A 132 kb duplication on chromosome 12p11.21 was detected in both KCL033 and KCL040 hESC lines (Supplementary Fig. S1). This region does not contain any protein-coding genes, and there are at least 14 submissions of this duplication on the DGV 39,51,53-55 . Man11, MasterShef2, and MasterShef7 also harboured genomic duplications of greater than 100 kb that are represented on the DGV (Supplementary Fig. S1). Man11 harboured a 220 kb gain on chromosome 15q25.3 that has been reported several times 39,51,56 . This duplication contains one gene, AKAP13, and this CNV was not found in the sibling hESC line, Man12. One of the CNVs detected in MasterShef7, a 315 kb duplication present on chromosome 14q21.3, contains a single gene, MDGA2, and this CNV is also present on the DGV 51 . A 572 kb gain on chromosome 17q21.31 encompassing 5 genes in MasterShef2 was also found to be present in the normal population at high frequency (9.8%) 53 .
Both heterozygous deletions and the homozygous deletion were found to be naturally occurring in the human population. Two unrelated cell lines, KCL031 and RC9, harboured the same 120 kb deletion on chromosome 8q24.23 (Fig. 3). This CNV is estimated to have a frequency of about 1 in 26 people and has been reported numerous times in healthy individuals 39,48,51,53,55,56 . This region does not contain any protein-coding genes. KCL040 and MasterShef11 also possessed genomic deletions greater than 100 kb that were present on the DGV (Supplementary Fig. S2). KCL040 harbours a previously reported 1.5 Mb deletion on chromosome 16p11.2 that contains 3 related genes 56,57 . MasterShef11 has a 109 kb homozygous deletion on chromosome 19p12 that has been widely reported to occur in healthy individuals (~1 in 9) and does not contain any genes 39,48,51,55,56 .
We identified a novel 2.4 Mb gain on chromosome 5p14.3, containing a single gene, Cadherin-18, that was present in two sibling cell lines, KCL032 and KCL033 (Fig. 4), but not in KCL034, a third sibling line 20,58 . A duplication of this size has not been reported to date, but its presence in two sibling hESC lines strongly suggests it was inherited from one of the parents of the donated blastocysts rather than by acquisition of an identical CNV during hESC derivation and culture.
The remaining 5 CNVs, all duplications, were not fully represented on the DGV. For example, a 516 kb duplication was detected on chromosome 16 in MasterShef7 hESCs that encompassed over 20 genes (Fig. 5). A similarly sized duplication of this region has not been reported to date, but the DGV is not exhaustive and this CNV may represent a novel, but rare naturally-occurring genomic variant. This duplication is not known to confer a selective growth advantage, and has not been reported to be associated with hESC culture adaptation 36 . We also checked this CNV on the DECIPHER database 59 of microdeletion and microduplication clinical syndromes to determine if it was associated with a known disorder. This CNV was not associated with a described clinical syndrome, nor were any of the other 14 CNVs identified here.
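What it means for a CNV to be "fully represented" in a reference catalogue can be made concrete as an interval-coverage check. The sketch below is illustrative only: the `Interval` type, the `coverage_fraction` and `best_coverage` helpers and all coordinates are hypothetical, and are not taken from the DGV or from this study's data. The study does not state a numerical overlap criterion, so none is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # inclusive, in bp
    end: int    # exclusive, in bp
    kind: str   # "gain" or "loss"


def coverage_fraction(query: Interval, catalogued: Interval) -> float:
    """Fraction of `query` covered by `catalogued`.

    Returns 0.0 when the chromosome or the variant type differs, so a
    catalogued deletion never counts as evidence for an observed duplication.
    """
    if query.chrom != catalogued.chrom or query.kind != catalogued.kind:
        return 0.0
    overlap = min(query.end, catalogued.end) - max(query.start, catalogued.start)
    return max(overlap, 0) / (query.end - query.start)


def best_coverage(query: Interval, catalogue: list) -> float:
    """Largest coverage of `query` by any single catalogued variant."""
    return max((coverage_fraction(query, c) for c in catalogue), default=0.0)


# Hypothetical coordinates: a 516 kb gain and two catalogued variants.
observed = Interval("chr16", 1_000_000, 1_516_000, "gain")
catalogue = [
    Interval("chr16", 1_100_000, 1_229_000, "gain"),  # smaller gain inside the region
    Interval("chr16", 1_000_000, 1_200_000, "loss"),  # same region, opposite type
]

print(f"{best_coverage(observed, catalogue):.2f}")
```

A coverage of 1.0 would correspond to a fully represented variant; a smaller value corresponds to the situation described above, where only smaller variants of the same type have been catalogued in the region.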
The four other unique CNVs that were not fully represented on the DGV were present in KCL034, KCL037, MasterShef3, and Shef6 hESC lines (Supplementary Fig. S3). KCL034 harboured a duplication of a 331 kb region on chromosome 6 (chr6p22.1) that contained part of the Histone 1 gene cluster. While this region was not fully present on the DGV, it is probable that this gain represents a benign event, as other histone clusters have been shown to be preferentially duplicated during evolution 60 . A 542 kb gain on chromosome 18q23 in KCL037 containing two coding genes, SALL3 and ATP9B, has not been previously reported. However, a smaller duplication covering the same two genes has been observed 51 . A 235 kb duplication was detected on chromosome 17 in MasterShef3 hESCs. Although a duplication of this size is not present on the DGV, four slightly smaller duplications of the region have been reported 39,51,53,55 . A 553 kb gain on chromosome 8p22 in Shef6 hESCs within an intron of the SGCZ gene is an unreported novel structural duplication, but a deletion of this region has been observed 51 . While the 5 novel CNVs detected in our study were not fully present on the DGV, they were also not on the 'ESC-associated' culture-adaptation list of CNVs from the International Stem Cell Initiative survey of 125 different hESC and hiPSC lines 36 . Based on the available evidence, these CNVs likely represent novel, but rare, structural variants found in the human population. However, we do not know the health status of the individuals that may harbour these novel CNVs, so we cannot assume they are benign. Furthermore, we cannot definitively exclude the possibility that these CNVs arose during blastocyst development or during the early stages of hESC line derivation.
In addition to the 15 CNVs identified, our analysis detected 3 regions of CN-LOH greater than 1 Mb among three different hESC lines, KCL040, MasterShef5, and RC11 (Table 2 61,62 , the double recombination event required to achieve this would be difficult to explain for sizes less than 25 Mb 63,64 . Since hESCs are known to maintain monoallelic expression in some imprinted regions 65 , we cross-referenced the 3 CN-LOH regions we identified against the Genomic Imprinting database (http://www.geneimprint.com) 66 . None of the CN-LOH regions reside in known imprinted regions, although CCDC85A in the CN-LOH of RC11 is predicted, but not validated, to be an imprinted gene on this database.
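The counts reported in this section can be cross-checked with a short tally. The sketch below encodes only the cell-line assignments stated in the text above, not Table 2 itself. The locus labels are shorthand (chromosome number only where the text gives no band for the novel duplications), and treating the 8q24.23 and 16p11.2 deletions as heterozygous is an inference from the statement that only the 19p12 deletion was homozygous.

```python
from collections import Counter

TOTAL_LINES = 25  # number of clinical-grade hESC lines analysed

# (locus label, type, cell lines carrying the CNV), as described in the text.
cnvs = [
    ("6q27", "dup", ["MasterShef3"]),
    ("12p13.31", "dup", ["RC17"]),
    ("12p11.21", "dup", ["KCL033", "KCL040"]),
    ("15q25.3", "dup", ["Man11"]),
    ("14q21.3", "dup", ["MasterShef7"]),
    ("17q21.31", "dup", ["MasterShef2"]),
    ("8q24.23", "het_del", ["KCL031", "RC9"]),   # zygosity inferred, see lead-in
    ("16p11.2", "het_del", ["KCL040"]),          # zygosity inferred, see lead-in
    ("19p12", "hom_del", ["MasterShef11"]),
    ("5p14.3", "dup", ["KCL032", "KCL033"]),
    ("16", "dup", ["MasterShef7"]),
    ("6", "dup", ["KCL034"]),
    ("18q23", "dup", ["KCL037"]),
    ("17", "dup", ["MasterShef3"]),
    ("8p22", "dup", ["Shef6"]),
]
cn_loh_lines = {"KCL040", "MasterShef5", "RC11"}

type_counts = Counter(kind for _, kind, _ in cnvs)
lines_with_cnv = {line for _, _, lines in cnvs for line in lines}
lines_with_any = lines_with_cnv | cn_loh_lines
lines_without = TOTAL_LINES - len(lines_with_any)
cn_loh_percent = 100 * len(cn_loh_lines) / TOTAL_LINES

print("unique CNVs:", len(cnvs), dict(type_counts))
print("lines with any variant:", len(lines_with_any), "without:", lines_without)
print(f"lines with CN-LOH: {cn_loh_percent:.0f}%")
```

The tally reproduces the 15 unique CNVs (12 duplications, 2 heterozygous deletions, 1 homozygous deletion), the 16 lines carrying at least one large variant, the 9 lines without any, and the 12% of lines with a CN-LOH region.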
While none of the CNV and CN-LOH regions observed in the hESCs appear to harbour genomic anomalies associated with culture adaptation at the passages reported here (Table 2), we have detected the culture-adapted microduplication on chromosome 20q11.21 at higher passages of 4 clinical-grade hESC lines. This microduplication has been reported in ~20% of research-grade hESC lines 36,37 . Thus, it is prudent to re-evaluate each cell line frequently, and certainly before the production of any differentiated cell product 67 . These results also illustrate that the heterogeneity of molecular karyotypes in the human population will be reflected in the cell lines produced from human embryos. A perfect genome is unlikely to exist, so an appreciation of human genomic diversity will lend itself to a more measured interpretation of molecular karyotype and genome sequencing data of cell lines destined for clinical use.
Our molecular karyotypic evaluation of 25 clinical-grade hESC lines has established a valuable platform for the development and manufacture of cell therapy products for clinical application in regenerative medicine.
Most of the cell samples were provided as frozen cell pellets, which were directly processed for genomic DNA isolation. However, Man11 and Man12, sibling hESC lines, were provided as cryopreserved hESC lines, and Shef6 was obtained directly from the UK Stem Cell Bank (UKSCB Accession No. R-05-031). These three lines were thawed and cultured in Essential 8 media (Life Technologies) on Laminin-521 substrate (Biolamina) for fewer than 10 passages before pelleting by centrifugation for genomic DNA isolation. Human ES cell morphology and expression of the pluripotency marker NANOG were maintained during this expansion (Fig. 1). NANOG immunostaining was performed with anti-NANOG antibody (1:500) from R&D Systems (cat no. AF1997).

Figure 5. A novel duplication observed in MasterShef7.
A 516 kb duplication on chromosome 16p11.2 was detected in MasterShef7 by SNP array analysis. This region contained 26 protein-coding genes, and a duplication of this size has not been reported to date. Smaller duplications have been observed in this region, including a duplication (esv2758642) spanning 7 of the genes. Only published CNVs greater than 10 kb are represented here.
Isolation of Genomic DNA. Genomic DNA was isolated from cell pellets using the MasterPure™ Complete DNA and RNA Purification Kit (Epicentre) according to the manufacturer's instructions. Briefly, cell pellets were lysed with Tissue and Cell Lysis Solution, followed by Proteinase K and RNase A treatment. Proteins were precipitated with MPC Protein Precipitation Reagent and removed by centrifugation. Genomic DNA was precipitated with isopropanol, pelleted by centrifugation and then resuspended in TE buffer to a final concentration of 50 ng/μl. Purity was checked by spectrophotometry using the NanoDrop 1000 spectrophotometer (Thermo Fisher Scientific).
HLA typing. All samples were subjected to HLA typing at the Histocompatibility and Immunogenetics (H&I) Laboratory of the Scottish National Blood Transfusion Service (SNBTS). The HLA typing data (Supplementary Table S1) was used for identification purposes by comparing it to the HLA typing data for each cell line provided by the hESC derivation centres. The H&I Laboratory at SNBTS is accredited through Clinical Pathology Accreditation (UK) Ltd (CPA), and all CPA labs are subjected to UK National External Quality Assessment Schemes (UK NEQAS).
Genotyping and Analysis. Genomic DNA samples were assayed using the Illumina HumanCytoSNP-12 v2.1 BeadChip, at either AROS (Aarhus, Denmark) or the Wellcome Trust Clinical Research Facility (Edinburgh, UK). The data have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO Series accession number GSE68508. Genotyping was initially assessed using the GenomeStudio genotyping module (v1.94, Illumina). KaryoStudio (v1.4, Illumina) was employed to perform automatic normalisation and to identify genomic aberrations, utilising default settings of the built-in cnvPartition algorithm (3.07, Illumina) to generate B-allele frequency and smoothed Log R ratio plots for detected regions. These parameters are designed to detect CNVs greater than 75 kb and CN-LOH regions larger than 1 Mb with a confidence value greater than 35. All identified duplications and deletions were first cross-matched to the Database of Genomic Variants (DGV; http://dgv.tcag.ca) to identify naturally occurring structural variations in the human genome 41,52 . We also determined the  Table S2) by accessing the DGV Gold Standard track of a highly curated and accurate CNV map of the human genome 47 . All CNVs were entered into the DECIPHER database (https://decipher.sanger.ac.uk/) to determine if they were associated with any clinical syndromes 59 . The CN-LOH regions were cross-referenced with the Genomic Imprinting database (http://www.geneimprint.com) to determine if the genomic variants occurred in known imprinted regions 66 . CNVs that were not identified on the DGV were then checked against a list of ES cell-associated culture adaptation genomic variants published by the International Stem Cell Initiative 36 .
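The size and confidence thresholds described above can be expressed as a simple filter over a list of calls. The sketch below is a hypothetical post-processing step, not the KaryoStudio or cnvPartition interface: the `Call` record, its field names and the example calls are invented for illustration. Only the numerical thresholds (confidence greater than 35, reported CNVs greater than 100 kb, CN-LOH regions greater than 1 Mb) come from the text.

```python
from dataclasses import dataclass

KB = 1_000
MB = 1_000_000

# Thresholds taken from the text; everything else in this sketch is illustrative.
MIN_CONFIDENCE = 35
MIN_CNV_REPORT_BP = 100 * KB  # reporting threshold used for CNVs in this study
MIN_CNLOH_BP = 1 * MB         # threshold used for CN-LOH regions


@dataclass(frozen=True)
class Call:
    sample: str
    kind: str         # "gain", "loss" or "cn_loh"
    start: int        # bp
    end: int          # bp
    confidence: float

    @property
    def length(self) -> int:
        return self.end - self.start


def is_reportable(call: Call) -> bool:
    """Keep calls above the confidence cut-off and the type-specific size cut-off."""
    if call.confidence <= MIN_CONFIDENCE:
        return False
    min_len = MIN_CNLOH_BP if call.kind == "cn_loh" else MIN_CNV_REPORT_BP
    return call.length > min_len


# Hypothetical calls, chosen to exercise each rule once.
calls = [
    Call("lineA", "gain", 0, 267 * KB, 120.0),   # kept
    Call("lineA", "loss", 0, 80 * KB, 90.0),     # dropped: not above 100 kb
    Call("lineB", "cn_loh", 0, 3 * MB, 50.0),    # kept
    Call("lineB", "cn_loh", 0, 900 * KB, 50.0),  # dropped: not above 1 Mb
    Call("lineC", "gain", 0, 500 * KB, 20.0),    # dropped: low confidence
]

reportable = [c for c in calls if is_reportable(c)]
print([(c.sample, c.kind, c.length) for c in reportable])
```

Keeping the thresholds as named constants makes it straightforward to rerun the same filter at a different resolution, for example at the 75 kb detection limit rather than the 100 kb reporting limit.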