Technology utilizing human induced pluripotent stem cells (iPS cells) has enormous potential to provide improved cellular models of human disease. However, variable genetic and phenotypic characterization of many existing iPS cell lines limits their potential use for research and therapy. Here we describe the systematic generation, genotyping and phenotyping of 711 iPS cell lines derived from 301 healthy individuals by the Human Induced Pluripotent Stem Cells Initiative. Our study outlines the major sources of genetic and phenotypic variation in iPS cells and establishes their suitability as models of complex human traits and cancer. Through genome-wide profiling we find that 5–46% of the variation in different iPS cell phenotypes, including differentiation capacity and cellular morphology, arises from differences between individuals. We also assess the phenotypic consequences of genomic copy-number alterations that are repeatedly observed in iPS cells. Finally, we present a comprehensive map of common regulatory variants affecting the transcriptome of human pluripotent cells.
Sterneckert, J. L., Reinhardt, P. & Schöler, H. R. Investigating human disease using stem cell models. Nat. Rev. Genet. 15, 625–639 (2014)
Kim, K. et al. Epigenetic memory in induced pluripotent stem cells. Nature 467, 285–290 (2010)
Kim, K. et al. Donor cell type can influence the epigenome and differentiation potential of human induced pluripotent stem cells. Nat. Biotechnol. 29, 1117–1119 (2011)
Lister, R. et al. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 471, 68–73 (2011)
Nazor, K. L. et al. Recurrent variations in DNA methylation in human pluripotent stem cells and their differentiated derivatives. Cell Stem Cell 10, 620–634 (2012)
Rouhani, F. et al. Genetic background drives transcriptional variation in human induced pluripotent stem cells. PLoS Genet. 10, e1004432 (2014)
Burrows, C. K. et al. Genetic variation, not cell type of origin, underlies the majority of identifiable regulatory differences in iPSCs. PLoS Genet. 12, e1005793 (2016)
Vallier, L. et al. Signaling pathways controlling pluripotency and early cell fate decisions of human induced pluripotent stem cells. Stem Cells 27, 2655–2666 (2009)
Müller, F. J. et al. A bioinformatic assay for pluripotency in human cells. Nat. Methods 8, 315–317 (2011)
Danecek, P., McCarthy, S. A., HipSci Consortium & Durbin, R. A method for checking genomic integrity in cultured cell lines from SNP genotyping data. PLoS ONE 11, e0155014 (2016)
Laurent, L. C. et al. Dynamic changes in the copy number of pluripotency and cell proliferation genes in human ESCs and iPSCs during reprogramming and time in culture. Cell Stem Cell 8, 106–118 (2011)
The International Stem Cell Initiative. Screening ethnically diverse human embryonic stem cells identifies a chromosome 20 minimal amplicon conferring growth advantage. Nat. Biotechnol. 29, 1132–1144 (2011)
Abyzov, A. et al. Somatic copy number mosaicism in human skin revealed by induced pluripotent stem cells. Nature 492, 438–442 (2012)
Mayshar, Y. et al. Identification and classification of chromosomal aberrations in human induced pluripotent stem cells. Cell Stem Cell 7, 521–531 (2010)
Taapken, S. M. et al. Karotypic abnormalities in human induced pluripotent stem cells and embryonic stem cells. Nat. Biotechnol. 29, 313–314 (2011)
Hussein, S. M. et al. Copy number variation and selection during reprogramming to pluripotency. Nature 471, 58–62 (2011)
Laurin, M. & Côté, J. F. Insights into the biological functions of Dock family guanine nucleotide exchange factors. Genes Dev. 28, 533–547 (2014)
Zhang, X. et al. FATS is a transcriptional target of p53 and associated with antitumor activity. Mol. Cancer 9, 244 (2010)
Lo, J. Y., Chou, Y. T., Lai, F. J. & Hsu, L. J. Regulation of cell signaling and apoptosis by tumor suppressor WWOX. Exp. Biol. Med. 240, 383–391 (2015)
Futreal, P. A. et al. A census of human cancer genes. Nat. Rev. Cancer 4, 177–183 (2004)
Duckett, C. S. et al. A conserved family of cellular genes related to the baculovirus iap gene and encoding apoptosis inhibitors. EMBO J. 15, 2685–2694 (1996)
Chia, N. Y. et al. A genome-wide RNAi screen reveals determinants of human embryonic stem cell identity. Nature 468, 316–320 (2010)
Belinky, F. et al. PathCards: multi-source consolidation of human biological pathways. Database (Oxford) 2015, bav006 (2015)
GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015)
Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat. Genet. 44, 1084–1089 (2012)
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015)
Xu, H. et al. ESCAPE: database for integrating high-content published data collected from human and mouse embryonic stem cells. Database (Oxford) 2013, bat045 (2013)
Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014)
Dubois, P. C. et al. Multiple common variants for celiac disease influencing immune gene expression. Nat. Genet. 42, 295–302 (2010)
Stranger, B. E. et al. Population genomics of human gene expression. Nat. Genet. 39, 1217–1224 (2007)
Zeller, T. et al. Genetics and beyond—the transcriptome of human monocytes and disease susceptibility. PLoS One 5, e10693 (2010)
Purrington, K. S. et al. Genome-wide association study identifies 25 known breast cancer susceptibility loci as risk factors for triple-negative breast cancer. Carcinogenesis 35, 1012–1019 (2014)
Garcia-Closas, M. et al. Genome-wide association studies identify four ER negative-specific breast cancer risk loci. Nat. Genet. 45, 392–398 (2013)
Wang, Z. et al. Imputation and subset-based association analysis across different cancer types identifies multiple independent risk loci in the TERT–CLPTM1L region on chromosome 5p15.33. Hum. Mol. Genet. 23, 6616–6633 (2014)
Li, Q. et al. Integrative eQTL-based analyses reveal the biology of breast cancer risk loci. Cell 152, 633–641 (2013)
Ongen, H. et al. Putative cis-regulatory drivers in colorectal cancer. Nature 512, 87–90 (2014)
Chen, Q. R., Hu, Y., Yan, C., Buetow, K. & Meerzaman, D. Systematic genetic analysis identifies cis-eQTL target genes associated with glioblastoma patient survival. PLoS One 9, e105393 (2014)
Bojesen, S. E. et al. Multiple independent variants at the TERT locus are associated with telomere length and risks of breast and ovarian cancer. Nat. Genet. 45, 371–384 (2013)
Chiba, K. et al. Cancer-associated TERT promoter mutations abrogate telomerase silencing. eLife 4, e07918 (2015)
Kyttälä, A. et al. Genetic variability overrides the impact of parental cell type and determines iPSC differentiation potential. Stem Cell Reports 6, 200–212 (2016)
Kajiwara, M. et al. Donor-dependent variations in hepatic differentiation from human-induced pluripotent stem cells. Proc. Natl Acad. Sci. USA 109, 12538–12543 (2012)
Choi, J. et al. A comparison of genetically matched cell lines reveals the equivalence of human iPSCs and ESCs. Nat. Biotechnol. 33, 1173–1181 (2015)
Gerrits, A. et al. Expression quantitative trait loci are highly sensitive to cellular differentiation state. PLoS Genet. 5, e1000692 (2009)
Spies, N. et al. Constraint and divergence of global gene expression in the mammalian embryo. eLife 4, e05538 (2015)
Cannavò, E. et al. Genetic variants regulating expression levels and isoform diversity during embryogenesis. Nature 541, 402–406 (2017)
Carcamo-Orive, L. et al. Analysis of transcriptional variability in large human iPSC library reveals genetic and non-genetic determinants of heterogeneity. Cell Stem Cell 20, 518–532 (2017)
DeBoever, C. et al. Large-scale profiling reveals the influence of genetic variation on gene expression in human induced pluripotent stem cells. Cell Stem Cell 20, 533–546 (2017)
Kim, N. W. et al. Specific association of human telomerase activity with immortal cells and cancer. Science 266, 2011–2015 (1994)
Kelly, L. M. & Gilliland, D. G. Genetics of myeloid leukemias. Annu. Rev. Genomics Hum. Genet. 3, 179–198 (2002)
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012)
Takahashi, K. & Yamanaka, S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126, 663–676 (2006)
The UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015)
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015)
Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009)
Delaneau, O., Marchini, J. & Zagury, J. F. A linear complexity phasing method for thousands of genomes. Nat. Methods 9, 179–181 (2011)
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009)
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012)
Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A. & Vingron, M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 (Suppl 1), S96–S104 (2002)
Vallier, L. et al. Early cell fate decisions of human embryonic stem cells and mouse epiblast stem cells are controlled by the same signalling pathways. PLoS ONE 4, e6082 (2009)
Ly, T. et al. A proteomic chronology of gene expression through the cell cycle in human myeloid leukemia cells. eLife 3, e01630 (2014)
Bensaddek, D. et al. Micro-proteomics with iterative data analysis: proteome analysis in C. elegans at the single worm level. Proteomics 16, 381–392 (2016)
Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008)
Aryee, M. J. et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30, 1363–1369 (2014)
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013)
Anders, S., Pyl, P. T. & Huber, W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015)
Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010)
DeLuca, D. S. et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics 28, 1530–1532 (2012)
Leha, A. et al. A high-content platform to characterise human induced pluripotent stem cell lines. Methods 96, 85–96 (2016)
Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nat. Genet. 45, 1134–1140 (2013)
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005)
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011)
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015)
Stegle, O., Parts, L., Durbin, R. & Winn, J. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLOS Comput. Biol. 6, e1000770 (2010)
Lippert, C., Casale, F. P., Rakitsch, B. & Stegle, O. LIMIX: genetic analysis of multiple traits. Preprint at http://biorxiv.org/content/early/2014/05/21/003905 (2014)
Casale, F. P., Rakitsch, B., Lippert, C. & Stegle, O. Efficient set tests for the genetic analysis of correlated traits. Nat. Methods 12, 755–758 (2015)
Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA 100, 9440–9445 (2003)
Pers, T. H., Timshel, P. & Hirschhorn, J. N. SNPsnap: a web-based tool for identification and annotation of matched SNPs. Bioinformatics 31, 418–420 (2015)
Lawrence, M., Gentleman, R. & Carey, V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25, 1841–1842 (2009)
Farh, K. K. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015)
Lambert, J. C. et al. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat. Genet. 45, 1452–1458 (2013)
Trynka, G. et al. Dense genotyping identifies and localizes multiple common and rare variant association signals in celiac disease. Nat. Genet. 43, 1193–1201 (2011)
Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015)
International Multiple Sclerosis Genetics Consortium (IMSGC). Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat. Genet. 45, 1353–1360 (2013)
Faraco, J. et al. ImmunoChip study implicates antigen presentation to T cells in narcolepsy. PLoS Genet. 9, e1003270 (2013)
Cordell, H. J. et al. International genome-wide meta-analysis identifies new primary biliary cirrhosis risk loci and targetable pathogenic pathways. Nat. Commun. 6, 8019 (2015)
Tsoi, L. C. et al. Identification of 15 new psoriasis susceptibility loci highlights the role of innate immunity. Nat. Genet. 44, 1341–1348 (2012)
Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014)
Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014)
Bentham, J. et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nat. Genet. 47, 1457–1464 (2015)
Onengut-Gumuscu, S. et al. Fine mapping of type 1 diabetes susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers. Nat. Genet. 47, 381–386 (2015)
Morris, A. P. et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44, 981–990 (2012)
Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600–604 (2016)
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010)
This work was funded with a strategic award from the Wellcome Trust and UK Medical Research Council (WT098503). We thank the staff in the Cellular Genetics and Phenotyping and Sequencing core facilities at the Wellcome Trust Sanger Institute. Work at the Wellcome Trust Sanger Institute was further supported by Wellcome Trust grant WT090851. H.K. is supported by an MRC eMedLab Medical Bioinformatics career development award from the UK Medical Research Council (MR/L016311/1). F.M.W. acknowledges financial support from the Department of Health via the NIHR Biomedical Research Centre award to Guy’s & St Thomas’ National Health Service Foundation Trust in partnership with King’s College London and King’s College Hospital NHS Foundation Trust. We acknowledge the participation of all NIHR Cambridge BioResource volunteers, and thank the NIHR Cambridge BioResource centre staff for their contribution. We thank the National Institute for Health Research and NHS Blood and Transplant. The NIHR/Wellcome Trust Cambridge Clinical Research Facility supported the volunteer recruitment. We acknowledge Life Science Technologies Corporation as the provider of Cytotune. We thank F.-J. Müller for insights regarding the PluriTest method, and the GTEx consortium for making raw data and intermediate results available.
The authors declare no competing financial interests.
Reviewer Information: Nature thanks E. Dermitzakis, S. Montgomery and the other anonymous reviewer(s) for their contribution to the peer review of this work.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
a, Example plate layout for the cellular differentiation assay. Images are shown for the pluripotency markers (OCT4, SOX2 and NANOG) as they are measured in the Cellomics imaging device. Each line is measured in two rows of the same plate as technical replicates. The secondary antibody used for each marker is shown in parentheses (aG, anti-goat antibody; aM, anti-mouse antibody). Each plate also has measurements for staining with the secondary antibody only, which serves as a means to assess background fluorescence. The red channel shows the signal from the DAPI staining, the green channel the marker signal. As expected, there is only a small signal from the green channel in the wells stained for the secondary antibody only. Image acquisition stops as soon as 10,000 cells have been detected. b, Detailed variance components of the Cellomics markers (Methods). Substantial proportions of the marker variance could be attributed to batch factors, including staining, technician effects and antibody lots. These effects mean that the fraction of cells expressing particular markers needs to be interpreted with caution (Fig. 1c, d). c, Pairwise Pearson’s correlation (r) between quantitative expression scores derived from immunostaining for pluripotency and differentiation and the PluriTest score (P values from a Student’s t-test).
a–c, Comparison of PluriTest novelty score versus pluripotency score for the 711 lines generated. Lines grown under feeder-free conditions (E8 medium) scored systematically lower than feeder-dependent lines (P = 1.62 × 10⁻⁴³; t-test, for pluripotency score). We note that, while we cannot rule out that feeder-free lines are less pluripotent, feeder-free conditions are not well represented in the PluriTest training dataset, which may explain this result (of the 204 ES cell/iPS cell lines in the PluriTest paper that have medium metadata available, none were cultured in E8 and only 37 were cultured in a variety of other feeder-free formulations such as mTeSR). d, Despite lower pluripotency scores, lines grown under feeder-free conditions have higher fractions of cells expressing canonical protein markers of pluripotency.
Relationship between the number of CNAs and other experimental factors, using three minimum length thresholds for calling CNAs: 200 kb, 500 kb and 1,000 kb. Values on the x axis have been ‘jittered’ (that is, small random ‘noise’ has been added to the true values) to enhance the visualization. Data points underlying the box plots are shown as semi-transparent blue dots. a, Number of CNAs per line versus passage number. P values are shown from a generalized linear mixed model (Poisson regression) with donor random effect. b, Box plot of the number of autosomal CNAs per line versus growth medium. P values are for a Poisson regression for culture condition. c, d, Number of autosomal CNAs per line versus PluriTest pluripotency and novelty scores. P values are for a linear mixed model of the number of autosomal CNAs per line with a random effect for donor. e, f, Number of CNAs per donor versus gender and donor age. CNA counts refer to the total number of unique CNAs across all lines derived from the same donor. CNAs that are shared between lines of the same donor (overlap by at least one base) are counted only once. P values are shown for a Poisson regression for either gender or age.
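The per-donor counting rule in e, f (CNAs shared between lines of the same donor are counted once if they overlap by at least one base) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name count_unique_cnas and the input layout are hypothetical, coordinates are assumed to be inclusive, chains of overlapping calls are assumed to collapse into a single CNA, and CNA type (gain or loss) is ignored.

```python
from collections import defaultdict


def count_unique_cnas(cna_calls):
    """Count unique CNAs per donor.

    cna_calls: iterable of (donor, chrom, start, end) tuples, one per CNA
    call in any line of that donor, with inclusive coordinates (assumption).
    Calls on the same chromosome that overlap by at least one base are
    merged and counted once.
    """
    by_donor_chrom = defaultdict(list)
    for donor, chrom, start, end in cna_calls:
        by_donor_chrom[(donor, chrom)].append((start, end))

    counts = defaultdict(int)
    for (donor, _chrom), intervals in by_donor_chrom.items():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start > current_end:
                # No shared base with the previous merged interval: new CNA.
                counts[donor] += 1
                current_end = end
            else:
                # Overlaps the running interval: extend it, do not recount.
                current_end = max(current_end, end)
    return dict(counts)


calls = [
    ("donor1", "chr1", 100, 200),   # line A
    ("donor1", "chr1", 200, 300),   # line B, shares base 200 with the call above
    ("donor1", "chr1", 301, 400),   # no shared base, counted separately
    ("donor1", "chr2", 100, 200),
    ("donor2", "chr1", 1, 10),
]
print(count_unique_cnas(calls))
```

Merging by sorted start position keeps the count independent of the order in which lines are processed.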
Related to Fig. 2. Top, genomic location versus number of lines with copy number 3 (grey) and with a CNA (black). Bottom, the NAV gene score from ref. 22 and log2 gene expression fold change between the iPS cell lines with copy number 2 and 3 (colour scale), in the region highlighted in red in the top panel. Highlighted genes are those that are upregulated when copy number increases, are known oncogenes or tumour-suppressor genes, and/or have a NAV score in the top 2%.
a–c, Cell growth rate (a), proliferation (b) and apoptosis (c) in cell lines with copy number 2 (wild type, blue dots) or copy number 3 (mutant, red dots) in a recurrently duplicated region in iPS cells on chromosome 1, 17 or 20. Plot titles show the donor name and the genomic coordinates of the CNA. a, Cell counts taken on successive days in culture are shown for pairs of lines (one mutant, one wild type) grown on the same 24-well plates. Asterisks denote significance levels for statistical interactions between day and copy number in a linear mixed model, using fixed effects to fit day and copy number, and random effects to account for culture plate effects. EIF4A3 denotes whether a copy number variant overlaps one of the suspected candidate genes on chromosome 17. *P < 0.05; **P < 0.01; ***P < 0.001. b, Protein expression level measured by TMT-based quantification on the Q Exactive Plus (labelled QE Plus) and Fusion (labelled Fusion) Orbitrap mass spectrometry platforms. c, Estimated fraction of fluorescing nuclei following an EdU assay in mutant and wild-type lines, following exposure to mitomycin (Treated) or in a control sample (Untreated). d, Estimated fraction of fluorescing nuclei following a terminal deoxynucleotidyl transferase dUTP nick end labelling (TUNEL) assay in mutant and wild-type lines, following exposure to mitomycin (Treated) or in a control sample (Untreated). Solid trend lines are least squares regression fits. P values in b and c denote the significance of statistical interactions between copy number and mitomycin treatment condition (treated or untreated).
Extended Data Figure 6 Effect of passage number on tier 1 and tier 2 data and overview of iPS cell cis eQTLs mapped with tier 1 gene expression array data.
a, b, Passage number versus PluriTest pluripotency and novelty scores shows no significant association between passage number and pluripotency. Trend lines are fit using a linear regression of the PluriTest scores and the passage number (score P = 0.66, novelty P = 0.21). Association was also not significant when including gender and medium as fixed effects and batch variables and donor as random effects (score P = 0.3, novelty P = 0.14). c, Passage number versus log10 RNA-seq expression of pluripotency factors NANOG and OCT4 (encoded by POU5F1) shows no significant association between passage number and pluripotency. Trend lines are fit using linear regression of log10 expression and passage number (NANOG, P = 0.5; POU5F1, P = 0.15). Association was also not significant when considering the two genes together and when including gender and medium as fixed effects and batch variables and donor as random effects (passage, P = 0.28; passage–gene interaction, P = 0.96). d, e, Variance component analysis for tier 2 assays, showing that for the majority of genes gender and passage explained little of the total variance. f, g, Comparison of lead variant effect sizes (β2) in gexarray-based eQTL maps. The eQTL maps were derived by using mean expression levels per donor (‘main’ map) and with two sets of individual lines (one per donor), drawn randomly (‘replicate’ maps). The effect sizes for all tested genes are shown in black, with FDR < 5% eGenes from the main map indicated in blue. Effect sizes are compared between the two replicate maps (f, Spearman’s rank correlation ρ = 0.47 genome-wide, ρ = 0.80 eGenes only, both P < 2.2 × 10⁻¹⁶) and between the main map and one replicate map (g, Spearman’s rank correlation ρ = 0.57 genome-wide, ρ = 0.88 eGenes only, both P < 2.2 × 10⁻¹⁶). The effect sizes obtained using the mean expression values per donor are higher than when using individual lines. h, Pairwise correlation between gene expression levels in iPS cells measured with RNA-seq and gexarray. Spearman rank correlation coefficients are shown for read counts based on either gene regions (pink) or gexarray probe regions (blue); probe-based counts show a higher correlation.
a, b, The power to detect eQTLs is plotted, comparing 44 somatic tissues from GTEx (V6p; ref. 24) and the HipSci RNA-seq-based eQTL map (purple triangle), considering either the absolute (a) or relative (b) number of eQTLs identified (eGenes, FDR < 5%). The major determinant of eQTL detection power is sample size. c, Cumulative fraction of RNA-seq reads relative to the number of protein-coding genes expressed. Mean read counts are plotted for 20 iPS cell lines (10 donors, two lines each), five fibroblast lines and two ES cell (ESC) lines. In iPS cells, half of the reads are explained by the expression of 1,071 genes, whereas 75% and 90% of the reads are explained by the expression of 3,159 and 5,814 genes, respectively (total protein-coding genes with non-zero counts n = 17,332). d, Distribution of iPS cell eQTLs around the annotated gene start position. The −log10(eQTL P value) is plotted against the distance (in bp) from the gene start for lead eQTL variants genome-wide, highlighting significant eQTLs (FDR < 5%) in orange. e, Comparison of the magnitude of eQTL effect size (absolute beta; left) and minor allele frequency (MAF; right) between iPS-cell-specific (n = 2,131; labelled as S) and non-specific eQTLs (n = 4,500; labelled as NS), demonstrating that overall, iPS-cell-specific eQTLs have smaller effects on the transcriptome than eQTLs shared among multiple tissues (P = 9.97 × 10⁻¹⁶¹; Wilcoxon rank-sum test) and have a lower minor allele frequency (P = 1.08 × 10⁻³⁵, Wilcoxon rank-sum test).
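The summary statistic quoted for panel c (the number of genes that together account for a given fraction of reads) can be computed as in the following minimal Python sketch. The function name genes_needed and the toy counts are hypothetical, and the sketch assumes one mean read count per protein-coding gene; it is an illustration of the calculation, not the pipeline used in the study.

```python
def genes_needed(read_counts, fraction):
    """Smallest number of genes whose reads sum to at least `fraction`
    of all reads, taking genes in order of decreasing read count.

    read_counts: per-gene (mean) read counts; genes with zero counts
    are ignored, mirroring the restriction to non-zero genes.
    """
    counts = sorted((c for c in read_counts if c > 0), reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads in input")
    target = fraction * total
    running = 0
    for n_genes, count in enumerate(counts, start=1):
        running += count
        if running >= target:
            return n_genes
    return len(counts)


toy_counts = [50, 30, 10, 10, 0]  # hypothetical per-gene counts
print(genes_needed(toy_counts, 0.5), genes_needed(toy_counts, 0.75))
```

Sorting by decreasing count is what makes the cumulative curve rise as steeply as possible, so the returned value is the minimum gene set size for that fraction.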
a, Proportion of tissue-specific eQTLs as a function of the discovery sample size. For iPS cells, the two sets of tissue-specific eQTLs obtained with the two different mapping pipelines (Methods) are shown, namely the standard HipSci pipeline (iPSC; purple triangle) and the alternative ‘GTEx-like’ pipeline (iPSC2; purple triangle). Points other than iPS cells are from the GTEx Consortium (44 somatic tissues and cell lines; ref. 24). b, Heat map of pairwise π1 values (π1 = 1 − π0) between iPS cells and GTEx tissues, with rows representing the discovery tissue and columns the replication tissue. Clustering of tissues is based on Euclidean distance (R hclust, method = average). c, Effect of eQTL replication threshold on the definition of tissue-specific effects. The replication profile of iPS cell eQTLs across GTEx tissues relative to discovery sample size in each replication tissue is shown. The proportion of lead eQTLs from iPS cells that replicate in each tissue is plotted, with replication defined using two different replication thresholds (TH1: nominal eQTL P < 0.01 / ntissues; TH5: P < 0.05 / ntissues; plotted as dots and triangles, respectively). d, Enrichment of alternative iPS cell eQTLs (GTEx-like) at proximal and distal (defined as less than or greater than 2 kb from the transcription start site) transcription factor binding sites of promoters in H1 hES cells from the ENCODE Project (ref. 50). Fold enrichments per factor are shown for iPS-cell-specific and non-specific eQTLs (minimum 10 observed overlaps) (Methods). Pluripotency-associated factors are indicated with an asterisk. The profile of enrichments is comparable to that obtained with the standard HipSci pipeline (Fig. 4d).
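The two replication thresholds in panel c divide a nominal significance level by the number of tissues. A minimal Python sketch of the resulting replication proportion follows; the function name replication_rate, the use of None for variants that could not be tested, and the example P values are all assumptions made for illustration. Estimation of π1 (panel b) is not shown.

```python
def replication_rate(p_values, alpha, n_tissues):
    """Fraction of tested lead eQTLs whose nominal P value in the
    replication tissue falls below alpha / n_tissues.

    p_values: nominal replication P values, with None for eQTLs that
    could not be tested in that tissue (an assumption of this sketch).
    """
    threshold = alpha / n_tissues
    tested = [p for p in p_values if p is not None]
    if not tested:
        return float("nan")
    return sum(p < threshold for p in tested) / len(tested)


# Hypothetical replication P values for five iPS cell lead eQTLs in one tissue.
example_p = [1e-6, 5e-4, 2e-3, 0.2, None]
th1 = replication_rate(example_p, alpha=0.01, n_tissues=44)  # TH1
th5 = replication_rate(example_p, alpha=0.05, n_tissues=44)  # TH5
print(th1, th5)
```

Because TH5 is the looser threshold, its replication proportion is always at least as large as that under TH1, which is why the choice of threshold changes how many eQTLs are labelled tissue-specific.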
a, Cumulative number of cancer genes (COSMIC cancer census 27 April 2016; ngenes = 571; ref. 20) regulated by eQTLs in iPS cells, somatic tissues (GTEx V6p), and three different cancers (ER positive and negative breast cancer, colorectal cancer; refs 32, 33). b, Enrichment of iPS cell and somatic eQTLs (lead variants and their high-linkage-disequilibrium proxies) at disease-associated variants in the NHGRI–EBI GWAS catalogue (10 April 2016). The fold enrichment of eQTLs over 100 random sets of matched variants for each tissue relative to eQTL discovery sample size is shown. The tissues showing the highest fold enrichment are liver and brain (cerebellar hemisphere; BrainCH). c, Somatic eQTL signal for the PTPN2 (protein tyrosine phosphatase, non-receptor type 2) locus on chromosome 18. This locus contains a colocalizing association signal for PTPN2 gene expression in iPS cells and five immunological disease phenotypes (Fig. 5a). d, Somatic eQTL signal for the TERT (telomerase reverse transcriptase) locus on chromosome 5 (Fig. 5b). In both c and d, the lead eQTL variant locations are indicated with red and orange vertical lines for iPS cells and somatic tissues, respectively. The focal gene regions are indicated in solid grey and gene start positions of other protein-coding genes on the same strand with vertical grey lines.
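The fold enrichment in panel b compares the observed number of eQTLs overlapping disease-associated variants with the overlap seen in random sets of matched variants. The Python sketch below shows one plausible form of that calculation. The function name enrichment_over_null is hypothetical, the fold enrichment is assumed to be the observed count divided by the mean null count, and the add-one correction in the empirical P value is a common convention that the text does not confirm.

```python
def enrichment_over_null(observed_overlap, null_overlaps):
    """Fold enrichment and one-sided empirical P value of an observed
    overlap count relative to overlaps from random matched variant sets.

    Assumptions of this sketch: fold = observed / mean(null), and the
    empirical P uses the add-one correction (1 + #null >= obs) / (1 + N).
    """
    if not null_overlaps:
        raise ValueError("need at least one null set")
    mean_null = sum(null_overlaps) / len(null_overlaps)
    fold = observed_overlap / mean_null if mean_null > 0 else float("inf")
    n_as_extreme = sum(n >= observed_overlap for n in null_overlaps)
    empirical_p = (1 + n_as_extreme) / (1 + len(null_overlaps))
    return fold, empirical_p


# Hypothetical counts: 30 observed overlaps versus 100 random matched sets.
null_sets = [10] * 50 + [20] * 50
fold, p_emp = enrichment_over_null(30, null_sets)
print(fold, p_emp)
```

With 100 random sets the smallest attainable empirical P value under this convention is 1/101, so the resolution of the test is set by the number of null sets rather than by the strength of the signal.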
a, b, Normalized RNA-seq per-base coverage across the TERT locus stratified by rs10069690 genotype. The full locus (a) or zoomed view of the region (b) around the lead eQTL and cancer risk variant rs10069690 are shown. rs10069690 is indicated with a dotted line on each plot. Grey regions indicate annotated exons from Ensembl version 75. Coverage was computed from indexed BAM files using the coverageBed function from bedtools (version 2.25.0; ref. 93). Raw coverage was divided by total library size in millions (total number of mapped reads) per sample to obtain normalized coverage, which was then averaged over samples with the same rs10069690 genotype to obtain mean normalized coverage for each genotype group. c, Profile of TERT expression in iPS cells and across somatic tissues from GTEx. The gene FPKM values obtained with RNA-SeQC (GTEx V6p) are shown. d, Splicing-QTL of TERT. We quantified TERT intron retention rates using Leafcutter (ref. 92) and identified one alternative splicing event associated with rs10069690, the lead iPS cell eQTL variant for TERT (Fig. 5b). The TERT intron 4 retention ratio (PSI, per cent spliced in) is shown in iPS cell lines of all individual donors stratified by their genotype at rs10069690. This variant affects the splicing of the intron where it is located, with the minor allele (T) increasing the fraction of TERT transcripts in which intron 4 is retained (P = 1.7 × 10⁻⁹, Bonferroni-adjusted linear regression).
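The normalization described for panels a and b (raw per-base coverage divided by library size in millions of mapped reads, then averaged within each genotype group) can be written as the following minimal Python sketch. The function name and the dictionary-based inputs are hypothetical, the per-base depths are assumed to have been extracted already (for example from coverageBed output), and the genotype labels in the example are invented.

```python
from collections import defaultdict


def mean_normalized_coverage(raw_coverage, mapped_reads, genotypes):
    """Mean library-size-normalized per-base coverage per genotype group.

    raw_coverage: sample -> list of per-base read depths over the locus
    mapped_reads: sample -> total number of mapped reads
    genotypes:    sample -> genotype label at the variant of interest
    All samples are assumed to cover the same positions in the same order.
    """
    sums = {}
    n_samples = defaultdict(int)
    for sample, depths in raw_coverage.items():
        library_size_millions = mapped_reads[sample] / 1e6
        normalized = [d / library_size_millions for d in depths]
        group = genotypes[sample]
        if group not in sums:
            sums[group] = [0.0] * len(normalized)
        sums[group] = [a + b for a, b in zip(sums[group], normalized)]
        n_samples[group] += 1
    return {g: [v / n_samples[g] for v in vals] for g, vals in sums.items()}


# Hypothetical three-sample example over a two-base region.
raw = {"s1": [10, 20], "s2": [30, 60], "s3": [5, 5]}
reads = {"s1": 1_000_000, "s2": 2_000_000, "s3": 500_000}
geno = {"s1": "CC", "s2": "CC", "s3": "CT"}
profile = mean_normalized_coverage(raw, reads, geno)
print(profile)
```

Normalizing each sample before averaging prevents deeply sequenced samples from dominating the mean profile of their genotype group.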
This file contains supplementary information and methods. (PDF 337 kb)
Sample metadata for the HipSci cell lines used in this publication. This is a subset of HipSci's full catalogue of cell lines and data, which can be queried at http://www.hipsci.org/lines. (XLSX 184 kb)
CNA results. (a) CNA locations (b) Significance of CNA recurrence over 200 kb genome windows (c) Properties of the recurrent CNAs, including: peak region, overlap with chromatin fragile sites, cis (same chromosome from the CNA) and trans (different chromosome) regulated genes (i.e. genes differentially expressed between copy-number 2 and 3 lines), and top candidate genes (identified as described in the main text). (d) Genome-wide association of copy numbers at recurrent CNAs with gene expression (e) Pathway enrichment analysis of genes regulated in trans by the chromosome 17 recurrent CNA region. (XLSX 199 kb)
Gene expression variance components analysis. Fraction of variance explained by the factors considered for each expression array probe. (XLSX 3388 kb)
iPSC eQTL results. (a) eGene level summary of the cis-eQTLs discovered with different sample sets in this study. (b) eQTL results for primary and secondary lead eQTL variants of HipSci RNA-seq iPSC eGenes at FDR < 5% (N = 6,631). Primary and secondary eQTLs are defined by the column ‘primary_eQTL’. The column ‘iPSC_specific’ defines whether the eQTL is iPSC-specific. Columns ‘N_proxies_used’ and ‘proxy_positions’ give the total number and positions of proxy variants that were tested in the tissue-specific analysis. Additionally, the column ‘overlaps_CNA’ indicates whether the eQTL lead variant overlaps with a recurrent iPSC CNA. (XLSX 1545 kb)
Tissue information. (a) Description of the tissue data used in this study to define tissue-specific eQTLs (GTEx V6p, HipSci), including the embryonic origin of each tissue and number of tissue-specific eQTLs identified for each tissue. (b) Summary of iPSC eQTL replication tests in the tissue-specific analysis, showing for each replication tissue how often proxy variants (‘ld_buddy’, ‘best_proxy’) were tested instead of the same lead variant (‘same_as_lead’). (XLSX 17 kb)
iPSC eQTL overlap with disease-associated variants. (a) All disease-associated variants in the NHGRI-EBI GWAS catalogue (release 2016-04-10) which are tagged by an iPSC eQTL (lead variant or r² > 0.8 proxy). For proxy matches, all eQTLs for which the variant is a proxy (r² > 0.8) are shown. (b) Disease-associated variants in the GWAS catalogue that are lead eQTL variants in iPSCs (subset of (a)). For each variant, the number of high-LD proxies it has is listed (‘N_HIGH_LD_PROXIES’). (c) Individual traits in the GWAS catalogue for which iPSC eQTLs show a significant enrichment (BH-adjusted empirical P < 0.05, derived from 100 random sets of matched variants; Methods). Shown are traits with a minimum of five variants tagged by iPSC eQTLs. (d) Results of the colocalisation analysis for 14 traits. (XLSX 121 kb)
Kilpinen, H., Goncalves, A., Leha, A. et al. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature 546, 370–375 (2017). https://doi.org/10.1038/nature22403