Complex regulatory networks influence pluripotent cell state transitions in human iPSCs

Stem cells exist in vitro in a spectrum of interconvertible pluripotent states. Analyzing hundreds of hiPSCs derived from different individuals, we show the proportions of these pluripotent states vary considerably across lines. We discover 13 gene network modules (GNMs) and 13 regulatory network modules (RNMs), which are highly correlated with each other suggesting that the coordinated co-accessibility of regulatory elements in the RNMs likely underlie the coordinated expression of genes in the GNMs. Epigenetic analyses reveal that regulatory networks underlying self-renewal and pluripotency are more complex than previously realized. Genetic analyses identify thousands of regulatory variants that overlapped predicted transcription factor binding sites and are associated with chromatin accessibility in the hiPSCs. We show that the master regulator of pluripotency, the NANOG-OCT4 Complex, and its associated network are significantly enriched for regulatory variants with large effects, suggesting that they play a role in the varying cellular proportions of pluripotency states between hiPSCs. Our work bins tens of thousands of regulatory elements in hiPSCs into discrete regulatory networks, shows that pluripotency and self-renewal processes have a surprising level of regulatory complexity, and suggests that genetic factors may contribute to cell state transitions in human iPSC lines.


Supplemental Data 4: Gene Network Module Memberships
Information about gene network module (GNM) analysis annotations for the 16,110 expressed genes, including the Gencode v34 gene ID (Column A) and gene name (Column B), the corresponding GNM (Column C), whether the corresponding GNM was used for downstream analyses (Major GNM, Column D), whether the gene was considered Pareto (Column E), the gene's co-expression degree connectivity across all 16,110 expressed genes (genome-wide degree, Column F), and the gene's co-expression degree connectivity (Column G) within its corresponding module (intramodular degree).

Supplemental Data 5: GNM Enrichment Results
Results from the Fisher's Exact test to calculate pluripotency cell state gene set enrichments.Information includes the annotation tested (Column A), the GNM tested (Column B), the odds ratio (Column C), the two-sided p-value (Column D), and the Benjamini-Hochberg corrected p-value (Column E).The table includes the narrow peak ID (Column A), the accessibility (TMM) of 200 differentially accessible peaks between the FACS-sorted formative (GCTM-2 high CD9 high EPCAM high , Column B) and primed (GCTM-2 mid -CD9 mid , column C) populations from Lau et al. 2020.

Supplemental Data 8: Regulatory Network Module (RNM) Memberships and Annotations
Information about regulatory network module (RNM) analysis annotations for the 56,978 accessible peaks, including the peak ID (Column A), peak chromosome, start and end positions (Column B-D), the corresponding RNM (Column E), whether the RNM was one of the 13 used for downstream analyses (Major RNM, Column F), whether the peak was considered a Pareto peak (Column G), the peak's coaccessibility degree connectivity (Column H) across all 56,978 accessible peaks (genome-wide degree), and the peak's co-accessibility degree connectivity (Column I) within its corresponding module (intramodular degree), the iPSC-18 ChromHMM chromatin state annotation (Column J), the collapsed chromatin states (Column K), and the Gencode v34 gene ID (Column L), gene name (Column M), the distance in base pairs (Column N) of the closest expressed gene after ROCK kinase inhibitor stimulation (see Methods), and whether the peak overlaps a Formative (Column O) or Primed (Column P) peak.

Supplemental Data 9: TOBIAS Predicted Binding Site Validation
This table includes information about the TOBIAS prediction validation analysis, including the ENCODE ID of the TF ChIP-seq data for H1 ESCs (Column A), the corresponding transcription factor (Column B), the two-sided p-value and the odds ratio for the Fisher's Exact tests (Columns C-D).

Supplemental Data 10: Transcription Factor Group Motif Memberships
Information about TF groups determined by TOBIAS predicted binding similarities for 187 motifs, including the HOCOMOCO motif ID (Column A), the Gencode v34 gene ID (Column B), gene name (Column C) and the name of the collapsed TF group to which the motif belongs (Column D).Note: The TOBIAS motif distance matrix and predicted binding sites for all 187 motifs were uploaded to GEO (GSE203377) and FigShare (136585).

Supplemental Data 11: Annotations of 56,978 peaks for binding of 92 TF groups
This table includes the TOBIAS-predicted transcription factor binding sites for all 56,978 ATAC-seq peaks.
Information includes; the peak ID (Column A), and binary annotations for the 92 collapsed TF groups and "Not Bound" peaks (Columns B-CP), where 1 indicates that there is a bound TF group on the corresponding peak.

Supplemental Data 12: RNM Annotation Enrichment Results
This table contains the Fisher's Exact test results for enrichments in the RNM Pareto peaks, including the annotation type (transcription factor, chromatin state, and cell state, Column A), the corresponding annotation (Column B), the tested RNM (Column C), the odds ratio, two-tailed p-value, and Benjamini-Hochberg corrected p-value (Columns D-F).

Supplemental Data 13: RNM Fetal Tissue ATAC-seq Enrichment Results
This table contains the Fisher's Exact test results for the single-cell fetal tissue-specific peak RNM enrichments, including the tested fetal cell type (Column A), the tested RNM (Column B), the odds ratio, two-tailed p-value, and Benjamini-Hochberg corrected p-value (Columns C-E).

Supplemental Data 14: Fetal Tissue TFBS Enrichment Results
This table contains the Fisher's Exact test results for the single-cell fetal tissue-specific peak TFBS enrichments, including the tested fetal cell type (Column A), the tested TFBS (Column B), the odds ratio, two-tailed p-value, and Benjamini-Hochberg corrected p-value (Columns C-E).

SUPPLEMENTAL NOTE 1
Weighted gene co-expression network analysis (WGCNA) 6 is the most commonly used gene module detection method, however, it cannot account for kinship (genetically related individuals).In this study, we used hiPSC lines from 219 individuals (Supplemental Data 1) recruited as part of the iPSCORE resource, of which 140 belonged to families composed of two or more subjects (range: 2-14 subjects).To address this confounding factor, we first applied an LMM to calculate gene co-expression and ATAC-seq peak coaccessibility, using kinship as the random effects term (See Methods).We loaded the edges of the significantly co-expressed genes and co-accessibility ATAC-seq peaks into a network and applied the Leiden community detection algorithm to detect modules.To determine if our approach or WGCNA is more suitable for module detection using iPSCORE resource samples, we applied WGCNA to calculate gene co-expression in the 213 hiPSC RNA-seq samples.Downstream analyses showed that the WGCNA modules were correlated with biological and technical covariates, as well as family structure (Supplemental Figure 3).We also determined that the LMM-Leiden module detection approach was more precise at identifying modules associated with formative-state-specific gene expression than WGCNA (Supplemental Figure 4).These results show that the conventional WGCNA module detection approach can be affected by donor relatedness and that accounting for kinship leads to more accurate module membership.

Supplemental Figure 1. Pluripotency cell states in human induced pluripotent stem cell lines.
Recent studies have shown that human induced pluripotent stem cell (hiPSC) lines are mosaics, composed of varying proportions of cells in different, interconvertible pluripotent cell states with distinct transcriptional circuitry and epigenomic profiles 1 .Live cell staining 1 and single-cell analyses 2 in hiPSCs have revealed that the interconvertible pluripotent cell states have morphological, transcriptional, and epigenetic profiles that correspond to different embryonic and extraembryonic lineage fates 1,2 .Subpopulation heterogeneity complicates the molecular characterization of these interconvertible stem cell stages.Most cells within a conventional hiPSC culture resemble the late post-implantation epiblast stem cells 1 , which resembles the primitive streak and are referred to as "primed".Primed stem cells co-express lineage-specific and pluripotency genes."Naïve" stem cells represent the cellular state of the inner cellular mass (ICM) preimplantation epiblast which gives rise to the embryo proper.At this same pre-implantation stage are the extra-embryonic primitive endoderm cells (PrE) that give rise to the primary yok sac.A pluripotent state called "formative" has been identified as developmentally between the naïve and primed states, which represents the early post-implantation epiblast (EPE).The formative pluripotent stage has been shown to be comprised of cells enriched for high self-renewal 1 .Typically, a small proportion of cells within a hiPSC line are totipotent and resemble the 8-cell morula (8 cell-like cells; 8CLC) 3 .The 8CLC subpopulation is enriched in catabolic processes, such as protein synthesis and RNA metabolism 3 .Created with BioRender.com.
Data 6: ATAC-seq Sample Information Information about the 263 individual ATAC-seq samples from iPSCORE individuals that were merged by library into 150 ATAC-seq samples.The table includes the subject universally unique identifier (UUID, Column A), the iPSCORE subject ID (Column B), the iPSC iPSCORE ID which consists of the iPSCORE subject ID, clone number and passage number (Column C), sample UUIDs for the 150 merged ATAC-seq libraries (Column D), sample UUIDs for the 263 ATAC-seq samples before merging (Column E), the iPSC clone number (Column F), the iPSC passage (Column G) and number of reads passing filters (Column H), the mean fragment size (Column I), the number of broad peaks used for QC (Column J), the ratio of 100bp reads to 150bp reads in the merged sample (1=100% 100bp reads and 0=100% 150bp reads; see Methods; Column K), the estimated formative proportion (Column L), the CIBERSORT deconvolution correlation values (Column M), and whether the merged sample was used as a reference for establishing a set of reference narrow peaks (Column N).Note: To obtain non-redundant data for the 150 merged ATAC-seq samples used in downstream analyses, use the unique rows from columns A-D and F-M.Supplemental Data 7: Signature Peak Matrix for ATAC-seq Cellular Deconvolution Signature gene expression matrix for CIBERSORT cellular deconvolution of 150 ATAC-seq samples.
SNPs tested for ASCA, including SNP ID and gnomad RSID (Columns A-B), Peak ID (Column C), number of reads mapping to the reference and alternative alleles (Columns D-E), the allelic imbalance fraction (Column F), the number of heterozygous individuals tested (Column G), the minor allele frequency (Column H), the two-sided p-value from the binomial test and Benjamini-Hochberg corrected p-value (Columns I-J), whether the SNP has ASCA (adjusted P-value < 0.05, Column K).