Single-cell RNA-sequencing of differentiating iPS cells reveals dynamic genetic effects on gene expression

Recent developments in stem cell biology have enabled the study of cell fate decisions in early human development that are impossible to study in vivo. However, understanding how development varies across individuals and, in particular, the influence of common genetic variants during this process has not been characterised. Here, we exploit human iPS cell lines from 125 donors, a pooled experimental design, and single-cell RNA-sequencing to study population variation of endoderm differentiation. We identify molecular markers that are predictive of differentiation efficiency, and utilise heterogeneity in the genetic background across individuals to map hundreds of expression quantitative trait loci that influence expression dynamically during differentiation and across cellular contexts.


Introduction 38
The early stages of human embryogenesis involve dramatic and dynamic changes in cellular 39 states. However, the extent to which an embryo's genetic background influences this process 40 has only been determined in a small number of special cases linked to rare large-effect 41 variants that cause developmental disorders. This lack of information is critical -it can provide 42 a deep understanding of how genetic heterogeneity is tolerated in normal development, when 43 controlling the expression of key genes is vital. Additionally, with cellular reprogramming 44 becoming an increasingly used tool in molecular medicine, understanding how inter-individual 45 variability effects such differentiations is key. 46 47 Critically, recent technological developments have begun to facilitate such studies in vitro. In 48 particular, the generation of population-scale collections of human induced pluripotent stem 49 cells (iPSCs) [1,2] has allowed for assessing regulatory genetic variants in pluripotent [1,2] as 50 well as in differentiated cells [3][4][5]. In addition, the rapid developments in single-cell RNA-seq 51 now allow for assessing the molecular impact of genetic variability in a continuous manner 52 across early human development. 53 54 Here, we use a pooled cell differentiation assay to study endoderm differentiation across a set 55 of 125 human iPSC lines, profiling changes in gene expression via single-cell RNA-56 sequencing at 4 developmental timepoints [6]. Our study not only allows discovery of hundreds 57 of novel expression Quantitative Trait Loci (eQTL) that vary across differentiation, but also 58 enables the uncovering of genetic variants that impact the rate at which a cell line 59 differentiates. Finally, we generalise approaches from studies of the interaction between 60 genotype and environment (GxE) by leveraging the single-cell resolution of our study to 61 investigate the interplay between genetic factors and cellular states. 62 98 99 100   Pseudo-temporal ordering yields stage-specific eQTL 113 Motivated by the observation that a substantial fraction of variability in gene expression was 114 explained by cell-line effects (Fig. 1B), we tested for associations between common genetic 115 variants and gene expression at the three defined stages of cell differentiation (Fig. 1C). 116 Briefly, for each donor, experimental batch, and differentiation stage, we quantified each 117 gene's average expression level (Methods), before using a linear mixed model to test for cis 118 eQTL, adapting approaches used for bulk RNA-seq profiles (+/-250kb, MAF > 5% [1]; 119 Methods). In the iPSC population (day0), this identified 1,833 genes with at least one eQTL 120 (denoted eGenes; FDR < 10%; 10,840 genes tested; Table S3). To validate our approach, we 121 also performed eQTL mapping using deep bulk RNA-sequencing data from the same set of 122 iPSC lines ("iPSC bulk"; 10,736 genes tested), yielding consistent eQTL (~70% replication of 123 lead eQTL effects; nominal P < 0.05; Methods; Table S4). 124 125 Analogously, we mapped eQTL in the mesendo and defendo populations, yielding 1,702 and 126 1,342 eGenes respectively. For comparison, we also performed eQTL mapping in cells 127 collected on day1 and day3 --the experimental time points commonly used to identify cells at 128 mesendo and defendo stages [6]. Interestingly, this approach identified markedly fewer 129 eGenes (1,181 eGenes at day1, and 631 eGenes at day3), demonstrating the power of using 130 the single-cell RNA-seq profiles to define relatively homogeneous differentiation stages in a 131 data-driven manner (Fig. 2B, S9; Methods; Table S5). 132 133 Profiling multiple stages of endoderm differentiation allowed us to assess at which stage along 134 this process individual eQTL can be detected. We observed substantial regulatory and 135 transcriptional remodelling upon iPS differentiation to definitive endoderm, with over 30% of 136 eQTL being specific to a single stage ( Fig. 2A, 2C; Methods). Our differentiation time course 137 covers developmental stages that have never before been accessible to genetic analyses of 138 molecular traits. Consistent with this, 349 of our eQTL variants at the mesendo and defendo 139 stages have not been reported in either a recent iPSC eQTL study based on bulk RNA-seq 140 [10], or in a compendium of eQTL identified from 49 tissues as part of the GTEx project [11] 141 (linkage disequilibrium, LD: r 2 < 0.2; Methods; Table S3). 142 143 In addition to these novel eQTL, we identified lead switching events for 155 eGenes, that is 144 distinct lead eQTL variants for the same gene at different stages of differentiation (LD: r 2 < 0.2; 145 Methods). To investigate the potential regulatory role of such variants, we examined whether 146 the corresponding genetic loci also featured changes in histone modifications during 147 differentiation. Specifically, we used ChIP-Sequencing to profile five histone modifications 148 associated with gene and enhancer usage (H3K27ac, H3K4me1, H3K4me3, H3K27me3, 149 H3K36me3) in hESCs that were differentiated (using the same protocol employed above) 150 towards endoderm and measured at equivalent time points (i.e. day0, day1, day2, day3; 151 Methods). Intriguingly, for 20 of the lead switching events, we observed corresponding 152 changes in the epigenetic landscape (stage-specific lead variants overlap with stage-specific 153 changes in histone modification status), suggesting a direct mode of action (Fig. 2D). Previous studies have demonstrated that iPSC lines vary in their capacity to differentiate [12]. 171 As a measure of differentiation efficiency in our experiments, we used average pseudotime 172 on day3, and observed significant variation across cell lines, which was consistent across 173 replicate differentiations of the same cell line (Fig. 3A). Exploiting the scale of our study and 174 the pooled experimental design, we set out to identify genetic and molecular markers of 175 differentiation efficiency that are accessible prior to differentiation (Methods). 176 177 First, we considered the set of 4,422 eQTL lead variants at any of the three developmental 178 stages and tested each variant for association with differentiation efficiency (Fig. 3B; using a 179 linear mixed model; Methods). This identified 5 eQTL variants at a lenient false discovery rate 180 threshold (FDR 20%; Fig. 3B, Table S6). The most significant associations were observed 181 with eQTL variants for DPH3 (P = 3.9e-5) and H2AFY2 (P = 1e-4 Having identified genetic markers associated with differentiation capacity we next asked 195 whether the average expression level of genes at the iPSC stage could represent molecular 196 markers of differentiation efficiency. This revealed 38 associations (FDR 10%, 11,231 genes 197 tested; Table S7), 15 of which were also observed when using independent bulk RNA-seq 198 data from the same cell lines (replication defined as nominal P < 0.05; Table S7; Methods). 199 As an example, the expression of ZDHHC9 in iPSCs was negatively associated with 200 differentiation efficiency (Fig. 3C). Furthermore, ZDHHC9 is one of 17 differentiation-201 associated genes located on the X chromosome, reflecting a significant enrichment of X 202 chromosome genes (24.5-fold enrichment, P = 8x10 -16 , Fisher's exact test). Higher expression 203 of these genes was associated with reduced differentiation efficiency ( Fig. 3C; Methods). The 204 majority of these associations persisted when limiting the analysis to female lines (14/17 at P 205 < 0.05), indicating variation beyond differences between sexes. These results are consistent 206 with previous observations that X chromosome reactivation is a marker of poor differentiation 207 capacity of iPSCs in general [16,17]. Finally, we note that the set of associated genes located 208 on other chromosomes included genes with known roles in iPSC differentiation, such as TBX6 209 [18].  or genetic effect dynamics. Reassuringly, the two approaches were highly consistent across 243 pseudotime (Fig. 4A, S10).

245
To formally test for eQTL effects that change dynamically across differentiation (dynamic 246 QTL), we tested for associations between pseudotime and the genetic effect size (defined 247 based on ASE; likelihood ratio test, considering linear and quadratic pseudotime), uncovering 248 a total of 785 time dynamic eQTL (FDR < 10%; Methods), including a substantial fraction of 249 eQTL that were not stage-specific (Table S3). This complements our earlier analysis, which 250 identified substantial stage-specific effects ( Fig. 2A, 2C), by identifying subtle changes in the 251 relationship between genotype and phenotype during differentiation. To further explore this 252 set of genes, we clustered eQTL jointly based on the relative gene expression dynamics 253 (global expression changes along pseudotime, quantified in sliding windows as above, 254 Methods), and on the genetic effect dynamics ( Fig. 4A; Methods). This identified four basic 255 dynamic patterns (Fig. 4B): sharply decreasing (cluster A), gradually decreasing (cluster B), 256 transiently increasing (cluster C), and gradually increasing (cluster D). As expected, stage-257 specific eQTL were grouped together in particular clusters (e.g. defendo specific eQTL in 258 cluster D; Fig. S11). Notably, the gene expression dynamics and the eQTL dynamics tended 259 to be distinct, demonstrating that gene expression level is not the primary mechanism 260 governing variation in genetic effects. In particular, genetic effects were not most pronounced 261 when gene expression was high (Fig. 4C, 4D). 262 263 264  expression. This is illustrated by the mesendoderm-specific eQTL for SPP1. Overall 286 expression of SPP1 decreases during differentiation, but expression of the alternative allele is 287 repressed more quickly than that of the reference allele (Fig. 4C). This illustrates how cis 288 regulatory sequence variation can modulates the timing of expression changes in response to 289 differentiation, similar to observations previously made in C. elegans using recombinant inbred 290 lines [19]. In other cases, the genetic effect coincides with high or low expression, for example 291 in the cases of IRF6 and AACS (Fig. 4C). These examples illustrate how genetic variation is 292 intimately linked to the dynamics of gene regulation. 293 294 We next asked whether dynamic eQTL were located in specific regulatory regions. To do this, 295 we evaluated the overlap of the epigenetic marks defined using the hESC differentiation time 296 series with the dynamic eQTL (Fig. 4D, S12). This revealed an enrichment of dynamic eQTL 297 in H3K27ac, H3K4me1 (i.e. enhancer marks), and H3K4me3 (i.e. promoter) marks compared 298 to non-dynamic eQTL (i.e. eQTL that we identified but did not display dynamic changes along 299 pseudotime, Fig. 4D), consistent with these SNPs being located in active regulatory elements. 300 301 Cellular environment modulates genetic effects on expression 302 Whilst differentiation was the main source of variation in the dataset, single cell RNA-seq 303 profiles can be used to characterize cell-toll-cell variation across a much wider range of cell 304 state dimensions [20-22]. We identified sets of genes that varied in a co-regulated manner 305 using clustering (affinity propagation; 8,000 most highly expressed genes; Table S8;  306 Methods), which identified 60 modules of co-expressed genes. The resulting modules were 307 enriched for key biological processes such as cell differentiation, cell cycle state (G1/S and 308 G2/M transitions), respiratory metabolism, and sterol biosynthesis (as defined by Gene 309 Ontology annotations; Table S9). These functional annotations were further supported by 310 transcription factor binding (e.g. enrichment of SMAD3 and E2F7 targets in the differentiation 311 and cell cycle modules, respectively (Table S10, S11)). Additionally, expression of the cell 312 differentiation module (cluster 6; Table S9) was correlated with pseudotime, as expected (R 313 = 0.62; Fig. S7). 314 315 316

336
Using the same ASE-based interaction test as applied to identify dynamic QTL, reflecting ASE 337 variation across pseudotime ( Fig. 4; Methods), we assessed how the genetic regulation of 338 gene expression responded to these cellular contexts. Briefly, we tested for genotype by 339 environment (GxE) interactions using a subset of four co-expression modules as markers of 340 cellular state, while accounting for pseudotime ( Fig. 5A; Methods). These four co-expression 341 modules were annotated based on GO term enrichment, and taken as markers representing 342 cell cycle state (G1/S and G2/M transitions) and metabolic pathway activity (respiratory 343 metabolism and sterol biosynthesis; Methods). This approach extends previous work using 344 ASE to discover GxE interactions [23,24], taking advantage of the resolution provided by 345 single-cell data. We identified 686 eQTL that had an interaction effect with at least one factor 346 ( Fig. 5B; FDR < 10%), with many of these effects being orthogonal to the effects of 347 differentiation. Indeed, 394 genes had no association with pseudotime, but responded to at 348 least one other factor. Conversely, of the 785 dynamic eQTL, 292 were also associated with 349 other factors, while 493 were associated only with pseudotime (Fig. 5B, S13 shows sensitivity to cellular respiratory metabolic state (Fig. 5A). This eQTL SNP is in strong 357 LD (r 2 = 1.0) with a GWAS risk variant for basal cell carcinoma [25]. Furthermore, an eQTL for 358 PPP1R14A showed sensitivity to the G1/S state, and is in LD (r 2 = 0.81) with a GWAS risk 359 variant for prostate cancer [26] (Fig. 5A).  Fig. 5B, 369 5C; Methods). This identified 220 genes with significant higher order interactions between a 370 genetic variant, differentiation, and at least one other factor (Fig. 5B, 5C, S13; Table S13d).

371
One example is the eQTL for RPS26, whose ASE was sensitive to cellular respiration, but 372 only late in differentiation (Fig. 5C) project were considered for analysis ( Table S1). Batches of 4-6 cell lines were co-cultured 396 and grown as a mixed population for a total of 28 experiments, in 12 well plates. Cells were 397 harvested immediately prior to the initiation of differentiation (day0; iPSCs), and at time points 398 1, 2, and 3 days post differentiation initiation (day1, day2, day3). Subsequently, single cells 399 were sorted into 384 well plates. Cells were processed using Smart-seq2 for scRNA-seq with 400 parallel FACS analysis of the markers TRA-1-60 and CXCR4 being performed for each cell. 401 A subset of cell lines were assayed in more than one experiment (33 donors; analyses if they had at least 50,000 counts from endogenous genes, at least 5,000 genes with 474 non-zero expression, less than 90% of counts came from the 100 highest-expressed genes, 475 less than 15% of reads mapping to mitochondrial (MT) genes, they had a Salmon mapping 476 rate of at least 60%, and if the cell was successfully assigned to a donor (Fig. S15). Dead 477 cells as identified based on 7AAD staining were discarded. Size factor normalization of counts 478 was performed using the scran Bioconductor package in R [44]. Expressed genes with an 479 HGNC symbol were retained for analysis, where expressed genes in each batch of samples 480 were defined based on i) raw count > 100 in at least one cell prior to QC and ii) average 481 log2(CPM+1) > 1 after QC. Normalized CPM data were log transformed (log2(CPM+1)) for all 482 downstream analyses. The joint dataset was investigated for outlying cell lines or experimental 483 batches, which identified no clear groups of outlying cells (Fig. S16, S17). 484 485 As a final QC assessment, we considered possible differences between cell lines from healthy 486 and diseased donors. In particular, a subset of 11 cell lines were derived from neonatal 487 diabetes patients, and differentiated together with cell lines from healthy donors across 7 488 experiments (out of 28). There was no detectable difference in differentiation capacity between 489 healthy and neonatal diabetes lines in these experiments (P>0.05), and cells from both sets 490 of donors overlapped in principal component space (Fig. S18). Thus, we included cells from 491 all donors in our analyses irrespective of disease state. 492 493 The final merged and QC'ed dataset consisted of 36,044 cells with expression profiles for 494 11,231 genes (Fig. S2). 495 Bulk RNA-Seq quality control and processing 496 Raw RNA-seq data for 546 HipSci cell-lines were obtained from the ENA project: ERP007111 497 and EGA projects: EGAS00001001137 and EGAS00001000593. CRAM files were merged 498 per cell-line and converted to FASTQ format. Processing of the merged FASTQ files was 499 matched to the single cell processing, as described above. Samples with low quality RNA-seq 500 were discarded based on the following criteria: lines with less than 2 billion bases aligned, with 501 less than 30% coding bases, or with a duplication rate higher than 75%. This resulted in 540 502 lines for analysis, 108 of which had matched (day0) single cell RNA-seq data available. 503 504 Gene-level expression levels were quantified using Salmon, analogously to the alignment, as 505 described for the single cells. Gene expression profiles were normalized using scran, to match 506 the single cell data processing, and the scran normalized CPM data is log transformed 507 (log2(CPM+1)). 508 Variance component analysis  Table S14. 514 Highly variable genes 515 The top highly variable genes were computed using scran's trendVar and decomposeVar 516 functions, using a design matrix to correct for the differentiation experiment-specific effects 517 (i.e. treating each experiment as a different batch). At FDR < 1%, this identified 4,546 highly 518 variable genes. 519 Pseudotime definition 520 We used the first principal component calculated based on the top 500 highly variable genes 521 in our set to represent differentiation pseudotime. This component was linearly re-scaled to 522 take values between 0 (the minimum value observed for any cell) and 1 (the highest value 523 observed). For comparison, we considered three alternative methods for defining pseudo time: 524 525 (i) We considered diffusion pseudotime (DPT) [46] (Fig. S7A). The underlying diffusion map 526 was generated using 15 nearest neighbours and with gene expression represented by the first 527 20 PCs across the top 500 most highly variable genes. DPT analysis was carried out using 528 the default settings with Scanpy v1.2.2 [47]. There was a Pearson correlation of 0.82 between 529 DPT and the pseudotime definition we used. 530 531 (ii) We considered calculating pseudotime by projecting each cell on to the principal curve of 532 the first two principal components of the top 500 most highly variable genes (Fig. S7B). 533 Principal curve analysis was performed using the R package princurve [48]. There was a 534 Pearson correlation of 0.86 between the principal curve pseudotime and the pseudotime 535 definition we used. 536 537 (iii) We considered representing pseudotime by the mean expression of the differentiation co-538 expression module. This gene cluster was enriched for GO terms associated with 539 differentiation including 'anatomical structure morphogenesis' (GO:0009653), 540 'anterior/posterior pattern specification' (GO:0009952), and 'response to BMP' (GO:0071772) 541 (Table S9; Fig. S7C). There was a Pearson correlation of 0.64 between the differentiation co-542 expression module and the pseudotime definition we used. The lower concordance between 543 pseudotime and this module is consistent with the limited set of genes included -the 544 coexpression module only includes genes upregulated during differentiation, and therefore 545 uses no information from changes in expression of pluripotency-associated genes. 546 547 Definition of mesendoderm and definitive endoderm populations 548 The stage labels post iPSC (mesendo and defendo) were defined using a combination of 549 differentiation stages obtained using the single-cell defined pseudotime and knowledge based 550 on canonical marker genes. Cells were assigned to the mesendo stage if they were collected 551 at day1 or day2, and had pseudotime values between 0.15 and 0.5, corresponding to a 552 pseudotime window around the peak expression of Brachyury (T), a marker of mesendoderm 553 (Fig. S8A). Cells were assigned to the defendo stage if they were collected at day2 or day3, 554 and had pseudotime values higher than 0.7, corresponding to a pseudotime window with 555 maximal expression of GATA6, a marker of definitive endoderm (Fig. S8B). Cells with 556 intermediate pseudotime (between 0.5 and 0.7) mostly came from day2, and were not 557 assigned to any stage for the purposes of the initial stage QTL mapping (results shown in Fig.  558 2). Overall, we assign 28,971 (80%) cells to any of the stages (iPSC, mesendo, defendo). 559 Identification of genetic and molecular markers for differentiation 560 efficiency 561 Differentiation efficiency for each cell line was defined as its average pseudotime across cells 562 at day3, quantified for each experiment and unique donor. To test for associations with 563 molecular markers, we considered stage-specific gene expression levels, again quantified for 564 each donor and experiment (as log2(CPM + 1)). 565 566 Three sets of tests were performed. In each case, models were fitted using the lme4 package 567 in R [49], and significance was determined by the Likelihood ratio test. The tested model was: Where Experiment is a random effect grouping sets of samples from the same experiment, 572 and Donor is a random effect grouping samples from the same donor (and cell line). Two sets 573 of Markers were tested -genetic markers (i.e. eQTL SNPs), and expression markers (i.e. 574 expression levels in the iPSC stage/day0), and are presented in Table S6, genes with an eQTL respectively (FDR<10%). eQTL results are provided in Table S3).  632  633 For comparison, we performed analogous QTL analyses using all cells from day1, and day3 634 instead of the pseudo-time based differentiation stages. This approach resulted in 1,181 635 (10,787 tested) and 631 (10,765 tested) genes with an eQTL at day 1 and 3 respectively 636 (Table S5). 637 Mapping dynamic eQTL (visualisation purposes only) 638 We performed eQTL mapping across a sliding window on pseudotime, considering bins that 639 contain 25% of all cells, sliding along the pseudotime by a step of 2.5% of cells (Fig. 4A quantifications for each SNP were converted from "alternative allele reads" to "chrB allele 681 reads" using the known phase (indicated as chrA|chrB, where 0=reference, 1=alternative) of 682 each SNP in each donor (e.g. for a SNP with the phase "1|0", the alternative allele is on chrA, 683 so the number of reads mapping to chrB = number of reference allele reads = total number of 684 reads -number of alternative allele reads). Thus, for each cell, ASE for all SNPs was quantified 685 relative to the genotypes of the chromosomes of that individual, rather than to "reference" or 686 "alternative" alleles.
(3) Aggregation of ASE from SNP-level to gene-level. For each gene, this 687 was done by summing the "chrB allele reads" and "total reads" across all SNPs contained in 688 the exons of that gene (as described in the ENSEMBL 75 GTF file). (4) Conversion of 689 quantifications from "chrB allele reads" to "reads from the chromosome containing the 690 alternative allele of the eQTL SNP", again by using the available phasing information. For each 691 eQTL (i.e. each gene-SNP pair), this provides a consistent definition of ASE across all cells 692 heterozygous for the eQTL SNP (i.e. across cells from multiple donors). Donors that are not 693 heterozygous at the eQTL variant of interest were not used for quantification. (5) Conversion 694 to allelic fractions i.e. quantifications express the allelic reads as a fraction of the total number 695 of reads. 696 697 ASE association tests with cellular factors 698 ASE quantifies the relative expression of one allele over the other. If one of these alleles is 699 more responsive to a particular environmental factor (e.g. because of preferential transcription 700 factor binding), then ASE is expected to vary systematically with that factor. This observation 701 has previously been used to identify GxE interactions in gene expression across individuals 702 [23]. Here, we applied similar concepts to single-cell RNA-seq, testing for the influence of 703 cellular environmental factors (i.e. cellular processes) on ASE in individual cells. Importantly, 704 these ASE tests are "internally matched", as potentially confounding batch effects and 705 technical variation affect both alleles in each cell similarly. 706 707 Five sets of tests were performed, in a linear modelling framework (Fig. 5, S13; Tables S13): 708 709 (1) Linear pseudotime ("pseudo") tests. The ASE of each gene-eQTL pair was tested for 710 association with pseudotime, across all cells in which ASE was quantified for that pair: 711 712 = + 713 714 (2) Quadratic pseudotime tests. As (1), but with linear pseudotime as a covariate: 715 716 = + + 717 718 (3) Linear cellular factor test. As (1), but with each of 4 cellular factors ("factor") (respiratory 719 metabolism, sterol biosynthesis, G1/S transition and G2/M transition): 720 721 = + 722 723 (4) Pseudotime-corrected linear cellular factor test. As (3), but with pseudotime included as a 724 covariate: 725 726 = + + 727 728 (5) Combined pseudotime-factor test. As (4), but testing for the additional effect of (pseudotime 729 x factor) included as a covariate: 730 731 = + + ( × ) + 732 733 In each case, tests were only performed for eQTL for which ASE was quantified in at least 500 734 cells. Tests were performed using the statsmodels package in Python (likelihood ratio test  Fig. 4C), two calculations were performed. First, 746 the mean expression of each gene across the pseudotime bins was calculated using all cells 747 heterozygous for the eQTL SNP (i.e. the cells in which ASE was quantified). The expression 748 of each allele in each pseudotime bin was then calculated by taking the mean ASE +/-SEM, 749 multiplied by the mean expression of that gene (in CPM) in that bin. 750