Transcriptional signatures of schizophrenia in hiPSC-derived NPCs and neurons are concordant with signatures from post mortem adult brains

Whereas highly penetrant variants have proven well-suited to human induced pluripotent stem cell (hiPSC)-based models, the power of hiPSC-based studies to resolve the much smaller effects of common variants within the size of cohorts that can be realistically assembled remains uncertain. In developing a large case/control schizophrenia (SZ) hiPSC-derived cohort of neural progenitor cells and neurons, we identified and accounted for a variety of technical and biological sources of variation. Reducing the stochastic effects of the differentiation process by correcting for cell type composition boosted the SZ signal in hiPSC-based models and increased the concordance with post mortem datasets. Because this concordance was strongest in hiPSC-neurons, it suggests that this cell type may better model genetic risk for SZ. We predict a growing convergence between hiPSC and post mortem studies as both approaches expand to larger cohort sizes. For studies of complex genetic disorders, to maximize the power of hiPSC cohorts currently feasible, in most cases and whenever possible, we recommend expanding the number of individuals even at the expense of the number of replicate hiPSC clones.


111
We used an integration free approach to generate genetically unmanipulated 112 hiPSCs from COS patients (14 of 16 patients, 88% reprogrammed) and unrelated 113 age-and sex-matched controls (12 of 12 controls, 100% reprogrammed) (Fig.   114   1B). Briefly, primary fibroblasts were reprogrammed by sendai viral delivery of 115 KLF4, OCT4, SOX2 and cMYC; presumably clonal lines were picked and 116 expanded 23-30 days following transduction. Following extensive 117 immunohistochemistry, florescent activated cell sorting (FACS), quantitative 118 polymerase chain reaction (qPCR) and karyotype assays to assess the quality of 119 the hiPSCs (Fig. 1B,E,F), we selected two to three presumably clonal hiPSC 120 lines per individual (n=40 COS, n=35 control,   (Table 1; SI Table   131 2). We 11,13,33 and others 1,19,34 have previously demonstrated that hiPSC-NPCs  Table 2 for available batch information). Only validated hiPSC-NPCs that 149 yielded high quality populations of matched hiPSC-NPCs and hiPSC-neurons in 150 one of three batches of thaws were used for RNA-Seq (SI Table 1,2).  (Table 1; SI Table 2). The median number of 156 uniquely mapped read pairs per sample was 42.7 million, of which only a very 157 small fraction were rRNA reads (SI Fig. 1; SI Table 3). 18,910 genes (based on 158 ENSEMBL v70 annotations) were expressed at levels deemed sufficient for 159 analysis (at least 1 CPM in at least 30% of samples); 11,681 were protein coding, 160 879 were lincRNA, and the remaining were of various biotypes (SI Table 4).

162
Since six COS patients were selected based on CNV status, we examined gene 163 expression in the regions affected by the CNVs. Despite the noise inherent to 164 RNA-Seq and the high level of biologically driven expression variation in samples 165 without CNVs, we identified corresponding hiPSC-NPC and neuron expression 166 changes in some CNV regions (SI Fig. 2).

168
In addition to SZ diagnosis-dependent effects, gene expression between hiPSC-169 NPCs and hiPSC-neurons was expected to vary as a result of technical 35 , 170 epigenetic 36-38 and genetic 39-41 differences 42 . Unexpectedly, we also observed 171 substantial variation in cell type composition (CTC) between populations of 172 hiPSC-NPCs and hiPSC-neurons. In the following sections, we discuss our 173 strategy to address these sources of variation (Figs. 2-4).

175
Addressing technical variation in hiPSC-NPC and neuron RNA-Seq data 176 177 We implemented an extensive quality control pipeline to detect, minimize and 178 account for many possible sources of technical variation (Fig. 1I). Samples were   2768 genes correlated with Sendai expression at FDR < 5% (SI Table 5). We 218 note that this signal is not driven by outliers since quantile normalized Sendai 219 expression values were used in this analysis. In fact, these genes are highly 220 enriched for targets of MYC (OR = 3.75, p < 6.4e-38) (SI Table 6   To assess the similarity of our hiPSC-NPCs and hiPSC-neurons to other hiPSC 241 studies (by ourselves and others), as well as to post mortem brain, we compared 242 our dataset to publically available hiPSC, hiPSC-derived NPCs/neurons, and 243 post mortem brain homogenate expression data sets (Fig. 2). Hierarchical 244 clustering indicated that similarity in expression profiles is largely determined by 245 cell type ( Fig. 2A). hiPSC-NPC and hiPSC-neuron datasets were more similar to cluster with hiPSC-NPCs and hiPSC-neurons, respectively, generated previously 252 in the same lab 10,12 and with hiPSC-NPCs and hiPSC-neurons from others 14 , 253 although some hiPSC-neurons 19 are more similar to prenatal brain samples from 254 multiple brain regions 49 . Consistent with a differentiation paradigm from hiPSC 255 to NPC to neuron, multidimensional scaling analysis (Fig. 2B) indicated that 256 hiPSC-NPCs more resemble hiPSCs / hESCs than do hiPSC-neurons.

258
Genome-wide, hiPSC-NPCs and hiPSC-neurons express a common set of 259 genes, so that expression differences between these cell types are driven by Given the substantial variability we observed between hiPSC-NPCs and hiPSC-271 neurons, even from the same individual (SI Fig. 8), it seemed likely that inter- neurons, but also inhibitory and rare dopaminergic neurons, as well as astrocytes 277 11 . We hypothesized that CTC could be inferred using existing single cell RNA-278 Seq datasets and would enable us to (partially) correct for variation in 279 differentiation efficiencies and account for some of the intra-individual expression 280 variation.

282
Bulk RNA-Seq analysis reflects multiple constituent cell types; therefore, we 283 performed computational deconvolution analysis using CIBERSORT 51 to 284 estimate CTC scores for each hiPSC-NPC and hiPSC-neuron sample (Fig. 3). 285 A reference panel of single cell sequencing data from mouse brain 52 , mouse cell 286 culture of single neural cells 53 and bulk RNA-Seq from hiPSC 50 was applied. samples had a higher neuron CTC score than hiPSC-NPCs (Fig. 3A), while 293 hiPSC-NPCs had a higher hiPSC CTC score, consistent with a "stemness" signal 294 (a neural stem cell profile was lacking from our reference) (Fig. 3B). 295 Unexpectedly, hiPSC-NPCs had a higher fibroblast 1 score (Fig. 3C). Rather than 296 imply that there are functional fibroblasts within the hiPSC-NPC populations, we 297 instead posit that this fibroblast signature is instead marking a subset of 298 unpatterned, potentially non-neuronal cells 53 . Analysis of external NPC and 299 neuron datasets indicates that these observations were reproducible, although 300 there is substantial variability in CTC scores across datasets (SI Fig. 9). 301 Although correction for CTC improved our ability to distinguish hiPSC-NPC and

306
The effect of CTC heterogeneity, likely due to the variation in differentiation 307 efficiency, can be reduced by including multiple CTC scores in a regression 308 model and computing the residuals.
Using an unbiased strategy, we  (Fig. 4). For each gene we calculated the percentage of 328 expression variation attributable to cell type, donor, diagnosis, sex, as well as 329 CTC scores for both fibroblast sets. All remaining expression variation not 330 attributable to these factors was termed residual variation. The influence of each 331 factor varies widely across genes; while expression variation in some genes is 332 attributable to cell type, other genes are affected by multiple factors (Fig. 4A). 333 Overall, and consistent with the separation of hiPSC-NPCs and hiPSC-neurons 334 by the first PC, cell type has the largest genome-wide effect and explained a 335 median of 13.3% of the observed expression variation (Fig. 4B) Critically, there is no apparent diagnosis dependent variation in CTC (SI Fig. 12).

354
By compensating for CTC, we prevent variation in neuronal differentiation 355 between hiPSCs from overriding some of the donor-specific gene expression 356 signature that is the central focus of patient-derived cell culture models.

358
The percentage of expression variation explained by each factor has a specific 359 biological interpretation. PRRX1 is known to function in fibroblasts 56,57 and 360 variation in the fibroblast 1 CTC score explains 38.3% of expression variant in this 361 gene (Fig. 4C). Expression of CNTC4 is driven by an eQTL in brain tissue that 362 corresponds a risk locus for schizophrenia 48 . In our data, CNTC4 has 67.4% 363 expression variation across donors suggesting that this variation is driven by 364 genetics (Fig. 4D). Genes that vary across diagnosis correspond to differentially 365 expressed genes, including FZD6, a WNT signaling gene linked to depression 58 , 366 ( Fig. 4E) and QPCT, a pituitary glutaminyl-peptide cyclotransferase that has 367 been previously associated with SZ 59 (Fig. 4F).

369
Genes that vary across donors were enriched for eQTLs detected in post mortem 370 brain tissue 48 (Fig. 4G) Genes with similar functions are known to share regulatory mechanisms and so 383 are often coexpressed 60 . We used weighted gene coexpression network 384 analysis (WGCNA) 61 to identify modules of genes with shared expression 385 patterns (Fig. 5, SI Table 7). Genes were clustered into modules of a minimum 386 of 20 genes, and each module was labeled with a color (SI Fig. 13). Genes that 387 did not form strong clusters were assigned to the grey module. Analysis was 388 performed separately in hiPSC-NPCs and hiPSC-neurons; each module was 389 evaluated for enrichment of genes for multiple biological processes. Many

424
While plausible candidates such as FZD6 and QPCT were differentially 425 expressed, gene set enrichment testing did not implicate a coherent set of 426 pathways (SI Table 9). Since SZ is a highly polygenic disease 67,68 and this hiPSC-neurons, which showed remarkably similar log 2 fold changes (Fig. 6C). 436 Moreover, no genes had log 2 fold changes that were statistically different in the 437 two cell types, although we were underpowered to detect such differences.

439
Overall, our differential expression analysis demonstrated that case-control statistics from the differential expression analysis between cases and controls.

460
The Spearman correlation between our hiPSC-NPC results and the CMC and 461 HBCC results were 0.108 and 0.0661, respectively; for the hiPSC-neurons 462 results, the correlations were 0.134 and 0.0896, respectively (Fig. 6D, SI Fig. 14-463 15). These correlations were highly statistically significant (Fig. 6E) Fig. 14-15). This stronger concordance of hiPSC-neurons (relative to hiPSC- While the concordance with CMC was observed when correcting for any set of 475 CTC scores (or none), the concordance with HBCC was only apparent when 476 correcting for the fibroblast 1 CTC score (SI Fig. 16). This illustrates the 477 importance of accounting for CTC and the fact that concordance can be 478 obscured by biological sources of expression variation. The genes for which the 479 differential expression signal was boosted by accounting for the fibroblast 1 score 480 were enriched for brain and synaptic genesets, including specific biological 481 functions such as FMRP and mGluR5 targets (SI Fig. 17,18).

483
This result indicates that although the concordance between hiPSC-NPCs and  Given the challenges of low statistical power, substantial intra-donor variation,  (Fig. 7). When the cost for each donor and each additional replicate are 545 equal, adding an additional donor will increase the ESS by one unit (Fig. 7A), 546 while adding an additional sample from an existing donor will increase the ESS including additional biological replicates (Fig. 7C,D) In addition to maximizing cohort ESS, future studies will benefit from decreasing 566 intra-donor expression variation by optimizing neuronal differentiation/induction 567 protocols to focus on decreasing cellular heterogeneity (rather than increasing 568 14 total yield). The generation of single cell sequencing datasets from hiPSC-NPCs 569 and/or hiPSC-neurons will further yield a custom reference panel with which to 570 improve CTC deconvolution. In fact, our results suggest that to maximize ESS 571 while minimizing associated costs, it may be sufficient to focus on hiPSC-NPCs 572 rather than hiPSC-neurons. Given our improved understanding of the challenges 573 associated with studying highly polygenic diseases as well as the biological 574 constraints encountered here, disease signal will be further improved by reducing 575 disease heterogeneity through focusing on cohorts of patients with shared 576 genetic variants and/or the genetic engineering of isogenic hiPSC lines to 577 introduce or repair SZ-relevant variants.