Data integration of bulk and single-cell transcriptomics from cerebral organoids and post-mortem brains to identify cell types and cell type specific driver genes in autism


 Human-derived cerebral organoids demonstrate great promise for identifying cell types and cell type specific molecular processes perturbed by genetic variants associated with neuropsychiatric and neurodevelopmental disorders, which are notoriously challenging to study using animal models. However, considerable challenges remain in achieving robust, scalable and generalizable phenotyping of organoids to discover cell types and cell type specific genes. We perform RNA sequencing on 71 samples comprising 1,420 cerebral organoids from 25 donors, and describe a framework (Orgo-Seq) to integrate bulk RNA and single-cell RNA sequence data from human post-mortem brains and cerebral organoids, for the identification of cell types and cell type specific individual genes. We apply Orgo-Seq for two autism-associated loci: 16p11.2 deletions and 15q11-13 duplications, and identify neuroepithelial cells as critical cell types for 16p11.2 deletions, and discover novel and previously reported cell type specific driver genes. Finally, we validated our results that mutations in the KCTD13 gene in the 16p11.2 locus lead to imbalances in the proportion of neuroepithelial cells, using CRISPR/Cas9-edited mosaic organoids. Our work presents a quantitative technological framework to integrate multiple transcriptomics datasets to identify cell types and cell type specific driver genes associated with complex diseases using cerebral organoids.


DNA was extracted from the iPSCs (Supplementary
). All controls were confirmed not to harbor any CNVs within the two 123 ASD-associated loci in 16p11.2 and 15q11-13.

124
We differentiated cerebral organoids using the 25 iPSCs for 46 days, by adapting a 125 previously described method 12 (Table 1, Supplementary Fig. 1). To reduce batch effects and 126 biases in cell type compositions that was previously reported in individual organoids 1 , we 127 differentiated individual organoids in single wells of 24-well plates on an orbital shaker, and pooled 128 20 individual organoids for the same donor from different wells across different plates into a single 129 replicate. We performed RNA sequencing on 1 to 3 replicates for each donor, resulting in a total 130 of 71 samples (Supplementary Table 1).

Page 7
We compared the standard deviations in gene expression between replicates for each 132 individual (intra-individual), as well as across organoids differentiated from different individuals 133 (inter-individual). We found that there were 860 genes (7.6% of all expressed genes) that showed 134 high intra-individual variability, and 869 genes (7.7% of all expressed genes) that showed high 135 inter-individual variability (Supplementary Fig. 2A-B). These genes with high intra-individual or 136 inter-individual variability were enriched in processes involved in nervous system development,

147
2C-H). Similar to a previous report 5 , we observed significantly higher intra-individual correlations 148 compared to inter-individual correlations (Wilcoxon P=1.03×10 -7 ), confirming that bulk RNA 149 sequence data from the cerebral organoids can reflect biological differences between individuals 150 that are not due to technical differences between replicates differentiated from the same 151 individual.

152
We used variancePartition to identify potential drivers of variation in the RNA sequence 153 data from the organoids 15 , and found that most of the variation in the data was unaccounted for 154 by 8 sample variables: ethnicity, sex, age, origin of sample used for iPSC reprogramming, type of 155 reprogramming, center that distributed the iPSC line, ASD diagnosis and CNV genotype Page 8 (Supplementary Fig. 3A). Gene ontology enrichment of the genes with >99% variance explained 157 by the residuals showed that these genes are enriched in the mitochondrial envelope 158 (Supplementary Fig. 3B).

159
Principal components analyses on all case and control samples showed that most of the 160 variance in gene expression (88%) can be accounted for by the first principal component (PC1) 161 alone (Supplementary Figs. 4A-G). We further observed that age, origin of sample and the type 162 of reprogramming are significantly correlated with PC1 alone, but not with the second and third 163 principal components (Supplementary Fig. 5). These results suggest that PC1 is a surrogate 164 variable for age, origin of sample and type of reprogramming, and subsequently, we included PC1 165 as a covariate in the differential expression analyses.

167
Transcriptome data in cerebral organoids accurately reflect copy number changes 168 It had been previously reported that bulk RNA sequence data from the cerebral organoids 169 are highly correlated with bulk RNA sequence data from fetal brains 5 , and we similarly observed 16p11.2 deletions (whom we termed as "probands"), 6 individuals with 16p11.2 deletions but are 177 not clinically diagnosed with ASD (whom we termed as "resilient" individuals), and 12 control 178 unaffected individuals without 16p11.2 deletions ( Table 1). For differential expression analyses, 179 we used linear regression with the first principal component as a covariate, and performed 180 multiple hypotheses correction using the Benjamini-Hochberg false discovery rate (FDR). We

Page 9
further checked the first 2 principal components, but did not observe major stratification between 182 the cases and controls (Supplementary Fig. 7). 183 We performed three sets of differential expression analyses on RNA sequence data from   duplications, only one copy of these genes on their paternal chromosomes is expressed.

233
Therefore, when comparing the expression of these imprinted genes in the ASD probands with 234 controls, we expect to observe a fold change of ~1 for these 5 genes.

235
Out of the 3 genes with fold changes of >1 but are not significantly differentially expressed 236 (OCA2, NIPA1, GABRB3), both OCA2 and NIPA1 have been reported to show biallelic expression 237 from both paternal and maternal chromosomes. A previous literature that looked at post-mortem 238 brain samples from individuals with 15q11-13 duplications found that there is no increased 239 expression of GABRB3 in the brains despite the increased copy number 19 , showing that there is 240 possibly silencing of the maternal copy of GABRB3 in postmortem human brain samples.

241
We did not detect smaller duplications that might encompass only a subset of these genes 242 in the 15q11-13 locus for these individuals with ASD, using aCGH and whole-exome sequencing 243 (Supplementary Tables 2-3

254
Of the 9 genes that are differentially expressed, 3 of them (TUBGCP5, CYFIP1 and HERC2) are

269
recent studies in mice and zebrafish with deleted KCTD13 did not observe increased brain sizes 270 or neurogenesis in these mutant animal models 31,32 . In the absence of human fetal brains with 271 16p11.2 deletions that could be used to resolve these conflicting results from animal models 33 , 272 the use of donor-derived cerebral organoids could be good models to provide supporting results.

273
We obtained scRNA-seq data from a recent publication that had found 10 major clusters

280
Using these genes as a reference panel for defining cell types in cerebral organoids, we 281 evaluated if the differentially expressed genes identified from our bulk RNA sequence data 282 between the cases and controls, are preferentially enriched for cell type specific genes in any of 283 the 10 cell types (Fig. 1). We developed a statistic termed CellScore, which is the difference

Page 13
between the weighted sum of all cell type specific genes and the weighted sum of all non-cell type 285 specific genes for each cluster of cell types, and the weights are the -log 10 (P-values) from our 286 differential expression results in cerebral organoids. This allows us to identify transcriptomic 287 signatures arising from the cell type specific genes for each cluster, rather than the non-cell type 288 specific genes contributing to multiple clusters. Next, we evaluated the significance of our 289 observed CellScores using permutations (Supplementary Fig. 8).

290
When we applied the CellScore evaluation to the differential expression results from the

297
When we applied the CellScore evaluation to the differential expression results from the 298 15q11-13 organoids, we found that there were no cell clusters that were significantly perturbed 299 with a threshold of P(CellScore) ≤ 0.01, although the top cell cluster identified was the stem cell

Page 14
We previously engineered reciprocal deletion and duplication of 16p11.2 in an isogenic

320
Similarly, we observed that 9,500 genes were expressed in both the neural stem cells with

357
Similar to the identification of driver versus passenger genes in cancers, it has been challenging 358 to identify which of the genes in these ASD-associated CNV loci are more likely to be driver genes.

Page 16
The prioritization of candidate driver genes, or combinations of genes, is important for follow-up 360 studies, for instance, to create single-gene knockouts in animal models or organoids for 361 understanding the biological effects of knockouts in these genes 29 .

368
In the 15q11-13 locus encompassing 11 genes, several studies have identified UBE3A as 369 the major causal gene for ASD 41,42 . Although there is supporting evidence for other candidate 370 causal genes such as CYFIP1 and HERC2 in the locus 43,44 , there is also evidence supporting that 371 CYFIP1 is not a causal gene in the locus 45 . Whole-exome sequencing on the iPSCs from one of 372 the ASD probands with 15q11-13 duplication (901) and her unaffected mother (902) showed that 373 they harbored a rare stop-gained mutation (p.Q3441X) in HERC2, which is one of the genes in 374 the 15q11-13 locus.

375
The expression for the genes in the 16p11.2 and 15q11-13 loci range from the 1.8 th to 91 st 376 percentiles detected from bulk RNA sequencing (Supplementary Table 14), and the expression 377 for most of these genes cannot be detected from sequencing a relatively small number of cells 378 using scRNA-seq 1 . We hypothesized that bulk RNA sequence from the patient-derived cerebral 379 organoids can be harnessed to identify candidate driver genes in these CNV loci. One approach 380 for evaluating the functional effects of genes is to quantify the effects of transcriptomic 381 perturbations in the cerebral organoids. Our assumption is that candidate driver genes are likely 382 to result in more perturbations in downstream genes than the candidate passenger genes. To 383 identify downstream targets of each gene in an unbiased manner, we first calculated the Page 17 from RNA sequencing in the BrainSpan Project, and used the correlations in expression from the 386 BrainSpan Project as a proxy for co-expression connectivity with our genes of interest. Next, we 387 developed a statistical method termed GeneScore, which is a weighted sum of the co-expression 388 connectivity, and the weights are the -log 10 (P-values) from our differential expression analyses.

389
As a normalization factor, we used the genomic control, which is the ratio of the observed median 390 to the expected median test statistic 46 .

391
Among the 22 genes in the 16p11.2 locus that are expressed in cerebral organoids, 20 of 392 these genes are also expressed in post-mortem brain samples from the BrainSpan Project.

393
Similarly, when we calculated GeneScores all using all genes detected from the 16p11.2 organoid 394 RNA sequence data, we found that we were unable to prioritize any of the 11 genes in the 16p11.2 confidence candidate driver genes with FDR≤0.1 in cluster 9. Interestingly, we did not find any 413 high-confidence candidate gene with FDR≤0.05 in cluster 6, and only 1 lower confidence 414 candidate gene with FDR≤0.1 in cluster 6 (YPEL3).

415
One of the 3 high-confidence driver genes in cluster 9 (KCTD13) was initially implicated 416 as a gene that modulated brain sizes in zebrafish 30 , but recent studies using KCTD13-deficient 417 mice and zebrafish did not observe any differences in brain sizes or neurogenesis 31,32 . Through 418 the Orgo-Seq framework on patient-derived cerebral organoids, we found that KCTD13 is one of

Page 19
Given that cluster 10 comprising of mainly stem cells was the top prioritized cell type for 437 the 15q11-13 locus, we adapted our GeneScore calculations to include only cell type specific 438 genes that were identified in cluster 10, and found HERC2 and TUBGCP5 as significant candidate 439 driver genes at FDR≤0.1 (Fig. 3D, Supplementary Tables 15-16). Homozygous missense and

Page 20
This validation approach allows us to test our observed results from RNA sequence data 463 obtained from the patient-derived cerebral organoids using an orthogonal approach with cell type 464 specific protein markers on the CRISPR-edited mosaic cerebral organoids. If KCTD13 mutations 465 do not affect a specific cell population, we expect to observe that the mutations are not    the same conditions. It is interesting to note that we were unable to prioritize any candidate driver 527 genes when using co-expression patterns of all genes whose expression were detected in the 528 cerebral organoids, but we were able to nominate genes that appeared to drive co-expression 529 patterns within specific cell types, emphasizing the power of evaluating cell type specificity 65 .

530
These approaches will become increasingly valuable in cross-disorder studies where etiological 531 overlap has been identified, such as in neuropsychiatric disorders. Cerebral organoids can be 532 powerful model systems to evaluate cell type specific commonalities in disease processes using 533 a genotype-driven approach.

534
A major strength of using donor-derived organoids for discoveries is that the donor-derived 535 organoids can model the diverse genetic backgrounds found in humans, and overcome some of 536 the limitations faced with using isogenic iPSC derivatives or inbred animal models. As such, it will Page 23 be increasingly important to develop technologies and methods that enable unbiased high-538 throughput discoveries using donor-derived organoids, to leverage on the unperturbed complexity 539 of human genetics for making important discoveries in disease biology.

540
In our work, we describe the Orgo-Seq framework to allow the identification of cell types 541 and cell type specific driver genes from donor-derived cerebral organoids that are important in

550
In addition, as high-quality scRNA-seq data are generated from increasingly large 551 numbers of single cells 68 , or scRNA-seq data are generated using new spatial-informative 552 technologies 69,70 , the Orgo-Seq framework allows us to integrate new scRNA-seq data with the 553 bulk RNA sequence data that we had already generated from our donor-derived organoids to 554 make new discoveries about cell types and cell type specific driver genes. The framework can 555 also be generalized for identifying cell types and cell type specific driver genes using bulk RNA 556 sequence data that had been generated from human post-mortem brains, without the need to 557 perform scRNA-seq directly on post-mortem brain samples with limited availability.

558
In our current study, we found that we were able to observe transcriptomic differences that 559 can shed insights into the critical cell types and cell type specific processes that are important in 560 neurodevelopmental and neuropsychiatric disorders such as ASD using early 46-day-old cerebral 561 organoids. Prior work demonstrated that even in these early 1-2 month-old cerebral organoids, 562 there are robust transcriptomic and cellular differences that could be detected for

Page 24
neurodegenerative diseases such as Alzheimer's Disease 71,72 . It will be interesting use Orgo-Seq 564 to integrate additional scRNA-seq data from cerebral organoids across developmental timepoints 565 or human post-mortem brain tissue to obtain new insights into the disease biology of 566 neurodevelopmental and neurodegenerative diseases.

567
We have also demonstrated a validation approach to rapidly create mosaic cerebral     Table 4).

Page 27
We adapted our cerebral organoid differentiation protocol according to a previously 631 described protocol 2 (Supplementary Fig. 1A). For embryoid body formation, cells were counted 632 using an automated cell counter and 900,000 iPSCs were re-suspended in 15ml of mTeSR

696
After selecting the transcript with the highest mean FPKM across all samples (including 697 all cases and controls) for each gene, there were 25,727 unique transcripts or genes. We further 698 performed quality control to remove genes that were not expressed, or had high intra-individual 699 or inter-individual variance. Genes that were not expressed in the cerebral organoids (mean 700 FPKMs across all samples < 2) were removed, resulting in a smaller set of 11,300 genes. We

709
With every technology or system, there are some measurements that will be made below 710 the background noise, or below the technical sensitivity of the system. These measurements are 711 usually not relied upon because there is low confidence in the accuracy of the measurements.

712
Similarly, we identified some genes from the bulk RNA sequence data that are highly variable in 713 their expression, and we cannot confidently estimate the expression of these genes using our The BrainSpan project (http://www.brainspan.org) provides a high-resolution map of 727 22,326 genes detected using RNA sequencing on 578 post-mortem brain samples from various 728 brain regions in prenatal brains (8 pcw) to adult brains (40 years old) 11 . We downloaded the "RNA-729 Seq Gencode v10 summarized to genes" dataset from the BrainSpan Project for our analyses 730 (http://www.brainspan.org/static/download.html). For comparing RNA sequence data from 731 prenatal brain samples from the BrainSpan Project with RNA sequence data from cerebral 732 organoids, we included only brain regions where more than 50% of samples were available for Page 31 those regions (≥9 samples). We performed two-sided Wilcoxon rank-sum test to evaluate if the 734 mean Pearson's correlations between the organoids and prenatal brain samples were significantly 735 higher than the mean Pearson's correlations between the organoids and postnatal brain samples.

736
We further calculated Pearson's correlations for each pair of genes from the BrainSpan RNA 737 sequence data.

738
We observed that after removing highly variable genes, the Pearson's correlations 739 between RNA sequence data from the organoids with the 578 post-mortem brain samples ranged 740 from 0.21 to 0.82 (Supplementary Fig. 6A). Prior to removing highly variable genes, the 741 Pearson's correlations between RNA sequence data from the organoids with the post-mortem 742 brain samples ranged from 0.14 to 0.93 (Supplementary Fig. 6D). The larger variance in 743 correlations prior to removing highly variable genes was primarily driven by high outlier  (Supplementary Fig. 4A). We  Fig. 7). Given that the inter-individual correlations observed between samples

782
We permuted the case-control status of each organoid replicate to obtain null distributions.

783
However, given the relatively small numbers of samples, we wanted to avoid creating permuted

Page 33
instances where the permuted cases are actual case samples and the permuted controls are 785 actual control samples. Differential expression analyses on these permuted instances will result 786 in the detection of true biological differences, instead of creating a baseline non-biological 787 measurement for the null distribution. As such, we developed a permutation strategy by sampling 788 permuted case samples from the actual control samples only (Supplementary Fig. 8).

789
Furthermore, to ensure that we have the same numbers of case and control samples in our

Page 34
There are 10 major clusters of cell types identified using unbiased clustering on scRNA-811 seq data from organoids, and each cell cluster has an associated list of genes identified using 812 Drop-seq, and was assigned a cell cluster identity using previously published data from 813 homogeneous cell populations 1 . We downloaded the data from Quadrato et al., and observed that 814 in these full lists of cluster genes, there are some genes that are present in multiple cell clusters,

815
and that these genes are not cell type specific. To enrich for cell type specific genes, we further 816 identified a smaller subset of genes that are uniquely found in each cell type cluster but are not 817 present in other cell type clusters, which we termed as "cell type specific genes" (Supplementary 818   Table 11). We termed the genes that are found in multiple cell clusters as "non-cell type specific We obtained a null distribution for CellScore by performing 100,000 permutations (see

923
Analysis of post-mortem brain samples with 15q11-13 duplications

924
The differential expression results on post-mortem brain samples from the cortex with or 925 without 15q11-13 duplications were downloaded from a prior publication 33 . We calculated

926
CellScores for each of the 10 cell type clusters by using the P-values from the differential 927 expression results for each gene. To calculate P(CellScore), we compared the observed

928
CellScore from the post-mortem brain samples against the null CellScore distributions for each of 929 the 10 cell type clusters generated by the permutations using the expression data from the 930 cerebral organoids with 15q11-13 duplications, accounting for the precise numbers of genes used 931 in the calculations of CellScores from the post-mortem brain samples. We calculated a weighted 932 average P-value for the results from the cerebral organoids with 15q11-13 duplications and the 933 results from the post-mortem brain samples with 15q11-13 duplications (Supplementary Table  Page 39 13), which allows us to evaluate the combined CellScore results from the cerebral organoids and 935 the post-mortem brain samples.

955
We obtained a null distribution for GeneScore by performing 100,000 permutations 956 (Supplementary Fig. 8), and performed linear regressions on the expression data for each 957 permutation. Next, we calculated GeneScore for each gene based on the permuted linear

967
To evaluate the cell type specific GeneScores, we used the differential expression results 968 from the same 100,000 permutations and calculated cell type specific GeneScore using only the 969 specific and non-specific genes in each cell cluster (c1-c10). To estimate the FDR for the cell type The iPSCs used for CRISPR/Cas9-editing were from an unaffected control individual 977 (PGP1). iPSCs were passaged until they were 50-75% confluent prior to nucleofection.

978
Nucleofection was performed using the HSC-1 kit and B-016 protocol on an Amaxa nucleofector.

979
Four gRNAs for KCTD13 and Cas9 protein were ordered from Synthego, and we evaluated the  Fig. 9).