Fine-grained cell-type specific association studies with human bulk brain data using a large single-nucleus RNA sequencing based reference panel

Brain disorders are leading causes of disability worldwide. Gene expression studies provide promising opportunities to better understand their etiology but it is critical that expression is studied on a cell-type level. Cell-type specific association studies can be performed with bulk expression data using statistical methods that capitalize on cell-type proportions estimated with the help of a reference panel. To create a fine-grained reference panel for the human prefrontal cortex, we performed an integrated analysis of the seven largest single nucleus RNA-seq studies. Our panel included 17 cell-types that were robustly detected across all studies, subregions of the prefrontal cortex, and sex and age groups. To estimate the cell-type proportions, we used an empirical Bayes estimator that substantially outperformed three estimators recommended previously after a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data. This is important as being able to precisely estimate the cell-type proportions may avoid unreliable results in downstream analyses particularly for the multiple cell-types that had low abundances. Transcriptome-wide association studies performed with permuted bulk expression data showed that it is possible to perform transcriptome-wide association studies for even the rarest cell-types without an increased risk of false positives.


Method
This section summarizes the methods.Details are given in the supplemental material (e.g., S1.1 refers to Sect.1.1 in the supplemental material).
s n RNA-seq data sets, quality control and data processing.We downloaded FASTQ files from seven published s n RNA-seq in post-mortem brain samples [17][18][19][20][21][22][23] .All brain regions involved the prefrontal cortex, predominantly from Brodmann areas BA6, BA8, BA9, BA10, and BA24.To avoid confounding the expression values in the panel by disease processes or disease specific cell states, only the unaffected "controls" from these studies were used.
All seven studies partitioned nuclei using the Chromium Controller (10X Genomics) and sequenced the libraries on Illumina platforms.We used the cellranger 24 software for aligning the reads to GRCh38 and creating a matrix of unique molecular identified (UMI) counts (i.e., the number of unique molecules for each gene detected in each nucleus).s n RNA-seq data primarily yields reads derived from mature spliced RNA (mRNA), which maps to exonic regions but may also capture unspliced pre-mRNA transcripts that can generate intronic reads [25][26][27] .As nuclei contain a relatively large fraction of pre-mRNA molecules and such molecules are particularly abundant in brain tissue 28 , to obtain a comprehensive picture of gene expression we counted intronic reads as well 29 .
We performed quality control (QC) on samples and nuclei using exactly the same criteria across all studies.Specifically, we eliminated samples with very high levels of debris (Figure S1).In addition, we removed nuclei with very low (indicating low-quality nuclei or empty droplets) or high (indicating "multiplets" that capture expression levels of multiple nuclei) gene and UMI counts (Figures S2 and S4).Finally, nuclei with a high percentage of reads mapping to mitochondrial genes (possible indicating artifacts stemming from sample preparation) were eliminated.
For each study separately, the QC' ed count data was log-normalized to obtain more normal distributions and reduce effects of possible outliers.Furthermore, to avoid that highly expressed genes dominate the cluster analyses, genes were given equal weight by scaling the log-normalized count data to have a mean of zero and a standard deviation of one.

Clustering.
To identify cell-types, we performed a cluster analysis in Seurat 30 (S.1.1).We "anchored" the different datasets in a shared cluster space to facilitate their integration 31 .The cluster analysis was limited to the 2,000 genes that exhibited the highest nucleus-to-nucleus variation (i.e., highly expressed in some nuclei and lowly expressed in others) 32 .There are potentially a large number of donor-level covariates (e.g., medication use, cause of death, cDNA yield, post-mortem interval) that may obscure the separation of clusters.However, as many nuclei are assayed from the same donor, we can remove the effects of donor-level confounders by controlling for the factor "donor".Technically this was achieved by regressing out "indicator" variables for the donors (i.e., for each donor these is one variables that has a value of 1 for that donor and is zero for all other donors).Thus, the data are analyzed as deviations from the donor specific means so that any variable that contributes to differences between donors will no longer affect the measurements.To control for nuclei-level confounding, we also regressed out the QC indices listed above.
Deconvolution.Deconvolution involves three steps.First, a reference panel 33,34 is created (S1.2).To select genes for the panel, we used MAST 35 that performs significance tests to identify the genes that best discriminate between the cell-types.The expression values from the s n RNA-seq data were scaled to have a mean of zero and variance of one for each study, and then an average expression value was computed across all studies.This was done to avoid that the panel was dominated by specific studies (e.g., large studies) and ensure that the derived cell-types were robustly identified across all studies.
Second, the reference panel in combination with the bulk RNA-seq data is used to estimate cell-type proportions in each bulk sample.To estimate the cell-type proportions, we use the standard linear model 36 but estimated by empirical Bayes 37 (EB) to enable precise estimation of potentially multiple low abundant cell-types (for estimation details see S1.3.1 and R code is provided at https:// github.com/ ejvan denoo rd/ Empir ical-Bayes-estim ation-of-cell-type-propo rtions).The mean and the standard deviation of estimates produced by fitting the same model subject to a non-negativity constraint for the regression coefficients (i.e., the cell-type proportions) was used as the prior distribution.A recent paper 16 performed a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data.Based on their performance, the authors recommended CIBERSORT 38 , dtangle 39 or MuSiC 40 .To evaluate our estimator, we compared it with these three methods.For CIBERSORT we the latest version, called CIBERSORTx 41 , as used for the web version of the software.
Third, the estimated cell-type proportions are used to perform cell-type specific association studies with bulk data, This is done by fitting, for each transcript, the model as described elsewhere 12 (see also S1.3.2).These association analyses were performed using the Bioconductor package RaMWAS 42 .
Demonstration bulk RNA-seq dataset.Bulk RNA-seq data was generated using tissue from BA10 of from 291 control individuals and 304 individuals that were diagnosed with a psychiatric disorder (S1.4).All experimental protocols were approved by the local IRB at Virginia Commonwealth University and informed consent was obtained from all subjects and/or their legal guardian(s).All methods were performed in accordance with the relevant guidelines and regulations by including a statement in the methods section.The RNA-seq data was generated using the TruSeq Stranded Total RNA library kit.The sequenced reads were aligned with HISAT2 (v.2.1.0)and transcriptome assembly was performed with StringTie 43 .All analyses (i.e., cell-type proportion estimation and deconvolution analyses) regressed out the covariates: sex and age, indicator variables to account for possible brain banks effects, and assay-related covariates such as total number of reads and the percentage of reads aligned.Furthermore, to account for remaining unmeasured sources of variation, six principal components (as suggested by the scree plot) that were used as covariates after regressing out the measured covariates from the bulk RNA-seq data.

Results
Sample description and QC.The seven studies included s n RNA-seq data from 94 unaffected "control" subjects.The mean age was 61.6 years (SD = 28.6 years) with the 5th/95th percentiles of 12.7/90.0years indicating a very broad range.The subjects comprised 37% females.The post-mortem interval was 19.6 h (SD = 15; 5th/95th percentile of 2.5/49.4h).
Table S1 lists assay related statistics.In summary, we observed an average of 65,118 reads per nucleus.Of these reads, 93.6% mapped to the genome with 78.8% of reads having nucleus-associated barcodes.Using the same criteria for all seven studies, we quality controlled samples and nuclei (S2.1, Figures S1-S5).Two studies had many more nuclei per donor (34,342 and 22,831 nuclei) than the other five studies (mean 5154 nuclei).To avoid that the clustering was mainly driven by these two studies, we down-sampled their nuclei to 8,562 and 8,567 to obtain an average of 5,547 nuclei (range 1426-10,039 nuclei) across all seven studies.After QC and down-sampling, 353,146 nuclei from 92 donors remained.

Clustering and cell-type labeling.
Clustering identified 20 groups of nuclei, 17 of which were observed in all seven studies.The three clusters that were not consistently observed were removed from further analyses.Figure 1 visualizes the cell-type clusters.To plot the clusters, which differ on many dimensions, in a two-dimensional space we used Uniform Manifold Approximation and Projection (UMAP).
Figure S6 provides a dotplot for the markers used to label the cell-types and Figure S7a-e shows heatmaps of the overlap between the cell-type labels assigned in this study and the label assigned in the five of the seven original studies that provided nuclei labels.These results are further summarized in Table S2 provides for each of our cell-type clusters a list of the most highly expressed markers as well as the most frequently assigned original cell-type label in the five studies that provided labeled nuclei.
Of the 17 clusters, 14 could readily be labeled using standard markers.Although it should be noted that only two studies attempted labeling subtypes of broad groups of nuclei (e.g., excitatory neurons), the nuclei of the 14 clusters were consistently labeled by the five studies that provided the original cell-type labels.These 14 clusters included one of the two clusters of oligodendrocytes (OLI.1) 44, oligodendrocyte precursor cells (OPC) 44 , astrocytes(AST) 45 and microglia (MGL) 46 .Four clusters of interneurons (IN) were identified that could further be labeled based on the expression of somatostatin (IN.SST), parvalbumin (IN.PV), vasoactive intestinal peptide (IN.VIP), and synaptic vesicle glycoprotein 2C (IN.SV2C) 47 .Finally, seven groups of excitatory neurons were identified.These neurons were further subdivided into one cluster of upper-layer (EX.UL) neurons and four clusters of deep-layer (EX.UL1-EX.UL4) neurons all expressing FOXP2 and subsets of other standard layerspecific markers.Furthermore, we observed neurons expressing neurogranin (EX.NRGN).
Three clusters could not unequivocally be labeled with standard markers and were also inconsistently labeled across the five studies that provided labels for individual nuclei.First, we observed a cluster expressing standard markers for both endothelial cells 48 and pericytes 37 .In the original studies these nuclei were labeled as endothelial cells 48 , pericytes 37 , or as a combined cluster of endothelial cells and pericytes.As these nuclei most likely included both endothelial cells and pericytes that have very similar expression profiles relative to the other clusters in Fig. 1, were labeled this cluster END/PER.
Second, albeit at relative modest levels compared to OLI.1, the second cluster of oligodendrocytes (OLI.2) expressed standard oligodendrocytes markers MBP, PLP1, and MOBP.In addition, we observed the expression of NRGN, CAMK2A, and CAMK2B that share a motif with MBP potentially allowing it to be packaged together for cytoplasmic transport to dendrites 49 .Three studies labeled these nuclei as oligodendrocytes and the other two studies as neurons.Neurons can use the same packaging mechanism for cytoplasmic transport of the RNAs to dendrites and this potentially explains the confusion about the identity of this second group of oligodendrocytes.
Third, a cluster of EX neurons expressed only few of the markers expressed by the other EX clusters and was inconsistently labeled with respect to cortical layer in the two studies that labeled EX subtypes.This EX cluster expressed NRG1 at very high levels (EX.NRG1).NRG1 is expressed in multiple cell-types and best known as a gene affecting a range of psychiatric and neurological disorders such as Alzheimer, autism and schizophrenia 50,51 .
To learn more about the identity of this cluster, we selected the ten most highly expressed genes from the reference panel.Six of the ten genes were previously reported to be associated with a range of psychiatric and neurological disorders.In addition to NRG1 50,51 , this included ZNF804B 52 , CDH12 53 , CLSTN2 54,55 , RIT2 56 , and MCTP1 57 .This pattern is somewhat reminiscent of so-called Von Economo neurons (VENs) that are known to be altered in diseases such as Alzheimer, autism, and schizophrenia [58][59][60] .VENs are found in humans and great apes (but not other primates), cetaceans, and elephants, and may have evolved for the rapid transmission of crucial social information in very large brains 61 .In humans, VENs are abundant in the anterior cingulate and frontoinsular cortices but are also present in the prefrontal cortex 62 .A recent study involving 879 nuclei from frontoinsula layer 5 identified several VEN markers, but these markers were not highly expressed in our cluster.S3 gives the MAST 35 test results identifying 1,652 genes for the reference panel (Table S4).The EB estimation procedure was first evaluated using artificial bulk data.We generated artificial bulk data using the cell-type specific expression values from the panel in combination with celltype proportions that were randomly drawn from a generalized beta distribution assuming the mean, standard deviation minimum, and maximum of the cell-type proportions observed in our demonstration bulk RNA-seq dataset.www.nature.com/scientificreports/CIBERSORTx Table 1 shows a comparison of the EB methods with methods previously recommended after a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data 16 .The MuSiC package is specifically designed for s n RNA-seq data.We had to down-sample the number of nuclei to avoid excessive run times.We could also not replicate our previous observation that MuSiC produces superior estimates 63 .Instead, results were so poor that they likely indicated convergence problems so that this estimator was not further considered.Table 1 shows that all estimators were unbiased.However, the EB estimator was substantially more precise.Thus, compared to EB the RMSEs for CIBERSORTx and dtangle were five (0.025 vs. 0.005) and 6.8 (0.034 vs. 0.005) times larger.This increased precision translated to systematically higher correlations (0.936 vs. 0.846 and 0.784 for CIBERSORTx and dtangle) between the true and estimated cell-type proportions.These correlations remained satisfactory even for rare cell-types.Although cell-type proportions close to zero can be estimated at zero by chance, a large number of zeroes may indicate problems with estimating low abundances precisely.CIBERSORTx produced a relatively large number of zeroes particularly for low abundant cell-types.In contrast, dtangle did not produce any zeroes but as the other indices indicated that this estimator had the lowest precision this may be interpreted as meaning that it estimates low abundances precisely.

Cell-type proportion estimation. Table
We also studied the performance under less ideal circumstances.That is, to simulate a mismatch between panel and bulk data, we replaced 50% of all bulk gene expression data with a random value and also increased the error in the bulk data by a factor 10. Table 2 shows that the pattern of results mimicked those from Table 1.Although the RMSE and correlation decreased the EB seemed robust producing estimates that could still be used in research.
Cell-type specific association studies.Figure 2 shows that the mean of the cell-type proportion estimates in our demonstration bulk RNA-seq dataset was highly correlated with the mean s n RNA-seq counts (r = 0.951).Only EX.NRGN showed a notable difference.Given that our simulation study yielded unbiased estimates, this most likely reflects true biological variation between the two sample sets.Similar to what we observed in the simulation study, even for rare cell-types few estimates were estimated to be zero (average 3.2%).
To study whether the distribution of the tests statistics under the null hypothesis followed the assumed theoretical distribution, 1,000 transcriptome-wide association studies (TWASs) where performed after randomly permuting case-control labels.Results showed that for each cell-type lambda (ratio of the median of the observed distribution of the test statistic to the expected median) was close to one (Fig. 3, overall median/mean 1.01/1.03with range 0.90-1.28).This implied the absence of test statistic inflation and that under the null distribution accurate P values are obtained.This was true for even the rarest cell-types suggesting that it is possible to perform TWASs on rare cell-types without an increased risk of false positives.
Overall 11 genes were transcriptome-wide significant when controlling the FDR controlled at the 0.1 level and 105 findings reached "suggestive" significance when controlling the FDR controlled at the 0.1 level FDR controlled at the 0.25 level (i.e., meaning that 10 and 25%, respectively, of the findings are expected to be false).
We studied whether power could be improved by grouping similar cell-types.For this purpose, we performed a principal components analysis (PCA) followed by a varimax rotation on the gene expression values of the Table 1.Evaluate the empirical Bayes estimation procedure.RMSE is the root of the means squared error, Zero is the number of cell-type proportions estimated at zero, and Cor. is the correlation between true and estimated cell-type proportions.www.nature.com/scientificreports/panel.Cell-types with loadings > 0.5 on the same principal component were combined.The PCA suggested that six groups of two cell-types (Table S5) could potentially be combined leaving 11 cell-types.These PCA results corresponded very well with the UMAP plot (Fig. 1) showing proximity of these same cell-types.After grouping 68 genes reached transcriptome-wide significance when controlling the FDR controlled at the 0.1 level and 106 findings reached "suggestive" significance when controlling the FDR controlled at the 0.1 level FDR controlled at the 0.25 level.This the signal improved most likely because in this demonstration dataset the combined celltypes showed similar association signal.

Discussion
We propose a new brain reference panel that allows the detection of differentially expressed genes in human bulk brain data on a fine-grained cell-type specific level.We created the panel through an integrated (mega) analysis of the data from the seven largest s n RNA-seq studies in brain after processing all data in exactly the same way.Our panel included 17 cell-types that were robustly detected across all studies, subregions of the prefrontal cortex, and sex and age groups.Our goal was not to make a complete inventory of all cell-types in the PFC but to create a reference panel that can subsequently be used for cell-type specific association studies.Thus, cell-types that were not observed in all studies were omitted and we used settings in the cluster analyses that would avoid a very large number of cell-type clusters.We believe that the proposed panel strikes a good balance between being fine-grained but not to the extent that all the cell-type proportions can no longer be estimated precisely.
To estimate the cell-type proportions, we proposed an empirical Bayes estimator that yielded highly accurate and unbiased estimates even for the low abundant cell-types.Our estimator substantially outperformed the three estimators recommended previously after a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data 16 .Our panel contains a substantial number of cell-types, several of which had low abundances.A precise estimator is critical as to avoid that downstream analyses with low abundant cell-types produce unreliable results.Furthermore, our estimator has the desirable property that it uses a panel Table 2. Relative robustness of the empirical Bayes estimation procedure.RMSE is the root of the means squared error, Zero is the number of cell-type proportions estimated at zero, and Cor.Is the correlation between true and estimated cell-type proportions.comprising mean expression levels rather than the nuclei level s n RNA-seq data.This prevents working with very large data files and the need to apply for access to obtain all the nuclei level data from repositories.Transcriptome-wide association studies performed with permuted bulk RNA-seq data showed that it is possible to perform TWASs for even the rarest cell-types without an increased risk of false positives.For example, even cell-types with frequencies as low as 1% yielded transcriptome-wide significant results in the absence of test statistic inflation.Furthermore, analyses showed that more significant findings were obtained by grouping similar cell-types.How, this finding may be specific for our demonstration dataset and it is very well possible that in other data sets fewer significant findings are obtained when grouped cell-type so not show associations with the same genes.
The proposed approach requires bulk gene expression data to estimate the cell-type proportions.However, once the cell-type proportions are estimated, we can study cell-type specific associations for any other bulk data generated for the same brain samples (e.g., microRNAs, methylation data, open chromatin data).Although s n RNA-seq studies, and consequently our panel, involve gene expression, we can therefore also study individual transcripts.This is important as only specific transcripts of the gene may be differentially expressed.In these scenarios the study of expression at the gene level will dilute association signals and result in a loss of potentially critical biological information.
Whereas s n RNA-seq assays nuclear RNAs with a poly A-tail (mainly mRNA), our bulk RNA-seq data assayed total RNA from the entire cells which may contain transcripts not present in the nucleus 64,65 .However, the means of the cell-type proportion estimates in our bulk RNA-seq dataset were highly correlated with the cell-type proportion counts observed in the s n RNA-seq data.This suggested that possible differences in expression levels between the panel genes in the nucleus versus the entire cell did not distort the cell-type proportion estimates.
Even with the advent of s n RNA-seq, deconvolution methods are likely to remain pertinent for cell-type specific association studies with brain tissue.This is because the vast majority of existing gene expression data sets involves bulk samples.Deconvolution allows the (re-)use of this "legacy" data to study cell-type specific effects.Deconvolution methods can also potentially be useful to validate findings from s n RNA-seq studies.Such validation with a different technology can eliminate possible false discoveries due to technology specific artefacts and therefore allows for more rigorous conclusions.
In summary, brain disorders are leading causes of disability world-wide.We proposed a new reference panel and precise estimator for the cell-type proportions that allows the use of bulk brain data to study brain disorders www.nature.com/scientificreports/ on a fine-grained cell-type specific level.The use of this approach may prevent that many cell-type specific disease associations remain undetected in studies with bulk data.Furthermore, identifying the specific cell-types from which association signals originate is key to formulating refined hypotheses about the etiology of brain disorders, designing proper follow-up experiments and, eventually, developing novel clinical interventions.The reference panel and easy-to-use accompanying analysis tools are publicly available.

Figure 1 .
Figure 1.A Uniform Manifold Approximation and Projection (UMAP) plot depicting the nuclei of the 17 clusters.The cluster labels are described in the main text.

Figure 2 .
Figure 2. Mean cell-type frequencies observed in the s n RNA-seq data and estimated in bulk samples.