Fine-grained cell-type specific association studies with human bulk brain data using a large single-nucleus RNA sequencing based reference panel

van den Oord, Edwin J. C. G.; Aberg, Karolina A.

doi:10.1038/s41598-023-39864-2

Article
Open access
Published: 10 August 2023

Scientific Reports volume 13, Article number: 13004 (2023)

Abstract

Brain disorders are leading causes of disability worldwide. Gene expression studies provide promising opportunities to better understand their etiology but it is critical that expression is studied on a cell-type level. Cell-type specific association studies can be performed with bulk expression data using statistical methods that capitalize on cell-type proportions estimated with the help of a reference panel. To create a fine-grained reference panel for the human prefrontal cortex, we performed an integrated analysis of the seven largest single nucleus RNA-seq studies. Our panel included 17 cell-types that were robustly detected across all studies, subregions of the prefrontal cortex, and sex and age groups. To estimate the cell-type proportions, we used an empirical Bayes estimator that substantially outperformed three estimators recommended previously after a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data. This is important as being able to precisely estimate the cell-type proportions may avoid unreliable results in downstream analyses particularly for the multiple cell-types that had low abundances. Transcriptome-wide association studies performed with permuted bulk expression data showed that it is possible to perform transcriptome-wide association studies for even the rarest cell-types without an increased risk of false positives.

Introduction

Brain disorders such as mood disorders, dementias, stress related disorders, neurodevelopmental disorders, seizure disorders, and addictions are leading causes of disability worldwide¹. Gene expression studies provide promising opportunities to better understand their etiology. The human brain comprises a diverse set of cell-types^2,3,4. As these cells differ in their functions, gene expression will typically also vary across these cell-types. When studying bulk tissue, this cellular diversity may cause many genes that are differentially expressed in cases and controls to remain undetected⁵. That is, association signals will be “diluted” if they affect only one cell-type, may cancel out if they are of opposite signs across cell-types, and may be undetectable if they involve low-abundant cells.

Identifying the specific cell-types from which association signals originate is also critical for scientific progress and important from a translational perspective. First, it allows formulating refined hypotheses about disease etiology. For example, the involvement of microglia may point to disrupted immune response and neuroinflammation of the brain⁶, a loss of neuronal function may point to neurodegeneration⁷, and the involvement of the myelin-producing oligodendrocytes may suggest disrupted neuronal communication⁸. Second, knowledge about the cell-type is important to design proper in vitro or in vivo functional follow-up studies. Because gene expression may only be altered in specific cells, such studies require the right choice of cultured cells or experimental tools (e.g., the use of herpes simplex virus type 1 as a vector for locus-specific editing is of primary relevance for association findings in neurons^9,10). Third, cell-type knowledge is key for developing novel and effective treatments. For example, drugs often work by interacting with receptors on the surface of cells. Receptor molecules have a specific three-dimensional structure, which allows only substances that fit precisely to attach to them. From a drug development perspective, designing drugs that interact specifically with receptors from particular cell-types is also highly desirable, since non-specific drugs can cause more side effects.

Capitalizing on deconvolution methods, cell-type specific associations can also be studied statistically using bulk RNA-seq data^5,11. Deconvolution was introduced 20 years ago¹¹ and has been experimentally validated using, for instance, predesigned mixtures¹². Deconvolution is most effective when performed with a reference panel¹³. Reference panels comprise the expression profiles of the cell-types present in the target tissue and are typically generated from a small number of samples. The reference panel is used to estimate cell-type proportions in the bulk samples, which in turn are used to perform cell-type specific association studies with bulk data. Reference panels can be created through expression profiling of sorted cells. However, while good nuclear protein markers exist for sorting nuclei into broad groups of neurons and glia, there is a lack of known, high-fidelity antigens and antibodies for further sorting subclasses of these brain cells. A better alternative is therefore to create the reference panel from single cell/nucleus RNA sequencing data¹⁴, which allows a fine-grained analysis of brain cell-types. In comparison to whole cells, nuclei are more resistant to mechanical assaults and are less vulnerable to the tissue dissociation process. This makes single nucleus RNA sequencing (snRNA-seq) the more suitable option for human post-mortem brain tissue¹⁵. With this approach, intact nuclei are first isolated and partitioned so that the content of each nucleus can be labeled with a unique identifier. A labeled sequencing library is subsequently generated and sequenced for each individual nucleus.

A recent paper evaluated multiple reference panels and methods for estimating cell-type proportions for studies in brain¹⁶. However, several of the studied reference panels were small as they involved the first generation of snRNA-seq studies, used specific donor (e.g., only male subjects) and age groups, focused on subregions of the prefrontal cortex, were not fine-grained, or were not derived through a rigorous analysis of large snRNA-seq studies. In this article we combined data from the seven largest published snRNA-seq studies in the prefrontal cortex^{17,18,19,20,21,22,23}, a brain region of key importance for higher level brain processes such as cognition, emotion, and memory. We derived the panel through an integrated analysis after processing all data in exactly the same way. A main advantage of this “mega-analysis” approach is that it reduces study specific technical artefacts. By focusing on cell-types robustly identified across the different studies, the panel will also have more general applicability as it can be used across donor groups and brain regions. To estimate the cell-type proportions needed to perform cell-type specific association studies with bulk data, we propose an estimator that can be used for a fine-grained analysis of brain cell-types, including multiple cell-types that are relatively rare. Finally, we study how to best use the proposed approach to optimize power and avoid false discoveries in empirical transcriptome-wide association studies with bulk data.

Method

This section summarizes the methods. Details are given in the supplemental material (e.g., S1.1 refers to Sect. 1.1 in the supplemental material).

snRNA-seq data sets, quality control and data processing

We downloaded FASTQ files from seven published snRNA-seq studies of post-mortem brain samples^{17,18,19,20,21,22,23}. All brain regions involved the prefrontal cortex, predominantly Brodmann areas BA6, BA8, BA9, BA10, and BA24. To avoid confounding the expression values in the panel by disease processes or disease specific cell states, only the unaffected “controls” from these studies were used.

All seven studies partitioned nuclei using the Chromium Controller (10X Genomics) and sequenced the libraries on Illumina platforms. We used the cellranger²⁴ software for aligning the reads to GRCh38 and creating a matrix of unique molecular identifier (UMI) counts (i.e., the number of unique molecules for each gene detected in each nucleus). snRNA-seq data primarily yield reads derived from mature spliced RNA (mRNA), which map to exonic regions, but may also capture unspliced pre-mRNA transcripts that can generate intronic reads^25,26,27. As nuclei contain a relatively large fraction of pre-mRNA molecules and such molecules are particularly abundant in brain tissue²⁸, we counted intronic reads as well²⁹ to obtain a comprehensive picture of gene expression.

We performed quality control (QC) on samples and nuclei using exactly the same criteria across all studies. Specifically, we eliminated samples with very high levels of debris (Figure S1). In addition, we removed nuclei with very low (indicating low-quality nuclei or empty droplets) or high (indicating “multiplets” that capture expression levels of multiple nuclei) gene and UMI counts (Figures S2 and S4). Finally, nuclei with a high percentage of reads mapping to mitochondrial genes (possibly indicating artifacts stemming from sample preparation) were eliminated.

For each study separately, the QC’ed count data were log-normalized to obtain more normal distributions and reduce the effects of possible outliers. Furthermore, to prevent highly expressed genes from dominating the cluster analyses, genes were given equal weight by scaling the log-normalized count data to have a mean of zero and a standard deviation of one.
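
The two processing steps described above can be sketched in plain Python. This is an illustration only, not the pipeline used in the study: the scale factor of 10,000 and the function names `log_normalize` and `scale_genes` are assumptions, and the toy count matrix is invented.

```python
import math
from statistics import mean, pstdev


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize UMI counts; `counts` holds one list of gene counts per nucleus.

    The scale factor is an assumed, commonly used default, not a value from the text.
    """
    normalized = []
    for nucleus in counts:
        total = sum(nucleus)  # assumed > 0 because empty droplets were removed in QC
        normalized.append([math.log1p(c / total * scale_factor) for c in nucleus])
    return normalized


def scale_genes(matrix):
    """Scale every gene (column) to mean 0 and standard deviation 1."""
    n_genes = len(matrix[0])
    scaled = [[0.0] * n_genes for _ in matrix]
    for g in range(n_genes):
        column = [row[g] for row in matrix]
        m, s = mean(column), pstdev(column)
        for i, value in enumerate(column):
            # A gene without any variation carries no clustering signal; keep it at 0.
            scaled[i][g] = (value - m) / s if s > 0 else 0.0
    return scaled


# Toy matrix: 3 nuclei x 3 genes (invented numbers).
counts = [[10, 0, 5], [2, 8, 1], [0, 3, 30]]
scaled = scale_genes(log_normalize(counts))
```

After scaling, every gene contributes on the same scale regardless of how highly it is expressed.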

Clustering

To identify cell-types, we performed a cluster analysis in Seurat³⁰ (S1.1). We "anchored" the different datasets in a shared cluster space to facilitate their integration³¹. The cluster analysis was limited to the 2,000 genes that exhibited the highest nucleus-to-nucleus variation (i.e., highly expressed in some nuclei and lowly expressed in others)³². There are potentially a large number of donor-level covariates (e.g., medication use, cause of death, cDNA yield, post-mortem interval) that may obscure the separation of clusters. However, as many nuclei are assayed from the same donor, we can remove the effects of donor-level confounders by controlling for the factor “donor”. Technically this was achieved by regressing out “indicator” variables for the donors (i.e., for each donor there is one variable that has a value of 1 for that donor and is zero for all other donors). Thus, the data are analyzed as deviations from the donor specific means so that any variable that contributes to differences between donors will no longer affect the measurements. To control for nuclei-level confounding, we also regressed out the QC indices listed above.
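
The statement that regressing out donor indicators amounts to analyzing deviations from donor specific means can be made concrete with a small sketch. It assumes the donor indicators are the only predictors; the additional nuclei-level QC covariates mentioned above are left out, and the function name and toy values are hypothetical.

```python
from collections import defaultdict


def remove_donor_effects(values, donors):
    """Return each value as a deviation from the mean of its donor.

    With one indicator variable per donor and no other predictors, the
    least-squares residuals are exactly these within-donor deviations.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, donor in zip(values, donors):
        sums[donor] += value
        counts[donor] += 1
    return [value - sums[donor] / counts[donor] for value, donor in zip(values, donors)]


# Expression of one gene in five nuclei from two donors (invented numbers).
expression = [1.0, 3.0, 10.0, 14.0, 12.0]
donors = ["A", "A", "B", "B", "B"]
residuals = remove_donor_effects(expression, donors)
```

Donor B has a much higher overall level than donor A, but after centering only the within-donor differences between nuclei remain.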

Deconvolution

Deconvolution involves three steps. First, a reference panel^33,34 is created (S1.2). To select genes for the panel, we used MAST³⁵, which performs significance tests to identify the genes that best discriminate between the cell-types. The expression values from the snRNA-seq data were scaled to have a mean of zero and variance of one for each study, and then an average expression value was computed across all studies. This was done to prevent the panel from being dominated by specific studies (e.g., large studies) and to ensure that the derived cell-types were robustly identified across all studies.
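
A minimal sketch of the "scale within study, then average across studies" idea follows. It simplifies the procedure: scaling is applied here to per-study cell-type means for each gene, whereas the text scales the snRNA-seq expression values themselves, and the data layout, function names, and numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to mean 0 and variance 1 (assumes they are not all equal)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def build_panel(studies):
    """Average standardized expression across studies, one equal vote per study.

    `studies` is a list of dicts mapping gene -> {cell_type: expression}.
    """
    panel = {}
    for gene in studies[0]:
        per_study = []
        for study in studies:
            cell_types = sorted(study[gene])
            z = standardize([study[gene][ct] for ct in cell_types])
            per_study.append(dict(zip(cell_types, z)))
        panel[gene] = {ct: mean(s[ct] for s in per_study) for ct in per_study[0]}
    return panel


# Two hypothetical studies that measure the same gene on very different scales.
study_small = {"MBP": {"OLI": 10.0, "AST": 2.0, "MGL": 0.0}}
study_large = {"MBP": {"OLI": 1000.0, "AST": 200.0, "MGL": 0.0}}
panel = build_panel([study_small, study_large])
```

Because each study is standardized before averaging, the study with the larger raw values does not dominate the panel entry.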

Second, the reference panel in combination with the bulk RNA-seq data is used to estimate cell-type proportions in each bulk sample. To estimate the cell-type proportions, we used the standard linear model³⁶ but estimated it by empirical Bayes³⁷ (EB) to enable precise estimation of potentially multiple low abundant cell-types (for estimation details see S1.3.1; R code is provided at https://github.com/ejvandenoord/Empirical-Bayes-estimation-of-cell-type-proportions). The mean and the standard deviation of estimates produced by fitting the same model subject to a non-negativity constraint for the regression coefficients (i.e., the cell-type proportions) were used as the prior distribution. A recent paper¹⁶ performed a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data. Based on their performance, the authors recommended CIBERSORT³⁸, dtangle³⁹ or MuSiC⁴⁰. To evaluate our estimator, we compared it with these three methods. For CIBERSORT we used the latest version, called CIBERSORTx⁴¹, as used for the web version of the software.
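
The general idea of combining a linear model with a normal prior can be illustrated as follows. This is not the estimator from S1.3.1 or the authors' R code: it only shows the textbook posterior mean of regression coefficients under independent normal priors, with the prior means and standard deviations supplied directly rather than derived from non-negative least-squares fits, and with an assumed known residual variance. All names and numbers are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def posterior_mean(panel, bulk, prior_mean, prior_sd, noise_var):
    """Posterior mean of the proportions for one bulk sample.

    panel: genes x cell-types reference values; bulk: one expression value per gene.
    Solves (X'X / noise_var + P) beta = X'y / noise_var + P mu, with P the
    diagonal matrix of prior precisions 1 / prior_sd**2.
    """
    k = len(prior_mean)
    lhs = [[sum(g[i] * g[j] for g in panel) / noise_var for j in range(k)] for i in range(k)]
    rhs = [sum(g[i] * y for g, y in zip(panel, bulk)) / noise_var for i in range(k)]
    for i in range(k):
        precision = 1.0 / prior_sd[i] ** 2
        lhs[i][i] += precision
        rhs[i] += precision * prior_mean[i]
    return solve(lhs, rhs)


# Toy panel with 4 genes and 2 cell-types; bulk generated from proportions 0.9 and 0.1.
panel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
bulk = [0.9, 0.1, 1.0, 1.8]
estimate = posterior_mean(panel, bulk, prior_mean=[0.8, 0.2], prior_sd=[0.1, 0.1], noise_var=0.01)
```

In this toy case the estimate lies between the unconstrained least-squares solution (0.9, 0.1) and the prior mean (0.8, 0.2), which is the shrinkage that stabilizes estimates for low abundant cell-types.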

Third, the estimated cell-type proportions are used to perform cell-type specific association studies with bulk data. This is done by fitting, for each transcript, the model as described elsewhere¹² (see also S1.3.2). These association analyses were performed using the Bioconductor package RaMWAS⁴².
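
For orientation, one common way to write such a cell-type specific association model is given below. The exact specification used in the study is the one in S1.3.2; the notation here is an assumption, and covariates are omitted for brevity.

```latex
% y_i      : bulk expression of one transcript in sample i
% p_{ik}   : estimated proportion of cell-type k in sample i
% D_i      : case-control status of sample i (0 = control, 1 = case)
% \mu_k    : mean expression of the transcript in cell-type k among controls
% \beta_k  : case-control difference in cell-type k (the tested parameter)
y_i = \sum_{k=1}^{K} p_{ik}\,\mu_k + \sum_{k=1}^{K} \beta_k \,(p_{ik} \times D_i) + \varepsilon_i
```

Under this formulation, testing whether a given coefficient $\beta_k$ is zero is a test of differential expression in cell-type $k$.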

Demonstration bulk RNA-seq dataset

Bulk RNA-seq data were generated using tissue from BA10 of 291 control individuals and 304 individuals who were diagnosed with a psychiatric disorder (S1.4). All experimental protocols were approved by the local IRB at Virginia Commonwealth University and informed consent was obtained from all subjects and/or their legal guardian(s). All methods were performed in accordance with the relevant guidelines and regulations. The RNA-seq data were generated using the TruSeq Stranded Total RNA library kit. The sequenced reads were aligned with HISAT2 (v.2.1.0) and transcriptome assembly was performed with StringTie⁴³. All analyses (i.e., cell-type proportion estimation and deconvolution analyses) regressed out the following covariates: sex and age, indicator variables to account for possible brain bank effects, and assay-related covariates such as total number of reads and the percentage of reads aligned. Furthermore, to account for remaining unmeasured sources of variation, six principal components (as suggested by the scree plot), computed after regressing out the measured covariates from the bulk RNA-seq data, were used as covariates.

Results

Sample description and QC

The seven studies included snRNA-seq data from 94 unaffected “control” subjects. The mean age was 61.6 years (SD = 28.6 years) with the 5th/95th percentiles of 12.7/90.0 years indicating a very broad range. The subjects comprised 37% females. The post-mortem interval was 19.6 h (SD = 15; 5th/95th percentile of 2.5/49.4 h).

Table S1 lists assay related statistics. In summary, we observed an average of 65,118 reads per nucleus. Of these reads, 93.6% mapped to the genome with 78.8% of reads having nucleus-associated barcodes. Using the same criteria for all seven studies, we quality controlled samples and nuclei (S2.1, Figures S1–S5). Two studies had many more nuclei per donor (34,342 and 22,831 nuclei) than the other five studies (mean 5,154 nuclei). To prevent the clustering from being driven mainly by these two studies, we down-sampled their nuclei to 8,562 and 8,567 to obtain an average of 5,547 nuclei (range 1,426–10,039 nuclei) across all seven studies. After QC and down-sampling, 353,146 nuclei from 92 donors remained.

Clustering and cell-type labeling

Clustering identified 20 groups of nuclei, 17 of which were observed in all seven studies. The three clusters that were not consistently observed were removed from further analyses. Figure 1 visualizes the cell-type clusters. To plot the clusters, which differ on many dimensions, in a two-dimensional space we used Uniform Manifold Approximation and Projection (UMAP).

Figure S6 provides a dotplot for the markers used to label the cell-types and Figure S7a–e shows heatmaps of the overlap between the cell-type labels assigned in this study and the labels assigned in the five of the seven original studies that provided nuclei labels. These results are further summarized in Table S2, which provides for each of our cell-type clusters a list of the most highly expressed markers as well as the most frequently assigned original cell-type label in the five studies that provided labeled nuclei.

Of the 17 clusters, 14 could readily be labeled using standard markers. Although it should be noted that only two studies attempted labeling subtypes of broad groups of nuclei (e.g., excitatory neurons), the nuclei of the 14 clusters were consistently labeled by the five studies that provided the original cell-type labels. These 14 clusters included one of the two clusters of oligodendrocytes (OLI.1)⁴⁴, oligodendrocyte precursor cells (OPC)⁴⁴, astrocytes (AST)⁴⁵ and microglia (MGL)⁴⁶. Four clusters of interneurons (IN) were identified that could further be labeled based on the expression of somatostatin (IN.SST), parvalbumin (IN.PV), vasoactive intestinal peptide (IN.VIP), and synaptic vesicle glycoprotein 2C (IN.SV2C)⁴⁷. Finally, seven groups of excitatory neurons were identified. These neurons were further subdivided into one cluster of upper-layer (EX.UL) neurons and four clusters of deep-layer (EX.UL1-EX.UL4) neurons all expressing FOXP2 and subsets of other standard layer-specific markers. Furthermore, we observed neurons expressing neurogranin (EX.NRGN).

Three clusters could not unequivocally be labeled with standard markers and were also inconsistently labeled across the five studies that provided labels for individual nuclei. First, we observed a cluster expressing standard markers for both endothelial cells⁴⁸ and pericytes³⁷. In the original studies these nuclei were labeled as endothelial cells⁴⁸, pericytes³⁷, or as a combined cluster of endothelial cells and pericytes. As these nuclei most likely included both endothelial cells and pericytes, which have very similar expression profiles relative to the other clusters in Fig. 1, we labeled this cluster END/PER.

Second, albeit at relatively modest levels compared to OLI.1, the second cluster of oligodendrocytes (OLI.2) expressed the standard oligodendrocyte markers MBP, PLP1, and MOBP. In addition, we observed the expression of NRGN, CAMK2A, and CAMK2B, which share a motif with MBP, potentially allowing them to be packaged together for cytoplasmic transport to dendrites⁴⁹. Three studies labeled these nuclei as oligodendrocytes and the other two studies as neurons. Neurons can use the same packaging mechanism for cytoplasmic transport of the RNAs to dendrites and this potentially explains the confusion about the identity of this second group of oligodendrocytes.

Third, a cluster of EX neurons expressed only a few of the markers expressed by the other EX clusters and was inconsistently labeled with respect to cortical layer in the two studies that labeled EX subtypes. This EX cluster expressed NRG1 at very high levels (EX.NRG1). NRG1 is expressed in multiple cell-types and is best known as a gene affecting a range of psychiatric and neurological disorders such as Alzheimer, autism and schizophrenia^50,51. To learn more about the identity of this cluster, we selected the ten most highly expressed genes from the reference panel. Six of the ten genes were previously reported to be associated with a range of psychiatric and neurological disorders. In addition to NRG1^50,51, these included ZNF804B⁵², CDH12⁵³, CLSTN2^54,55, RIT2⁵⁶, and MCTP1⁵⁷. This pattern is somewhat reminiscent of so-called Von Economo neurons (VENs) that are known to be altered in diseases such as Alzheimer, autism, and schizophrenia^58,59,60. VENs are found in humans and great apes (but not other primates), cetaceans, and elephants, and may have evolved for the rapid transmission of crucial social information in very large brains⁶¹. In humans, VENs are abundant in the anterior cingulate and frontoinsular cortices but are also present in the prefrontal cortex⁶². A recent study involving 879 nuclei from frontoinsula layer 5 identified several VEN markers, but these markers were not highly expressed in our cluster.

Cell-type proportion estimation

Table S3 gives the MAST³⁵ test results identifying 1,652 genes for the reference panel (Table S4). The EB estimation procedure was first evaluated using artificial bulk data. We generated artificial bulk data using the cell-type specific expression values from the panel in combination with cell-type proportions that were randomly drawn from a generalized beta distribution assuming the mean, standard deviation, minimum, and maximum of the cell-type proportions observed in our demonstration bulk RNA-seq dataset.
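
The simulation design can be sketched as follows. The sketch assumes that "generalized beta" refers to a beta distribution rescaled to the interval between the observed minimum and maximum, with its shape parameters matched to the mean and standard deviation by the method of moments; it also assumes normally distributed error and does not renormalize the drawn proportions to sum to one. The summary statistics, panel values, and function names are invented.

```python
import random


def beta_params(mean, sd, lo, hi):
    """Method-of-moments shape parameters for a beta distribution rescaled to [lo, hi]."""
    m = (mean - lo) / (hi - lo)
    v = (sd / (hi - lo)) ** 2
    common = m * (1 - m) / v - 1  # must be positive for a valid beta distribution
    return m * common, (1 - m) * common


def draw_proportion(rng, mean, sd, lo, hi):
    """Draw one cell-type proportion from the rescaled beta distribution."""
    alpha, beta = beta_params(mean, sd, lo, hi)
    return lo + (hi - lo) * rng.betavariate(alpha, beta)


def simulate_bulk(rng, panel, proportions, noise_sd):
    """Bulk expression per gene: panel row times proportions, plus normal error."""
    return [
        sum(x * p for x, p in zip(row, proportions)) + rng.gauss(0, noise_sd)
        for row in panel
    ]


rng = random.Random(1)
# (mean, sd, min, max) for an abundant and a rare cell-type (hypothetical values).
summary = [(0.30, 0.05, 0.10, 0.50), (0.02, 0.01, 0.00, 0.08)]
props = [draw_proportion(rng, *s) for s in summary]
panel = [[1.2, -0.3], [-0.5, 2.0], [0.4, 0.4]]  # 3 genes x 2 cell-types
bulk = simulate_bulk(rng, panel, props, noise_sd=0.01)
```

Because the true proportions are known in such artificial data, any estimator can be scored directly against them.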

Table 1 shows a comparison of the EB method with the methods previously recommended after a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data¹⁶. The MuSiC package is specifically designed for snRNA-seq data. We had to down-sample the number of nuclei to avoid excessive run times. We could also not replicate our previous observation that MuSiC produces superior estimates⁶³. Instead, results were so poor that they likely indicated convergence problems, so this estimator was not considered further. Table 1 shows that all estimators were unbiased. However, the EB estimator was substantially more precise: compared to EB, the RMSEs for CIBERSORTx and dtangle were 5 (0.025 vs. 0.005) and 6.8 (0.034 vs. 0.005) times larger. This increased precision translated to systematically higher correlations (0.936 vs. 0.846 and 0.784 for CIBERSORTx and dtangle) between the true and estimated cell-type proportions. These correlations remained satisfactory even for rare cell-types. Although cell-type proportions close to zero can be estimated at zero by chance, a large number of zeroes may indicate problems with estimating low abundances precisely. CIBERSORTx produced a relatively large number of zeroes, particularly for low abundant cell-types. In contrast, dtangle did not produce any zeroes, but because the other indices indicated that this estimator had the lowest precision, this should not be interpreted as meaning that it estimates low abundances precisely.
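
The two evaluation indices used in this comparison, the root mean squared error (RMSE) and the correlation between true and estimated proportions, can be computed as in the sketch below. The proportions are invented for illustration; only the RMSE values in the final two lines are taken from the text.

```python
import math


def rmse(true, est):
    """Root mean squared error between true and estimated proportions."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true, est)) / len(true))


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented true and estimated proportions for four cell-types.
true_props = [0.02, 0.05, 0.10, 0.30]
estimated = [0.03, 0.04, 0.12, 0.29]
error = rmse(true_props, estimated)
correlation = pearson(true_props, estimated)

# Ratios of the RMSEs reported in the text (CIBERSORTx and dtangle relative to EB).
ratio_cibersortx = 0.025 / 0.005
ratio_dtangle = 0.034 / 0.005
```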

Table 1 Evaluation of the empirical Bayes estimation procedure.

We also studied the performance under less ideal circumstances. That is, to simulate a mismatch between panel and bulk data, we replaced 50% of all bulk gene expression data with a random value and also increased the error in the bulk data by a factor of 10. Table 2 shows that the pattern of results mimicked that of Table 1. Although performance in terms of RMSE and correlation decreased, the EB estimator appeared robust, producing estimates that could still be used in research.
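
One way to set up such a stress test is sketched below. The text does not state the distribution of the random replacement values or the baseline error level, so the standard normal draws, the baseline standard deviation, and the function name are assumptions.

```python
import random


def perturb_bulk(signal, rng, base_noise_sd, replace_fraction=0.5, error_factor=10):
    """Degrade noise-free bulk values to mimic a panel/bulk mismatch.

    A random subset of genes is replaced by random values (assumed standard
    normal); the remaining genes receive error with an inflated standard deviation.
    """
    n_replace = round(len(signal) * replace_fraction)
    replaced = set(rng.sample(range(len(signal)), n_replace))
    degraded = []
    for gene, value in enumerate(signal):
        if gene in replaced:
            degraded.append(rng.gauss(0, 1))
        else:
            degraded.append(value + rng.gauss(0, base_noise_sd * error_factor))
    return degraded, replaced


rng = random.Random(7)
signal = [0.5, -0.2, 1.1, 0.0, 0.8, -1.3, 0.4, 0.9, -0.6, 0.2]  # invented values
noisy, replaced = perturb_bulk(signal, rng, base_noise_sd=0.01)
```

Re-estimating the proportions from the degraded values and comparing them with the known truth shows how gracefully an estimator degrades.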

Table 2 Relative robustness of the empirical Bayes estimation procedure.

Cell-type specific association studies

Figure 2 shows that the mean of the cell-type proportion estimates in our demonstration bulk RNA-seq dataset was highly correlated with the mean snRNA-seq counts (r = 0.951). Only EX.NRGN showed a notable difference. Given that our simulation study yielded unbiased estimates, this most likely reflects true biological variation between the two sample sets. Similar to what we observed in the simulation study, even for rare cell-types few proportions were estimated to be zero (average 3.2%).

To study whether the distribution of the test statistics under the null hypothesis followed the assumed theoretical distribution, 1,000 transcriptome-wide association studies (TWASs) were performed after randomly permuting case–control labels. Results showed that for each cell-type lambda (ratio of the median of the observed distribution of the test statistic to the expected median) was close to one (Fig. 3, overall median/mean 1.01/1.03 with range 0.90–1.28). This implied the absence of test statistic inflation and that accurate P values are obtained under the null hypothesis. This was true for even the rarest cell-types, suggesting that it is possible to perform TWASs on rare cell-types without an increased risk of false positives.
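
Lambda, as defined above, can be computed from the P values of one permuted TWAS as in the following sketch. It assumes that each P value is converted to a chi-square statistic with one degree of freedom, which is a common convention but is not specified in the text; the function name is hypothetical.

```python
from statistics import NormalDist, median

# Median of a 1-df chi-square distribution, about 0.455.
EXPECTED_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_lambda(p_values):
    """Median observed chi-square statistic divided by its expected median under the null."""
    normal = NormalDist()
    chi_square = [normal.inv_cdf(1 - p / 2) ** 2 for p in p_values]
    return median(chi_square) / EXPECTED_MEDIAN


# Perfectly uniform P values, as expected when the null hypothesis holds.
n_tests = 1001
uniform_p = [(i - 0.5) / n_tests for i in range(1, n_tests + 1)]
lam = inflation_lambda(uniform_p)
```

Values well above one would indicate that the test statistics are larger than the theoretical null distribution predicts, that is, inflation.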

Overall, 11 genes were transcriptome-wide significant when controlling the FDR at the 0.1 level and 105 findings reached “suggestive” significance when controlling the FDR at the 0.25 level (i.e., 10% and 25%, respectively, of the findings are expected to be false).
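
The text does not state which FDR procedure was used. As an illustration only, the sketch below applies the widely used Benjamini-Hochberg adjustment and then the two thresholds mentioned above; the P values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so each adjusted value is monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


p_values = [0.001, 0.008, 0.07, 0.15, 0.60]  # invented P values for five tests
q_values = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(q_values) if q <= 0.10]
suggestive = [i for i, q in enumerate(q_values) if q <= 0.25]
```

Here the first two tests pass the 0.1 threshold and the first four pass the more lenient 0.25 threshold.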

We studied whether power could be improved by grouping similar cell-types. For this purpose, we performed a principal components analysis (PCA) followed by a varimax rotation on the gene expression values of the panel. Cell-types with loadings > 0.5 on the same principal component were combined. The PCA suggested that six groups of two cell-types (Table S5) could potentially be combined, leaving 11 cell-types. These PCA results corresponded very well with the UMAP plot (Fig. 1), which shows the proximity of these same cell-types. After grouping, 68 genes reached transcriptome-wide significance when controlling the FDR at the 0.1 level and 106 findings reached “suggestive” significance when controlling the FDR at the 0.25 level. The signal most likely improved because, in this demonstration dataset, the combined cell-types showed similar association signals.
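
The grouping rule can be illustrated with a short sketch that starts from an already rotated loading matrix; the PCA and varimax rotation themselves are not shown. The loadings, the pairing of cell-types, and the choice to combine grouped cell-types by summing their estimated proportions are all assumptions made for illustration (the actual groups are listed in Table S5).

```python
def group_cell_types(loadings, threshold=0.5):
    """Group cell-types whose loadings exceed the threshold on the same component.

    `loadings` maps cell-type -> list of loadings on the rotated components.
    """
    n_components = len(next(iter(loadings.values())))
    groups, assigned = [], set()
    for component in range(n_components):
        members = [
            ct for ct, row in loadings.items()
            if row[component] > threshold and ct not in assigned
        ]
        if members:
            groups.append(members)
            assigned.update(members)
    # Cell-types that load on no component stay on their own.
    groups.extend([ct] for ct in loadings if ct not in assigned)
    return groups


def combine_proportions(sample_props, groups):
    """Sum the estimated proportions of the cell-types within each group."""
    return {"+".join(group): sum(sample_props[ct] for ct in group) for group in groups}


# Hypothetical loadings on two rotated components.
loadings = {
    "IN.SST": [0.8, 0.1],
    "IN.PV": [0.7, 0.2],
    "AST": [0.1, 0.9],
    "MGL": [0.2, 0.3],
}
groups = group_cell_types(loadings)
combined = combine_proportions({"IN.SST": 0.04, "IN.PV": 0.06, "AST": 0.20, "MGL": 0.05}, groups)
```

Grouping trades resolution for power: a combined cell-type has a larger proportion, but a signal present in only one of its members is diluted.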

Discussion

We propose a new brain reference panel that allows the detection of differentially expressed genes in human bulk brain data on a fine-grained cell-type specific level. We created the panel through an integrated (mega) analysis of the data from the seven largest snRNA-seq studies in brain after processing all data in exactly the same way. Our panel included 17 cell-types that were robustly detected across all studies, subregions of the prefrontal cortex, and sex and age groups.

Our goal was not to make a complete inventory of all cell-types in the PFC but to create a reference panel that can subsequently be used for cell-type specific association studies. Thus, cell-types that were not observed in all studies were omitted and we used settings in the cluster analyses that would avoid a very large number of cell-type clusters. We believe that the proposed panel strikes a good balance between being fine-grained but not to the extent that all the cell-type proportions can no longer be estimated precisely.

To estimate the cell-type proportions, we proposed an empirical Bayes estimator that yielded highly accurate and unbiased estimates even for the low abundant cell-types. Our estimator substantially outperformed the three estimators recommended previously after a comprehensive evaluation of methods to estimate cell-type proportions from brain transcriptome data¹⁶. Our panel contains a substantial number of cell-types, several of which had low abundances. A precise estimator is critical to prevent downstream analyses with low abundant cell-types from producing unreliable results. Furthermore, our estimator has the desirable property that it uses a panel comprising mean expression levels rather than the nuclei level snRNA-seq data. This avoids working with very large data files and the need to apply for access to obtain all the nuclei level data from repositories.

Transcriptome-wide association studies performed with permuted bulk RNA-seq data showed that it is possible to perform TWASs for even the rarest cell-types without an increased risk of false positives. For example, even cell-types with frequencies as low as 1% yielded transcriptome-wide significant results in the absence of test statistic inflation. Furthermore, analyses showed that more significant findings were obtained by grouping similar cell-types. However, this finding may be specific to our demonstration dataset, and it is very well possible that in other data sets fewer significant findings are obtained when the grouped cell-types do not show associations with the same genes.

The proposed approach requires bulk gene expression data to estimate the cell-type proportions. However, once the cell-type proportions are estimated, we can study cell-type specific associations for any other bulk data generated for the same brain samples (e.g., microRNAs, methylation data, open chromatin data). Although snRNA-seq studies, and consequently our panel, involve gene expression, we can therefore also study individual transcripts. This is important as only specific transcripts of the gene may be differentially expressed. In these scenarios the study of expression at the gene level will dilute association signals and result in a loss of potentially critical biological information.

Whereas snRNA-seq assays nuclear RNAs with a poly A-tail (mainly mRNA), our bulk RNA-seq data assayed total RNA from the entire cells, which may contain transcripts not present in the nucleus^64,65. However, the means of the cell-type proportion estimates in our bulk RNA-seq dataset were highly correlated with the cell-type proportion counts observed in the snRNA-seq data. This suggested that possible differences in expression levels between the panel genes in the nucleus versus the entire cell did not distort the cell-type proportion estimates.

Even with the advent of snRNA-seq, deconvolution methods are likely to remain pertinent for cell-type specific association studies with brain tissue. This is because the vast majority of existing gene expression data sets involve bulk samples. Deconvolution allows the (re-)use of these “legacy” data to study cell-type specific effects. Deconvolution methods can also potentially be useful to validate findings from snRNA-seq studies. Such validation with a different technology can eliminate possible false discoveries due to technology specific artefacts and therefore allows for more rigorous conclusions.

In summary, brain disorders are leading causes of disability worldwide. We proposed a new reference panel and a precise estimator for the cell-type proportions that allow the use of bulk brain data to study brain disorders on a fine-grained cell-type specific level. The use of this approach may prevent many cell-type specific disease associations from remaining undetected in studies with bulk data. Furthermore, identifying the specific cell-types from which association signals originate is key to formulating refined hypotheses about the etiology of brain disorders, designing proper follow-up experiments and, eventually, developing novel clinical interventions. The reference panel and easy-to-use accompanying analysis tools are publicly available.

Data availability

The datasets analysed during the current study are available in the GEO (accession numbers GSE157827, GSE138852, GSE174367, GSE144136, GSE144136, GSE140231) and Synapse (accession number syn18642926). The panel and R scripts used for empirical Bayes estimation and its evaluation through simulations are available from GitHub: https://github.com/ejvandenoord/Empirical-Bayes-estimation-of-cell-type-proportions. RaMWAS is freely available from Bioconductor (https://bioconductor.org/packages/release/bioc/html/ramwas.html) and a script to perform cell-type specific association studies with RaMWAS is also provided on GitHub: https://github.com/ejvandenoord/celltype_MWAS.

References

WHO. Depression and Other Common Mental Disorders: Global Health Estimates (World Health Organization, 2017).
Murray, C. J. L. & Lopez, A. D. The Global Burden of Disease: A Comprehensive Assessment of Mortality and Disability from Diseases, Injuries, and Risk Factors in 1990 and Projected to 2020 (World Health Organization, 1996).
Harris, E. C. & Barraclough, B. Excess mortality of mental disorder. Br. J. Psychiatry 173, 11–53 (1998).
CAS PubMed Google Scholar
Habib, N. et al. Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat. Methods 14, 955 (2017).
CAS PubMed PubMed Central Google Scholar
Shen-Orr, S. S. & Gaujoux, R. Computational deconvolution: Extracting cell type-specific information from heterogeneous samples. Curr. Opin. Immunol. 25, 571–578 (2013).
CAS PubMed Google Scholar
Troubat, R. et al. Neuroinflammation and depression: A review. Eur. J. Neurosci. 53, 151–171 (2021).
CAS PubMed Google Scholar
Moylan, S., Maes, M., Wray, N. R. & Berk, M. The neuroprogressive nature of major depressive disorder: Pathways to disease evolution and resistance, and therapeutic implications. Mol. Psychiatry 18, 595–606 (2013).
CAS PubMed Google Scholar
Hartline, D. K. & Colman, D. R. Rapid conduction and the evolution of giant axons and myelinated fibers. Curr. Biol. 17, R29-35 (2007).
CAS PubMed Google Scholar
Neve, R. L. Overview of gene delivery into cells using HSV-1-based vectors. Curr. Protoc. Neurosci. 6(4), 4–12 (2012).
Google Scholar
Artusi, S., Miyagawa, Y., Goins, W. F., Cohen, J. B. & Glorioso, J. C. Herpes simplex virus vectors for gene transfer to the central nervous system. Diseases 6, 74 (2018).
CAS PubMed PubMed Central Google Scholar
Venet, D., Pecasse, F., Maenhaut, C. & Bersini, H. Separation of samples into their constituents using gene expression data. Bioinformatics 17(Suppl 1), S279–S287 (2001).
PubMed Google Scholar
Shen-Orr, S. S. et al. Cell type-specific gene expression differences in complex tissues. Nat. Methods 7, 287–289 (2010).
CAS PubMed PubMed Central Google Scholar
Hattab, M. W. et al. Correcting for cell-type effects in DNA methylation studies: Reference-based method outperforms latent variable approaches in empirical studies. Genome Biol. 18, 24 (2017).
PubMed PubMed Central Google Scholar
Zhu, L., Lei, J., Devlin, B. & Roeder, K. A unified statistical framework for single cell and bulk Rna sequencing data. Ann. Appl. Stat. 12, 609–632 (2018).
MathSciNet PubMed PubMed Central MATH Google Scholar
Krishnaswami, S. R. et al. Using single nuclei for RNA-seq to capture the transcriptome of postmortem neurons. Nat. Protoc. 11, 499–524 (2016).
CAS PubMed PubMed Central Google Scholar
Sutton, G. J. et al. Comprehensive evaluation of deconvolution methods for human brain gene expression. Nat. Commun. 13, 1358 (2022).
CAS PubMed PubMed Central ADS Google Scholar
Agarwal, D. et al. A single-cell atlas of the human substantia nigra reveals cell-specific pathways associated with neurological disorders. Nat. Commun. 11, 4183 (2020).
PubMed PubMed Central ADS Google Scholar
Brenner, E. et al. Single cell transcriptome profiling of the human alcohol-dependent brain. Hum. Mol. Genet. 29, 1144–1153 (2020).
CAS PubMed PubMed Central Google Scholar
Lau, S. F., Cao, H., Fu, A. K. Y. & Ip, N. Y. Single-nucleus transcriptome analysis reveals dysregulation of angiogenic endothelial cells and neuroprotective glia in Alzheimer’s disease. Proc. Natl. Acad. Sci. U. S. A. 117, 25800–25809 (2020).
CAS PubMed PubMed Central ADS Google Scholar
Mathys, H. et al. Single-cell transcriptomic analysis of Alzheimer’s disease. Nature 570, 332–337 (2019).
CAS PubMed PubMed Central ADS Google Scholar
Morabito, S. et al. Single-nucleus chromatin accessibility and transcriptomic characterization of Alzheimer’s disease. Nat. Genet. 53, 1143–1155 (2021).
CAS PubMed PubMed Central Google Scholar
Nagy, C. et al. Single-nucleus transcriptomics of the prefrontal cortex in major depressive disorder implicates oligodendrocyte precursor cells and excitatory neurons. Nat. Neurosci. 23, 771–781 (2020).
CAS PubMed Google Scholar
Velmeshev, D. et al. Single-cell genomics identifies cell type-specific molecular changes in autism. Science 364, 685–689 (2019).
CAS PubMed PubMed Central ADS Google Scholar
Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
CAS PubMed PubMed Central ADS Google Scholar
10x Genomics. Interpreting intronic and antisense reads in 10x genomics single cell gene expression data. Technical Note, CG000376. (2020).
Ding, J. et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat. Biotechnol. 38, 737–746 (2020).
CAS PubMed PubMed Central Google Scholar
La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
PubMed PubMed Central ADS Google Scholar
Ameur, A. et al. Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nat. Struct. Mol. Biol. 18, 1435–1440 (2011).
CAS PubMed Google Scholar
Peng, S. et al. Probing glioblastoma and its microenvironment using single-nucleus and single-cell sequencing. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2757–2762 (IEEE, 2019).
van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).
PubMed PubMed Central Google Scholar
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
CAS PubMed PubMed Central Google Scholar
Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
CAS PubMed Google Scholar
Dong, M. et al. SCDC: Bulk gene expression deconvolution by multiple single-cell RNA sequencing references. Brief Bioinform. 22, 416–427 (2021).
CAS PubMed Google Scholar
Jew, B. et al. Accurate estimation of cell composition in bulk expression through robust integration of single-cell information. Nat. Commun. 11, 1971 (2020).
CAS PubMed PubMed Central ADS Google Scholar
Finak, G. et al. MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
PubMed PubMed Central Google Scholar
Houseman, E. A. et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinf. 13, 86 (2012).
Google Scholar
B, G., J, G., I, A. & Brilleman S (2022). “rstanarm: Bayesian applied regression modeling via Stan.” R package version 2.21.3. rstanarm: Bayesian applied regression modeling via Stan. (2022).
Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
CAS PubMed PubMed Central Google Scholar
Hunt, G. J., Freytag, S., Bahlo, M. & Gagnon-Bartsch, J. A. dtangle: Accurate and robust cell type deconvolution. Bioinformatics 35, 2093–2099 (2019).
CAS PubMed Google Scholar
Wang, X., Park, J., Susztak, K., Zhang, N. R. & Li, M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. 10, 380 (2019).
CAS PubMed PubMed Central ADS Google Scholar
Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).
CAS PubMed PubMed Central Google Scholar
Sosina, O. A. et al. Strategies for cellular deconvolution in human brain RNA sequencing data. bioRxiv https://doi.org/10.1101/2020.01.19.910976 (2020).
Article Google Scholar
Sahraeian, S. M. E. et al. Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis. Nat. Commun. 8, 59 (2017).
PubMed PubMed Central ADS Google Scholar
Suzuki, N. et al. Differentiation of oligodendrocyte precursor cells from Sox10-venus mice to oligodendrocytes and astrocytes. Sci. Rep. 7, 14133 (2017).
PubMed PubMed Central ADS Google Scholar
Batiuk, M. Y. et al. Identification of region-specific astrocyte subtypes at single cell resolution. Nat. Commun. 11, 1220 (2020).
CAS PubMed PubMed Central ADS Google Scholar
McKinsey, G. L. et al. A new genetic strategy for targeting microglia in development and disease. Elife 9, e54590 (2020).
CAS PubMed PubMed Central Google Scholar
Walker, F. et al. Parvalbumin- and vasoactive intestinal polypeptide-expressing neocortical interneurons impose differential inhibition on Martinotti cells. Nat. Commun. 7, 13664 (2016).
CAS PubMed PubMed Central ADS Google Scholar
Song, H. W. et al. Transcriptomic comparison of human and mouse brain microvessels. Sci Rep 10, 12358 (2020).
MathSciNet CAS PubMed PubMed Central ADS Google Scholar
Gao, Y., Tatavarty, V., Korza, G., Levin, M. K. & Carson, J. H. Multiplexed dendritic targeting of alpha calcium calmodulin-dependent protein kinase II, neurogranin, and activity-regulated cytoskeleton-associated protein RNAs by the A2 pathway. Mol. Biol. Cell 19, 2311–2327 (2008).
CAS PubMed PubMed Central Google Scholar
Shi, L. & Bergson, C. M. Neuregulin 1: An intriguing therapeutic target for neurodevelopmental disorders. Transl. Psychiatry 10, 190 (2020).
CAS PubMed PubMed Central Google Scholar
Mouton-Liger, F. et al. CSF levels of the BACE1 substrate NRG1 correlate with cognition in Alzheimer’s disease. Alzheimers Res. Ther. 12, 88 (2020).
CAS PubMed PubMed Central Google Scholar
Chapman, R. M. et al. Convergent evidence that ZNF804A is a regulator of pre-messenger RNA processing and gene expression. Schizophr. Bull. 45, 1267–1278 (2019).
PubMed Google Scholar
Redies, C., Hertel, N. & Hubner, C. A. Cadherins and neuropsychiatric disorders. Brain Res. 1470, 130–144 (2012).
CAS PubMed Google Scholar
AlAyadhi, L. Y. et al. High-resolution SNP genotyping platform identified recurrent and novel CNVs in autism multiplex families. Neuroscience 339, 561–570 (2016).
CAS PubMed Google Scholar
Prokopenko, D. et al. Whole-genome sequencing reveals new Alzheimer’s disease–associated rare variants in loci related to synaptic function and neuronal development. Alzheimer Dementia 17(9), 1509–1527 (2021).
CAS Google Scholar
Daneshmandpour, Y., Darvish, H. & Emamalizadeh, B. RIT2: Responsible and susceptible gene for neurological and psychiatric disorders. Mol. Genet. Genomics 293, 785–792 (2018).
CAS PubMed Google Scholar
Qiu, L., Yu, H. & Liang, F. Multiple C2 domains transmembrane protein 1 is expressed in CNS neurons and possibly regulates cellular vesicle retrieval and oxidative stress. J. Neurochem. 135, 492–507 (2015).
CAS PubMed Google Scholar
Cauda, F. et al. Functional anatomy of cortical areas characterized by Von Economo neurons. Brain Struct. Funct. 218, 1–20 (2013).
PubMed Google Scholar
Brune, M. et al. Von Economo neuron density in the anterior cingulate cortex is reduced in early onset schizophrenia. Acta Neuropathol. 119, 771–778 (2010).
PubMed Google Scholar
Gefen, T. et al. Von Economo neurons of the anterior cingulate across the lifespan and in Alzheimer’s disease. Cortex 99, 69–77 (2018).
PubMed Google Scholar
Hakeem, A. Y. et al. Von Economo neurons in the elephant brain. Anat. Rec. (Hoboken) 292, 242–248 (2009).
PubMed Google Scholar
Fajardo, C. et al. Von Economo neurons are present in the dorsolateral (dysgranular) prefrontal cortex of humans. Neurosci. Lett. 435, 215–218 (2008).
CAS PubMed Google Scholar
van den Oord, E., Xie, L. Y., Tran, C. J., Zhao, M. & Aberg, K. A. A targeted solution for estimating the cell-type composition of bulk samples. BMC Bioinf. 22, 462 (2021).
Google Scholar
Chen, L. A global comparison between nuclear and cytosolic transcriptomes reveals differential compartmentalization of alternative transcript isoforms. Nucleic Acids Res. 38, 1086–1097 (2010).
CAS PubMed Google Scholar
Abdelmoez, M. N. et al. SINC-seq: Correlation of transient gene expressions between nucleus and cytoplasm reflects single-cell physiology. Genome Biol. 19, 66 (2018).
PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This research was supported by grants R01MH109525 and R01MH124981 from the National Institute of Mental Health. The sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Author information

Authors and Affiliations

Center for Biomarker Research and Precision Medicine, Virginia Commonwealth University, McGuire Hall, Room 216A, 1112 East Clay Street, P. O. Box 980533, Richmond, VA, 23298-0581, USA
Edwin J. C. G. van den Oord & Karolina A. Aberg

Authors

Edwin J. C. G. van den Oord
Karolina A. Aberg

Contributions

K.A. and E.O. conceived the research question. E.O. performed the analyses and wrote the first draft of the article. Both authors edited the article and approved the final draft.

Corresponding author

Correspondence to Edwin J. C. G. van den Oord.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Supplementary Information 3.

Supplementary Information 4.

Supplementary Information 5.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

van den Oord, E.J.C.G., Aberg, K.A. Fine-grained cell-type specific association studies with human bulk brain data using a large single-nucleus RNA sequencing based reference panel. Sci Rep 13, 13004 (2023). https://doi.org/10.1038/s41598-023-39864-2

Received: 25 July 2022
Accepted: 01 August 2023
Published: 10 August 2023
DOI: https://doi.org/10.1038/s41598-023-39864-2
