Brain Cell Type Specific Gene Expression and Co-expression Network Architectures

Elucidating brain cell type specific gene expression patterns is critical towards a better understanding of how cell-cell communications may influence brain functions and dysfunctions. We set out to compare and contrast five human and murine cell type-specific transcriptome-wide RNA expression data sets that were generated within the past several years. We defined three measures of brain cell type-relative expression including specificity, enrichment, and absolute expression and identified corresponding consensus brain cell “signatures,” which were well conserved across data sets. We validated that the relative expression of top cell type markers are associated with proxies for cell type proportions in bulk RNA expression data from postmortem human brain samples. We further validated novel marker genes using an orthogonal ATAC-seq dataset. We performed multiscale coexpression network analysis of the single cell data sets and identified robust cell-specific gene modules. To facilitate the use of the cell type-specific genes for cell type proportion estimation and deconvolution from bulk brain gene expression data, we developed an R package, BRETIGEA. In summary, we identified a set of novel brain cell consensus signatures and robust networks from the integration of multiple datasets and therefore transcend limitations related to technical issues characteristic of each individual study.

For all three murine data sets, the gene symbols were converted from MGI symbols to HGNC symbols by using the Ensembl database, accessed through biomaRt (version 2.28) 6 . If no homolog existed, then the original mouse symbol was retained. In the case of multiple homologous genes, we selected the human gene with the highest homology percentage based on protein coding region DNA divergence. In the case that multiple transcripts mapped to the same gene symbol, the gene symbol with the maximum average expression count was retained for subsequent analyses 7 .

Comparison of cell type signatures to PubMed text mining results
For the top 100 most enriched genes in each cell type, we performed PubMed (https://www.ncbi.nlm.nih.gov/pubmed) searches and counted the number of results for the search "Gene AND CellType" (Supplementary File 2), where Gene refers to the character string of the enriched gene symbol and CellType refers to the character string of the enriched cell type, either "astrocyte", "endothelial", "microglia", "neuron", "oligodendrocyte", or "oligodendrocyte precursor". We used the Mann-Whitney U test to compare the number of PubMed mentions found among the corresponding most enriched gene set, compared to the genes most enriched in all of the other cell types, with the exception of oligodendrocytes and OPCs, which were not included in each other's control sets for comparison because of their similarity. We also found the Spearman correlation between the enrichment for each gene in that cell type and the number of PubMed mentions for the gene symbol/cell type combination for each cell type.

Comparison of the three cell-type associated gene expression measurements
We first selected genes such that they needed to be within the top 1000 genes as ranked by one of the measures at a time. We used this filtering method because, on one hand, we needed to remove low-ranked genes because the measures were not designed to be sensitive to differences in gene ranks at the lower ranks, because the expression of a gene in a given cell type for all data sets may be negligible. On the other hand, if we were to filter genes such that they ranked highly in at least one of the categories, we would have introduced a spurious association between the two variables, as a result of Berkson's bias. Therefore, we calculated the Spearman (rank) correlation between gene rankings for each of the three measures after filtering for the top ranked genes in one measure at a time, for a total of unique six comparisons, for each of the six cell types.

Estimation of relative cell type proportions from bulk RNA expression data
To make relative cell type proportion estimates, we adapted the previously validated singular value decomposition (SVD) method from CellCODE 8 , which has been implemented in the BRETIGEA (BRain cEll Type specIfic Gene Expression Analysis) R package (version 1.0). Specifically, after scaling the data, the algorithm in BRETIGEA calculates the first singular vector of a variable number of cell type-specific (marker) genes as an estimate of the relative cell type proportion in those samples. Because the signs of this singular vector can be reversed, we next find the mean correlation between the singular vector and each of the marker genes, and if this is less than zero, then the sign of all of the values in the singular vector are switched. Note that these estimated values can be less than zero for a given sample, which is part of why it is a relative cell type proportion estimate, rather than an absolute cell type proportion estimate. BRETIGEA also offers principal component analysis (PCA) as an option for cell type relative proportion estimation, along with sets of 1000 top consensus cell-specific genes for each cell type.
We used matched RNA expression and immunohistochemistry marker data from the Aging, Dementia, and TBI study from the Allen Brain Atlas 9 as an additional validation of the marker genes identified in bulk brain gene expression data. In this study, investigators performed both RNA-seq and immunohistochemistry (IHC) quantification for protein levels of the astrocyte marker gene GFAP and the microglia marker gene IBA1 in brain samples from the same donor across four brain regions (frontal white matter, hippocampus, temporal cortex, and parietal cortex). We downloaded the normalized RPKM values from the study website (http://aging.brain-map.org/download/index), which has been adjusted for RNA integrity number (RIN) and batch. The IHC quantifications provide an independent measurement for the relative levels of the cell types in each brain sample, which can be used to validate our relative cell type proportion estimates from the RNA expression data. Using all four brain regions available in the Allen Brain Atlas data set, we first calculated 100 SVD-based estimates of relative cell type proportions for astrocytes and microglia in each sample using the top 1-100 marker genes. We correlated all of these estimates with the independent IHC quantifications for GFAP and IBA in those samples. We also found the rank correlation between each of the individual top 100 marker genes for astrocytes and microglia with the independent IHC quantifications for GFAP and IBA1, respectively.
As an additional exploratory analysis, we estimated the relative proportion of astrocytes and microglia using the SVD-based method within each of the 4 individual brain regions and calculated the rank correlations of these estimates with the IHC quantifications for GFAP and IBA1, respectively, in each of those brain regions.

ATAC-seq validation of novel cell markers
To validate our novel cell marker genes, we performed ATAC-seq on 50mg of frozen postmortem brain tissue from the dorsolateral prefrontal cortex from four controls (i.e. individuals that had no history of neuropsychiatric disease or substance abuse). Human brains from normal control Caucasian subjects without head trauma were collected at autopsy within 24 h after death at the Department of Forensic Medicine, Semmelweis University and at the National Institute of Forensic Medicine (Karolinska Institutet). All material was obtained under approved local ethical guidelines. Nuclei were processed for fluorescence activated nuclear sorting (FANS) as described previously 10 . ATAC-seq was performed on FANS sorted nuclei to assess chromatin accessibility following an established protocol 11 . We mapped ATAC-seq reads with STAR 2.50 and the following settings: --alignIntronMax 1 --outFilterMismatchNmax 100 --alignEndsType EndToEnd --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3. To rule out sample mislabelling and contamination, genotypes were called with GATK 3.50 and the genetic concordance between all combinations of samples calculated. Peaks were subsequently called with the MACS 2.1 narrow peaks algorithm with the following settings --keep-dup all --shift -100 --extsize 200 --nomodel. We created a consensus set of peaks by keeping only those that were found in two or more samples. Read counts were then quantified within these peaks using RSubread 1.15 and the following settings: allowMultiOverlap=F, isPairedEnd=T, strandSpecific=0, requireBothEndsMapped=F, minFragLength=0, maxFragLength=2000, checkFragLength=T, countMultiMappingReads=F, countChimericFragments=F. This yielded a peak by sample expression matrix, from which peaks mapping to the ENCODE blacklisted regions were excluded. Peaks that had very few read counts (defined as peaks that had less than 1 count per million in 90% or more of the samples) were then discarded. For the remaining peaks, we normalized the read counts using "TMM" normalization followed by conversion of read counts to counts per million. We then constructed a peak by gene mapping by considering a peak mapped to a gene if a TSS overlapped the peak. Most often a gene overlapped, at most, one peak, but when it overlapped multiple peaks due to multiple transcripts we took the sum of the reads within the peaks.
Additionally, genes not having a TSS overlapping an ATAC-seq peak were left out of further analyses. This To visualize the gene ontology (GO) enrichment of distinct modules in the multiscale network, we filtered for modules containing between 5 and 2000 gene members, and selected only those modules with at least one gene that was specific to that module, thus removing exclusively parent modules. We used the moduleGO function in DGCA 12 (version 1.0.1) to perform gene ontology (GO) enrichment analysis on gene modules, which leverages the GOstats (version 2.34) 13 and org.Hs.eg.db GO annotation (version 3.1.2) R packages. To identify GO terms with potential brain cell type-specific activity, we filtered for those GO terms with less than 500 gene symbols. We adjusted the enrichment p-values for all GO terms in each module using the Benjamini-Hochberg (BH) method. We used Fisher's Exact Test (FET) to identify cell type enrichments for each module, using the top 500 genes ranked by consensus cell type specificity from each cell type in the human data sets as the cell type enrichment signatures, and adjusted the module enrichment p-values for each cell type signature using the BH method. The background universe size used in the FETs was the number of unique gene symbols found in both the human and mouse data sets. To identify module conservation, we found the FET enrichment of the intersection of all the modules identified in each cell type from the Darmanis et al.

Supplementary Figures
Supplementary Figure 1. Pairwise data set comparison of cell type specificity rankings for each cell type. Plots of the fold-enrichment of the intersections of genes ranked in the top n genes (where n = 10, 20, 50, 100, 200, 500, 1000) between pairs of data sets for the cell type specificity measure. The data sets were merged to only include gene symbols common to both prior to calculating the fold enrichment score. A fold enrichment of 0 indicates that no genes were found in the intersection of those two sets of top genes. Figure 2. Pairwise data set comparison of cell type absolute expression rankings for each cell type. Plots of the fold-enrichment of the intersections of genes ranked in the top n genes (where n = 10, 20, 50, 100, 200, 500, 1000) between pairs of data sets for the cell type absolute expression measure. The data sets were merged to only include gene symbols common to both prior to calculating the fold enrichment score. A fold enrichment of 0 indicates that no genes were found in the intersection of those two sets of top genes. confidence intervals. Note that the discontinuities present in the plots of specificity and enrichment rankings are due to the set of genes with no measured expression in this cell type. The rank correlation and associated pvalue for each comparison are noted in the bottom right of each plot. Figure 6. Comparison of the three cell type-associated measures in endothelial cells. The ranks of the top 1000 genes in each of the three cell type-associated measures (y-axis) compared to the ranks of the other two measures (x-axes; see the Supplementary Methods for more information). Each point represents the rank of one gene according to each measure. The points are partially transparent (alpha = 0.1) to mitigate overplotting. The black line is a result of a linear model fit to the data, while the grey lines represent 95% confidence intervals. Note that the discontinuities present in the plots of specificity and enrichment rankings are due to the set of genes with no measured expression in this cell type. The rank correlation and associated pvalues for each comparison are noted in the bottom right of each plot. Figure 7. Comparison of the three cell type-associated measures in microglia. The ranks of the top 1000 genes in each of the three cell type-associated measures (y-axis) compared to the ranks of the other two measures (x-axes; see the Supplementary Methods for more information). Each point represents the rank of one gene according to each measure. The points are partially transparent (alpha = 0.1) to mitigate overplotting. The black line is a result of a linear model fit to the data, while the grey lines represent 95% confidence intervals. Note that the discontinuities present in the plots of specificity and enrichment rankings are due to the set of genes with no measured expression in this cell type. The rank correlation and associated p-values for each comparison are noted in the bottom right of each plot. Figure 8. Comparison of the three cell type-associated measures in neurons. The ranks of the top 1000 genes in each of the three cell type-associated measures (y-axis) compared to the ranks of the other two measures (x-axes; see the Supplementary Methods for more information). Each point represents the rank of one gene according to each measure. The points are partially transparent (alpha = 0.1) to mitigate overplotting. The black line is a result of a linear model fit to the data, while the grey lines represent 95% confidence intervals. The black line is a result of a linear model fit to the data, while the grey lines represent 95% confidence intervals. Note that the discontinuities present in the plots of specificity and enrichment rankings are due to the set of genes with no measured expression in this cell type. The rank correlation and associated pvalues for each comparison are noted in the bottom right of each plot.