Spatial cell type composition in normal and Alzheimer’s human brains is revealed using integrated mouse and human single cell RNA sequencing

Single-cell RNA sequencing (scRNA-seq) resolves heterogeneous cell populations in tissues and helps to reveal function and dynamics at the single-cell level. In neuroscience, the rarity of brain tissue is the bottleneck for such studies. Evidence shows that mouse and human share similar cell type gene markers. We hypothesized that scRNA-seq data from mouse brain tissue can be used to complete human data in order to infer cell type composition in human samples. Here, we supplement the cell type information of human scRNA-seq data with mouse data. The resulting data were used to infer the spatial cellular composition of 3702 human brain samples from the Allen Human Brain Atlas. We then mapped the cell types back to the corresponding brain regions. Most cell types were localized to the correct regions. We also compared the mapping results to those derived from neuronal nuclei locations. They were consistent after accounting for changes in neural connectivity between regions. Furthermore, we applied this approach to Alzheimer’s brain data and successfully captured cell pattern changes in AD brains. We believe this integrative approach can solve the sample rarity issue in neuroscience.

Step 3: Matching homologs between Mouse scRNA-Seq and Human AHBA Microarray
After the feature selection step, the mRMR-selected genes were paired with microarray data obtained from each AHBA donor individually. The pairing further reduced the number of genes to the subset contained in both the AHBA and the scRNA-Seq datasets. MusNG contained 423 genes, HumN contained 382 genes, and HumNG contained 62 genes.
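The pairing in Step 3 can be sketched as an order-preserving intersection of two gene lists. This is only an illustration: the function name `match_homologs` and the demo gene symbols are hypothetical, and matching mouse and human symbols by upper-casing is a simplifying assumption standing in for whatever homolog mapping was actually used.

```python
def match_homologs(selected_genes, microarray_genes):
    """Keep the selected genes that also appear on the microarray.

    Assumption: mouse and human symbols are matched case-insensitively
    (e.g. mouse "Gad1" vs human "GAD1"); a real analysis would use a
    curated homolog table instead.
    """
    available = {gene.upper() for gene in microarray_genes}
    # Preserve the ranking order produced by the feature selection step.
    return [gene for gene in selected_genes if gene.upper() in available]


# Hypothetical demo data, not taken from the study.
selected = ["Gad1", "Snap25", "Xyz1"]
on_array = ["GAD1", "SNAP25", "AQP4"]
paired = match_homologs(selected, on_array)
print(paired)
```

Keeping the original order matters if later steps rely on the ranking from feature selection; a plain set intersection would lose it.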
Step 4: Selecting Concordant Homologs Between Mouse scRNA-Seq and Human AHBA Microarray
These remaining genes were then center normalized (so that no cell type has a higher mean expression across all genes), log2 transformed, and converted to z-scores. For HumN and HumNG, the missing cell types were filled in using the more complete MusNG dataset, such that HumN contained 2 human and 7 mouse cell types and HumNG contained 5 human and 3 mouse cell types. Because the AHBA data differ fundamentally from the scRNA-Seq data in platform and species, genes had to be selected that behaved the same across the datasets. This introduced a problem: a traditional correlation matrix cannot be computed because the microarray and scRNA-Seq samples are not paired by independent variables. However, gene lists ranked by correct classification of brain regions correlate between mouse and human 3, and mouse and human genes can be selected that will correctly cluster cell types regardless of species 4. Based on this evidence, we developed a rank-based approach using the difference in z-scores of the log2-transformed microarray and scRNA-Seq data matrices to identify concordant homologs. An allowable difference in standard deviation (the standard deviation constant) was chosen that corresponded to the maximum allowable change in the mean z-score for each gene across all samples.
To attain this, a vector of mean expression (each element for one gene) across all microarray samples was used to compare against the scRNA-Seq data.
Here $x_i$ is the expression vector from microarray sample $i$. Next, a second vector was generated from the RNA-Seq CTEPs as a weighted average of all CTEPs, with the weights initialized to a uniform distribution. A gene from the set of all genes entered the selected gene set if it was within $d$ units of itself between the RNA-Seq and microarray datasets, where $d$ is the standard deviation constant and the microarray expression matrix has genes as rows and samples as columns. The predicted proportions ($\alpha$) of each CTEP in the microarray samples were calculated using the retained set of concordant gene features.
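The concordance filter can be illustrated with a small sketch. It reflects one reading of the description: per-gene means are taken across microarray samples and, with uniform weights, across CTEPs; both mean vectors are converted to z-scores across genes; and a gene is kept when the two z-scores differ by at most the standard deviation constant. The function names, the use of the population standard deviation, and the demo values are all assumptions.

```python
from statistics import mean, pstdev


def zscores(values):
    """Convert a list of numbers to z-scores (population standard deviation)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def concordant_genes(genes, microarray, cteps, d):
    """Return the genes whose z-scores agree within d between platforms.

    microarray: one row per gene, one column per sample.
    cteps:      one row per gene, one column per cell type profile.
    d:          standard deviation constant (allowed z-score difference).
    """
    array_mean = [mean(row) for row in microarray]  # mean across samples
    ctep_mean = [mean(row) for row in cteps]        # uniform CTEP weights
    z_array, z_ctep = zscores(array_mean), zscores(ctep_mean)
    return [g for g, a, b in zip(genes, z_array, z_ctep) if abs(a - b) <= d]


# Hypothetical demo data: genes C and D swap ranks between platforms.
genes = ["A", "B", "C", "D"]
microarray = [[1, 1], [2, 2], [3, 3], [4, 4]]
cteps = [[1, 1], [2, 2], [4, 4], [3, 3]]
strict = concordant_genes(genes, microarray, cteps, d=0.5)
loose = concordant_genes(genes, microarray, cteps, d=1.0)
print(strict, loose)
```

A smaller constant keeps fewer genes but demands closer agreement between the two platforms, which is the trade-off the standard deviation constant controls.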
In the principal component regression (PCR) model (Eq 9), $\beta_j$ is the weight of principal component $j$ in the linear combination of principal components that approximates the expression vector $x_i$ of sample $i$, while $t_j$ is the expression profile of principal component $j$. To retrieve the cell-type-specific proportions ($\alpha$) from the PCR, $\beta$ can be transformed back into the original variables using the loading matrix $L$.
$$\alpha = L\beta, \quad L \in \mathbb{R}^{c \times k} \quad \text{(Eq 10)}$$
However, the main problem encountered when using PCR to deconvolute cell types is that the proportion of each cell type is not constrained to be non-negative. To avoid this problem, non-negative matrix factorization (NMF) can be used instead to create a lower-dimensional matrix from the original cell type matrix, which has $c$ columns, one for each CTEP, and $kept$ rows, one for each selected feature (gene). NMF is a technique in which a matrix is approximated by the product of two non-negative factor matrices 9,10. Let $X$ be the raw expression matrix.
$$X \approx WH, \quad X \in \mathbb{R}^{kept \times c}, \quad W \in \mathbb{R}^{kept \times k}, \quad H \in \mathbb{R}^{k \times c} \quad \text{(Eq 11)}$$
The non-negative matrix factorization of $X$ produces $W$ and $H$, which can be thought of as the non-negative score matrix and loading matrix, respectively. By treating the non-negative approximation matrix as the principal components and using a non-negative fit function in the OLS model, non-negative values are produced that can then be used as estimates of cell type densities. Eq 9 and 10 can be rewritten in terms of $W$ and $H$ to attain non-negative estimates.
$$x_i = W\beta + \varepsilon \quad \text{(Eq 12)}$$
$$\alpha = H^{\top}\beta \quad \text{(Eq 13)}$$
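As a hedged illustration of the non-negative fit and back-transformation (Eq 12 and Eq 13), the sketch below fits non-negative weights by multiplicative updates and then multiplies by the transposed loading matrix. The symbols `W`, `H`, `beta`, and `alpha`, the choice of multiplicative updates as the non-negative fit function, the form of the back-transformation, and the toy matrices are all assumptions made for illustration. The input expression vector is assumed to be non-negative.

```python
def nonneg_fit(W, x, n_iter=500):
    """Approximate x by W @ beta with beta >= 0 (multiplicative updates).

    W: non-negative score matrix, one row per gene, one column per factor.
    x: non-negative expression vector of one sample, one entry per gene.
    """
    n_genes, k = len(W), len(W[0])
    # Precompute W^T x and W^T W, which stay fixed during the updates.
    wt_x = [sum(W[g][j] * x[g] for g in range(n_genes)) for j in range(k)]
    wt_w = [[sum(W[g][a] * W[g][b] for g in range(n_genes)) for b in range(k)]
            for a in range(k)]
    beta = [1.0] * k
    for _ in range(n_iter):
        denom = [sum(wt_w[j][m] * beta[m] for m in range(k)) for j in range(k)]
        # Each update multiplies by a non-negative ratio, so beta stays >= 0.
        beta = [beta[j] * wt_x[j] / denom[j] if denom[j] > 0 else 0.0
                for j in range(k)]
    return beta


def back_transform(H, beta):
    """Map factor weights to cell type estimates: alpha = H^T beta."""
    n_types = len(H[0])
    return [sum(H[j][t] * beta[j] for j in range(len(H))) for t in range(n_types)]


# Hypothetical toy factors: 3 genes, 2 factors, 3 cell types.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
x = [2.0, 3.0, 5.0]  # equals W @ [2, 3], so an exact fit exists
beta = nonneg_fit(W, x)
alpha = back_transform(H, beta)
print(beta, alpha)
```

Because every step multiplies non-negative quantities, the estimates cannot become negative, which is the property that motivates replacing PCR with NMF in the text.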

Significance testing of cell type correlations
To test whether the predicted cell type proportions were consistent in their spatial distribution regardless of the starting dataset (MusNG, HumN, HumNG), we calculated correlations and performed significance testing. As described in the main text, we calculated the correlation of each cell type to itself across starting datasets. These correlations were used in main text Fig. 3. We also calculated the correlation for every combination of cell types, within and across datasets, for each starting dataset comparison. This resulted in a correlation matrix in which all cell type correlations could be studied (Supplementary Fig. 4-10). Since the overall cross-dataset correlations are of interest, we also display the mean correlation across all brains for each of the three comparisons (MusNG+HumN, MusNG+HumNG, HumN+HumNG; Supplementary Fig. 4). In this overall comparison, correlations between matching cell types are always more positive than correlations between mismatched cell types (Supplementary Table I).
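The cross-dataset comparison can be sketched as a Pearson correlation computed for every pair of cell types from two starting datasets, over the same set of samples. The function names, the dictionary layout, and the demo proportions are hypothetical; the sketch covers only the correlation step, not the significance test.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cross_dataset_correlations(props_a, props_b):
    """Correlate every cell type of dataset A with every cell type of B.

    props_a, props_b: dict mapping cell type -> proportions per sample,
    with samples in the same order in both datasets.
    """
    return {(type_a, type_b): pearson(values_a, values_b)
            for type_a, values_a in props_a.items()
            for type_b, values_b in props_b.items()}


# Hypothetical proportions for four samples under two starting datasets.
dataset_a = {"astrocyte": [0.1, 0.2, 0.3, 0.4], "neuron": [0.9, 0.8, 0.7, 0.6]}
dataset_b = {"astrocyte": [0.15, 0.25, 0.3, 0.45], "neuron": [0.85, 0.75, 0.7, 0.55]}
corr = cross_dataset_correlations(dataset_a, dataset_b)
print(corr[("astrocyte", "astrocyte")], corr[("astrocyte", "neuron")])
```

In this layout the matching cell type pairs play the role of the diagonal of the correlation matrix, and the mismatched pairs are the off-diagonal entries.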

Principal cell types visualized across the entire AHBA
To show high-level spatial distributions of principal cell types, all six brains across all cell types are combined into single 3D representations for each of the three scRNA-Seq datasets. SVD was performed on the cell-type proportion matrix, and the three largest components were used as the principal cell types across all samples in each brain to display the results as a three-digit color vector. This analysis is performed for each of the six AHBA brains using each of the three scRNA-Seq datasets individually. The principal cell types are displayed in 3D using color to represent the top three principal cell types. The three scRNA-Seq datasets are displayed separately such that all six AHBA brains are manually overlaid for each scRNA-Seq dataset. This registration produces consistent anatomic locations across the six overlaid brains. Displays are generated with MATLAB function scatter3 (The Mathworks, Inc.; Natick, MA, USA) to view sagittal, coronal, and axial images.
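A minimal sketch of the color encoding, written in Python with NumPy rather than MATLAB: the proportion matrix is decomposed by SVD, the three largest components are kept, and each is rescaled to the range 0-1 so that it can serve as one color channel. The function name, the min-max rescaling, and the random demo matrix are assumptions, not details taken from the analysis.

```python
import numpy as np


def principal_cell_type_colors(proportions):
    """Map each sample to a 3-element color from the top three SVD components.

    proportions: array with one row per sample and one column per cell type
    (at least three rows and three columns).
    """
    P = np.asarray(proportions, dtype=float)
    # Singular values are returned in descending order, so the first three
    # columns correspond to the three largest components.
    U, s, _ = np.linalg.svd(P, full_matrices=False)
    scores = U[:, :3] * s[:3]
    # Assumption: rescale each component to [0, 1] for use as a color channel.
    low = scores.min(axis=0)
    span = scores.max(axis=0) - low
    span[span == 0] = 1.0  # avoid division by zero for a constant component
    return (scores - low) / span


# Hypothetical demo: 10 samples, 9 cell types, rows normalized to sum to 1.
rng = np.random.default_rng(0)
demo = rng.random((10, 9))
demo /= demo.sum(axis=1, keepdims=True)
colors = principal_cell_type_colors(demo)
print(colors.shape)
```

The resulting rows could be passed as per-point colors to any 3D scatter routine, with the sample coordinates supplying the positions.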
To visualize the structural patterns among estimated cell types, we applied singular value decomposition (SVD) to the estimated cell-type data from the AHBA to reduce the mapped cell types to three principal types, which could then be displayed as an R^3 color vector. The 3D output for each of the six brains was overlaid by anatomic location. This visualization showed unique patterns associated with the cerebrum, brainstem, and cerebellum (Figure 7A-I). For example, the cerebrum displays a cell-type pattern that is distinct from that of both the brainstem and cerebellum. In contrast, although the brainstem and cerebellum differed from one another, the cell types within these two regions were similar (Figure 7A-I). These patterns were consistent across brains and among the input scRNA-Seq datasets that were used to deconvolute the samples (Figure 7A-I).
We also evaluated the principal cell types in each major brain region visually (Figure 7J-L, Supplementary Table III). The brainstem was generally composed of CA1 pyramidal and glial cell types. The cerebellum was composed of interneurons and various glial cell types (Figure 7J-L, Supplementary Table III). Though detailed information on specific cell-type locations is not common, it is worth noting that some patterns are consistent with known locations; for example, ependymal cells localized to the spinal cord and ventricular regions.

Supplementary Figures
Supplementary Figure 1. To have comparable datasets, some cell types from the mouse dataset had to be used to complete the two human datasets.
Supplementary Figure 2. This plot represents the empirical study of the likelihood of acquiring an intersection of 62 features from 2 randomly selected samples described in the discussion section. Left: the distribution of intersections between two randomly selected datasets. Right: the CDF of that distribution, with the intersection found during our study marked by the red dashed line.

[Figure residue: panel labels Mic-Hum and End-Hum; four truncated captions, each ending "Table 2)."]
The random dataset constitutes a five-fold randomization of the sample labels such that it reflects the background noise and spurious correlation. Each row and column are a brain donor. ANOVA p-value between random and non-random donors is 1.1799×10^-23.
The random dataset constitutes a five-fold randomization of the sample labels such that it reflects the background noise and spurious correlation. Each row and column are a brain donor. ANOVA p-value between random and non-random donors is 1.5364×10^-27.
The random dataset constitutes a five-fold randomization of the sample labels such that it reflects the background noise and spurious correlation. Each row and column are a brain donor. ANOVA p-value between random and non-random donors is 2.1421×10^-11.
Cerebellum structures and major brain regions marked. B) PCA plot of the cell type proportion matrix such that the 9 cell types are reduced to 2 principal cell types for each of the samples in the brain. The colors indicate the three regions (cerebrum, brainstem, and cerebellum). C) The proportion of each cell type plotted from a sagittal view; Red: high cell proportion (1); Black: low cell proportion (0) on a scale 0-1.
[Figure residue: axis label PC1 (72.7%); panel titles (1) Interneurons, (2) S1 Pyramidal, (3) CA1 Pyramidal.]

Supplementary Table II
Each row is a combination of scRNA-Seq dataset and brain donor, and each corresponds to a spatial distribution figure (e.g. Figure 5B, Supplementary Figures 14-30B). The columns contain the overall accuracy of the k-means clusters as well as the cluster-wise sensitivity, specificity, and F-score.
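The quantities in this table can be illustrated with a short sketch that computes the overall accuracy and the per-cluster sensitivity, specificity, and F-score from two label lists. It assumes that each k-means cluster has already been assigned to the brain region it represents; the function name and the demo labels are hypothetical.

```python
def cluster_metrics(true_labels, predicted_labels):
    """Return overall accuracy and per-cluster sensitivity, specificity, F-score."""
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_cluster = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        # One-vs-rest confusion counts for this cluster.
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        tn = len(pairs) - tp - fp - fn
        per_cluster[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
            "f_score": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
        }
    return accuracy, per_cluster


# Hypothetical labels for six samples; one brainstem sample is misassigned.
truth = ["cerebellum", "cerebellum", "brainstem", "brainstem", "cerebrum", "cerebrum"]
clusters = ["cerebellum", "cerebellum", "brainstem", "cerebrum", "cerebrum", "cerebrum"]
accuracy, metrics = cluster_metrics(truth, clusters)
print(accuracy, metrics["cerebrum"])
```

Reporting the metrics per cluster shows which region absorbs the misassigned samples, which a single accuracy value would hide.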