Single-cell RNA sequencing of mouse brain and lung vascular and vessel-associated cell types

Vascular diseases are major causes of death, yet our understanding of the cellular constituents of blood vessels, including how differences in their gene expression profiles create diversity in vascular structure and function, is limited. In this paper, we describe a single-cell RNA sequencing (scRNA-seq) dataset that defines vascular and vessel-associated cell types and subtypes in mouse brain and lung. The dataset contains 3,436 single cell transcriptomes from mouse brain, which formed 15 distinct clusters corresponding to cell (sub)types, and another 1,504 single cell transcriptomes from mouse lung, which formed 17 cell clusters. In order to allow user-friendly access to our data, we constructed a searchable database (http://betsholtzlab.org/VascularSingleCells/database.html). Our dataset constitutes a comprehensive molecular atlas of vascular and vessel-associated cell types in the mouse brain and lung, and as such provides a strong foundation for future studies of vascular development and diseases.


Background & Summary
The blood vasculature is built from two principal cell classes: endothelial cells, which line the blood vessel lumens, and mural cells, which surround and/or stretch along the endothelial tubes. Mural cell is a collective term for pericytes and vascular smooth muscle cells (SMCs). Pericytes are broadly defined as the mural cells of microvessels, whereas SMCs occupy arteries and veins. In spite of clear differences in anatomical location and cell morphology, insight into the molecular and functional differences of mural cell subtypes is still limited 1,2 . Concerning other vessel-associated cell types, large arteries and veins harbor a clearly distinguishable outer layer-the adventitia-that contains fibroblast-like cells and extracellular matrix (ECM). However, a more exact definition of the adventitial ECM-producing cells, functionally as well as transcriptomically, is still missing. The presence of adventitial cells along smaller arterial and venous branches is also poorly understood. In the mouse brain, a new type of perivascular/leptomeningeal cell was recently pinpointed 3 , but a more exact anatomical and molecular description of these cells was still missing. In order to achieve a molecular understanding of the constituent cell types, using single cell RNA sequencing (scRNA-seq), we transcriptionally profiled vascular and vessel-associated cells in brain and lung 4 . Here, we provide a Data Descriptor for this dataset (Fig. 1a).
To capture vascular and vessel-associated cell types from the adult mouse brain, we used a set of transgenic reporter mice: Cldn5(BAC)-GFP for endothelial cells, Pdgfrb(BAC)-eGFP;Cspg4-DsRed for mural cells and Pdgfra-H2BGFP for perivascular fibroblast-like cells (Fig. 1b). We also took advantage of an unexpected reporter gene expression in vessel-associated astrocytes from the Tagln-Cre; R26-stop-tdTomato mouse to capture vessel-associated astrocytes. To capture vascular and vesselassociated cell types from the adult mouse lung, we used Cldn5(BAC)-GFP for lung endothelial cells, Pdgfrb(BAC)-eGFP;Cspg4-DsRed and Pdgfrb(BAC)-eGFP for mural cells. Single fluorescent cells were sorted into 384-well plates, lysed, and the mRNA was converted into cDNA libraries using the SmartSeq2 protocol and sequenced 4 . We generated scRNA-seq transcriptomes from 3,436 single cells from the brain and 1,504 single cells from the lung (Data Citation 1). For each organ, the single cell transcriptomes were clustered using BackSPIN (Fig. 1c-e). After manual inspection and annotation, we defined 15 cell clusters in the brain, which following annotation using known canonical markers for the established vascular cells types along with validation of cell subtypes using immunofluorescence and in situ hybridization methods were found to correspond to: pericytes, three types of vascular smooth muscle cells (venous, arteriolar and arterial), microglia, two types of fibroblast-like cells, oligodendrocyte-lineage cells, six types of endothelial cells (venous, capillary, arterial and three others) and astrocytes (Fig. 2a). In the lung, we defined 17 cell clusters. Because our main objective with the lung dataset was to compare brain and lung pericytes, the annotation process of lung cells other than pericytes and endothelial cells was less extensive, but nevertheless indicated the existence of several subtypes of fibroblasts (split in four clusters) and cartilage/perichondrium-related cells (two clusters), pericytes (one cluster), vascular smooth muscle cells (one cluster), and at least two distinct types of endothelial cells (split into eight clusters) (Fig. 2b). To allow the scientific community to contribute to the further annotation of these cell types by assessing their gene expression, we provide user-friendly access to our data in the form of a searchable database http://betsholtzlab.org/ VascularSingleCells/database.html, in which any gene can be searched by acronym, and its expression across the analyzed cell types in brain and lung displayed as single-cell bar-plots as well as diagrams displaying average values for the expression in the different cell types (see Fig. 3a-d for an example).
The dataset in this Data Descriptor provides a first comprehensive molecular profile of vascular and vascular-associated cell types in mouse brain, and a preliminary analysis of vascular and mesenchymal cell types in the lung, the latter complementing recently published single cell data on lung mesenchyme 5,6 . Our dataset provides a foundation for future studies of vascular development, homeostasis and diseases.

Methods
The descriptions of the method protocols below are reproduced and extended from our related research publication 4 , with added details on computational data processing steps.

Isolation of single cells
The preparation of the heart and lung tissue for single cell analysis has been described in our related publication 4 , as well as in the following three papers in Protocol Exchange: Brain cell isolation: All libraries prepared for this study were sequenced on a HiSeq2500, using single 50 base pair reads and dual indexing.

Alignment and generation of counts
The RNA-seq aligner, Spliced Transcripts Alignment to a Reference (STAR, version 2.4.2a) was used to align the short reads to the mouse reference genome (mm10) 9 . The aligner is available for downloading at https://github.com/alexdobin/STAR. Two-pass alignment was chosen to have improved performance of de novo splice junction reads, filtered for only uniquely mapping reads. The STAR parameters are as follows: STAR --runThreadN 1 --genomeDir mm10 --readFilesIn XXX.fastq.gz --readFilesCommand zcat --outSAMstrandField intronMotif --twopassMode Basic The expression values were computed per gene as described in Ramsköld et al. 10 , using uniquely aligned reads and correcting for the uniquely alignable positions using MULTo57(ref. 11). As QC threshold, cells with less than 100,000 reads were discarded, as well as cells that had a Spearman correlation below 0.3.  Our analyses and cell type annotations were based on 3,186 brain vascular-associated cells, 1,504 lung vascular-associated cells and 250 brain astrocytes, which were obtained in parallel experiments using different reporter mice and partly different procedures to obtain the cells (see ref. 4). Therefore, in order to compare the gene expression counts across different cells, the total gene counts for each cell were normalized to 500,000. The R code used for the normalization is available in the Supplementary File 1. The R tsne packages (version 0.1.3) was applied to visualize the 2D t-SNE map and GGally packages (version 1.3.1) was used to make gene pairs plot.

Cell type classification with BackSPIN
As a clustering method, the BackSPIN algorithm 12   1,000  Online database construction The expression database was constructed using html and javascript. For each gene, four figures were pre-made and stored on the server for faster display (see Fig. 3a-d for an example), including: the detailed expression in each cell in the brain dataset (Fig. 3a); the average expression level in each of the 15 clusters in the brain (Fig. 3b); the detailed expression in each cell in the lung dataset (Fig. 3c) and the average expression level in each of the 17 clusters in the lung (Fig. 3d). The gene symbol auto-  complete function was implemented using the jquery.autocomplete.min.js and jquery-1.9.1.min.js plugin (available from https://github.com/devbridge/jQuery-Autocomplete/). The html page source and javascript code of the online database is available online at http://betsholtzlab.org/VascularSingle-Cells/database.html. In order to identify enriched genes in specific brain cell type(s), the average expression for each cell types was stored in a MySQL (version 5.0.12-dev) database table and user queries were passed through a PHP (version 7.0.23) script to the MySQL database.

Code availability
The R code used to process the sequencing data and visualize the results is available in the Supplementary File 1 (R version 3.3.2).

Data Records
The information table for all the cells used in this study is available on Figshare (Data Citation 1). All sequence data and counts matrixes have been deposited in Gene Expression Omnibus database (Data Citation 2-4).

Technical Validation
Quality control of single cell sequencing cDNA and libraries For each experiment, two different plate layouts were used for the FACS-based sorting. One plate (termed the 'sample plate') received one cell in each well of a 384 well plate and was used to obtain the data. The other plate (referred to as the 'validation plate') only contained lysis buffer in the first two columns, and received cells in the following pattern: Twenty cells in A1 and A2, no cells in P1 and P2, and one cell in the rest of column 1 and 2. The validation plate was used both as a quality control for sorting efficiency as well as allowing cDNA amplification optimization prior to proceeding with all 384 cells of the sample plate.
It is impossible to reliably measure mRNA quantity and quality of a single cell without prior amplification of the minute amount of RNA, and thus the first quality control check was done on the validation plate after cDNA synthesis and 22 PCR cycles of amplification. The cDNA quality and concentration were assessed using a DNA high sensitivity assay on a TapeStation 4200 or BioAnalyzer (Agilent Technologies) (Fig. 4a). A major size distribution around 1,500 base pairs indicated intact mRNAs and good quality of cDNA synthesis, sufficient for library preparation. A large cDNA size distribution between 100 and 500 base pairs indicated mRNA degradation and was not processed for library synthesis (data not shown). If the validation plate passed the quality control, the sample plate was processed in the same way as the validation plate. If needed, the PCR cycles could be increased to enrich the cDNA further, yet this has proven unnecessary in this study. After tagmentation and indexing of the cDNA, the libraries were pooled and assessed for quantity and quality with a High Sensitivity DNA chip on a BioAnalyzer (Agilent Biotechnologies) (Fig. 4b).

Technical validation of the data
Beyond the quality control measures described above, additional steps were taken to ensure the validity of the data. To validate the clustering result based on BackSPIN result, the brain and lung single cell data were independently analyzed by the T-Distributed Stochastic Neighbor Embedding (t-SNE) method (Fig. 2a,b). In both the brain and lung data, the t-SNE result spread the single cells in 2-D space and revealed several cell groups. When overlaying and color-coding the cluster result from BackSPIN analysis on t-SNE result, the two methods showed good concordance in general. To check the possible batch effect for the major cell classes in the dataset (endothelial cells and mural cells), two sample plates were sorted per mouse, allowing assessment of technical variation between sample processing. As exemplified for the four plates of endothelial cells (Fig. 5a) and four plates of mural cells (Fig. 5b), the mouse origin as well as  the technical replicates were color-coded. No significant differences could be found between the different plates or different animals when visualizing the data with t-SNE and displaying expression barplot of cells order by BackSPIN. In our dataset, it is common to see a strong variation in gene expression levels between individual cells. These most likely reflect stochastic events from either biological origin (i.e. burst expression of genes) or experimental origin (incomplete capture rate of mRNA by Smart-Seq2 and strong PCR amplification). In order to rule out that inter-cellular differences in gene expression could be a reflection of library quality, we hypothesized that library quality would be correlated to gene expression. Therefore, we analyzed 5 highly expressed pericyte specific genes for correlation of expression of these genes within the cluster (Fig. 6). No correlation could be found, suggesting that inter-cellular differences of expression are not related to library quality. In addition, we also analyzed 5 genes that were broadly expressed in the whole dataset, and again, no correlation could be found (Fig. 7). Thus, we complement the QC on our dataset with a new type of analysis indicating that cell-to-cell variation of gene expression within the same cell population is a stochastic event.