Single Cell RNA Sequencing of stem cell-derived retinal ganglion cells

We used human embryonic stem cell-derived retinal ganglion cells (RGCs) to characterize the transcriptome of 1,174 cells at the single cell level. The human embryonic stem cell line BRN3B-mCherry A81-H7 was differentiated to RGCs using a guided differentiation approach. Cells were harvested at day 36 and subsequently prepared for single cell RNA sequencing. Our data indicates the presence of three distinct subpopulations of cells, with various degrees of maturity. One cluster of 288 cells upregulated genes involved in axon guidance together with semaphorin interactions, cell-extracellular matrix interactions and ECM proteoglycans, suggestive of a more mature phenotype.


BACKGROUND & SUMMARY
Since the isolation of the embryonic stem cells (ESCs) [1][2][3][4] and generation of induced pluripotent stem cells (iPSCs) 5,6 , pluripotent stem cells (PSCs) have made a tremendous contribution towards improving our understanding of mechanisms involved in development and disease. PSCs have the ability to self-renew and differentiate into all cell types of the body, thereby providing great potential for regenerative medicine and cell replacement therapies. Further, PSC-derived progeny allow investigating disease-affected cell types that are not readily accessible due to their anatomical location, such as retinal cells [7][8][9] . Utilising such disease-affected cells will also significantly improve the drug development pipeline through efficacy profiling and side effect or toxicity assessment 10 .
The development of RNA sequencing (RNA-seq) technology has allowed for the rapid quantification of individual gene transcripts. Integrating this high-throughout data with computational and statistical methods provides a toolbox to study the molecular functions of human tissues. Critically, to date, the majority of RNA-seq studies have been conducted on 'bulk' samples, consisting of millions of individual cells -the result of which is that transcript quantification represents the average signal across that cell population. Recent developments to isolate single cells, and barcode their expressed transcripts has enabled the transcriptomes of single cells to be sequenced (scRNA-seq) in a high-throughput manner. By sequencing large number of single cells from an individual 'sample' it is now possible to dissect the cellular composition of apparently homogenous tissues or cell culture [11][12][13] . scRNAseq also opens the possibility of examining rare cell populations that could not otherwise be resolved using bulk RNA-seq, and further characterising well-known cell types, for example oligodendrocytes 14 or sensory neurons 15 . Moreover, scRNA-seq may also be used for tracking cell development during differentiation, as movement between different cell types is associated with changes in gene expression. Thus, stages across a cell lineage can be distinguished by their unique transcriptional signature 16 . This technology has also been used in cell culture, in particular with PSCs, their differentiated progeny and organoids, including of the nervous system 17,18 , as a way to distinctively characterize cellular subpopulations. Results of such analyses can discern determinants of cell fates, and this information can then applied to in vitro differentiation experiments to increase efficiency of generating the tissue of interest 19 . It is estimated that there are more than 30 subtypes of RGCs in mammalian retina 20 , 21 ; however, the molecular profiling of RGCs in human disease has proven difficult.

Retinal ganglion cells (RGCs
Currently, studying optic neuropathies is hindered by the lack of non-invasive means for obtaining RGCs from living donors. This can now be circumvented by use of PSCs as a source of RGCs 8 . We recently described a protocol for the differentiation of human PSCs into functional RGCs 7 . RGCs generated through this method are functional, as exemplified by the presence of sodium and potassium currents, mature axon potentials and the expression of RGC-specific markers, including BRN3B , ISL1 and PRPH 7 . Moreover, whole transcriptome analysis through bulk RNA-seq of our PSC-derived RGCs demonstrated close resemblance to sensory neurons, and cells from the ganglion cell layer 7 . Herein we present a dataset of scRNA-seq to enable the characterization of the transcriptome of RGCs derived from human ESCs (hESCs) at a single cell level.

Ethical Approval
All experimental work performed in this study was approved by the Human Research

Cell culture and retinal differentiation
The reporter line BRN3B-mCherry A81-H7 hESC 9 was maintained on vitronectin-coated 6-well plates using StemFlex (Gibco). Culture medium was changed every second day. Cells were differentiated into RGCs as previously Medium was changed every 2-3 days. RGC differentiation was monitored by the appearance of mCherry-positive cells, reflective of BRN3B expression.
Firstly, selection of RGCs using THY1 was performed by FACS instead of the magnetic sorting we originally reported. Secondly, cells were prepared for sequencing immediately following THY1 selection and were not allowed to rest prior to being further processed. A cell pellet was resuspended in 500 μl of RGC+RI prior to sorting with a BD FACSAria III cell sorter (Becton, Dickinson). Both THY1-positive (+ve) and THY1-negative (-ve) fractions were collected in 5 mL conical tubes (BD Falcon).

Single-cell preparation
Both THY1-positive (+ve) and THY1-negative (-ve) fractions were subjected to library preparation using the Single Cell 3' Reagent Kit (10X Genomics) as per the manufacturer's instruction. This step was performed within 60 minutes of the FACS.
Briefly, cell suspension was mixed using a wide-bore tip to determine cell concentration using a Countess® Automated Cell Counter (Life Technologies). Cells were centrifuged for 5 minutes at 300g and cell pellet was resuspended in PBS with 0.04% BSA. Cell suspension was passed through a cell strainer to remove any remaining cell debris and large clumps and cell concentration was determined again.

Mapping of reads to transcripts and cells
The sequencing data was processed into transcript count tables with the Cell Ranger Single Cell Software Suite 1.3.1 by 10X Genomics (http://10xgenomics.com/). Raw base call files from the HiSeq4000 sequencer were demultiplexed with the cellranger mkfastq pipeline into library-specific FASTQ files. As the libraries were sequenced using non-standard settings, cellranger mkfastq was run with the following parameters: --use-bases-mask="Y26n*,I8n*,n*,Y98n*" --ignore-dual-index. The FASTQ files for each library were then processed independently with the cellranger count pipeline. This pipeline used STAR 22 to align cDNA reads to the Homo sapiens transcriptome (Sequence: GRCh38, Annotation: Gencode v25). Once aligned, barcodes associated with these reads -cell identifiers and Unique Molecular Identifiers (UMIs), underwent filtering and correction. Reads associated with retained barcodes were quantified and used to build a transcript count table. Resulting data for each sample were then aggregated using the cellranger aggr pipeline, which performed a between-sample normalization step and concatenated the two transcript count tables. Post-aggregation, the mapped data was processed and analyzed as described below.

Preprocessing
To preprocess the mapped data, we constructed a cell quality matrix based on the following data types: library size (total mapped reads), total number of genes detected, percent of reads mapped to mitochondrial genes, and percent of reads mapped to ribosomal genes. Cells that had any of the 4 parameter measurements higher than 3x median absolute deviation (MAD) of all cells were considered outliers and removed from subsequent analysis ( Table 1 ). In addition, we applied two thresholds to remove cells with mitochondrial reads above 20% or ribosomal reads above 50% ( Table 1 ). To exclude genes that were potentially detected from random noise, we removed genes that were detected in fewer than 1% of all cells.
The expression data was normalised on two levels to reduce possible systematic bias between samples and between cells. The first level of normalisationbetween samples, was performed prior to data aggregation using cellranger aggr's depth equalisation method 23 . This method reduces potential confounding effects caused by differences in sequencing depths between samples by subsampling mapped reads from higher-depth libraries until the number of mapped reads per library were equal. The second level of normalisation -between cells, was performed after filtering using the deconvolution approach by Lun et al . 24 . This level of normalisation reduces bias possibly caused by technical variation such as cDNA synthesis, PCR amplification efficiency and sequencing depth for each cell. The deconvolution approach was chosen as it accounts for the sparse nature of expression data by pooling expression counts from groups of cells. As the sizes of the groups were linear (40, 60, 80, 100), the group-specific normalised size factors could be deconvolved into cell-specific size factors that were then used to scale the counts of individual cells.
After normalisation, abundantly expressed ribosomal protein genes and mitochondrial genes were discarded. We have made available both the raw and normalised data on ArrayExpress under accession E-MTAB-6108 .

Identification of residual low-quality cells via clustering
We identified and removed a small group of cells with low-quality sequence data.
These cells were not detected by initial filtering; instead, they were identified via clustering and enrichment of differentially expressed genes. The transcript count table underwent dimensionality reduction using Principal Component Analysis (PCA). This procedure was applied to the top 1,500 most variable genes using the prcomp() function in R 25 . The first 20 PCs were retained and a cell-PCA eigenvector matrix was used for clustering.
We applied an unsupervised clustering method that does not take into account any predetermined parameters to objectively identify single cell subpopulations 26 .
This method is less biased compared to top-down clustering approaches, such as the k-means. Briefly, to achieve high-resolution clustering capable of detecting small subpopulations and outliers, we applied bottom-up agglomerative hierarchical clustering to construct a dendrogram tree, which contains one cell as one bottom branch (the highest resolution). We used the reduced dataset containing the top 20 PCs described above to calculate an Euclidean distance matrix between cells, and organized cells into the dendrogram using the Ward's minimum distance so that similar cells are joint into larger groups of branches. To identify subpopulations, we applied an unsupervised, objective approach to merge the branches into a high-resolution and stable clustering result. The approach divided the dendrogram tree into 40 height-windows, ranging from 0.025 (from the bottom of the tree) to 1 (from the top). By iteratively and dynamically merging cells in each of the 40 height-windows, we generated 40 independent clustering results, different in the clustering resolutions. Clustering results were then compared quantitatively using adjusted rand indices, which score pairs of cells that are the same or different between two clustering results 27 . The optimal clustering result was the most stable result across a range of consecutive tree-height values.
To characterise the identified clusters, we performed pairwise differential expression analysis by fitting a general linear model and using a negative binomial test as described in the DESeq package 28 . Network analysis was then performed on significant differentially-expressed genes using Reactome functional interaction analysis 28,29 .

Code availability
All code and usage notes available at: https://github.com/IMB-Computational-Genomics-Lab/RetinaGanglionCells . This includes: computational bioinformatic pipelines that process sequence data in BCL format through to a mapped UMI expression matrix; Scripts for quality-control, normalisation, clustering, differential expression and visualization.

Data Records
Data is available at ArrayExpress under accession number: E-MTAB-610 8 . Files consist of raw FASTQ files as well as a tab separated matrix of Transcripts Per Million for each cell passing quality control filtering.

TECHNICAL VALIDATION
The hESC reporter line BRN3B-mCherry A81-H7 was differentiated to RGCs following our established protocol 7 , changing culture medium every second day.
After 36 days, selection of RGCs using THY1 was performed by FACS. Both positive and negative THY fractions were harvested, and single cells harvested for library preparation and scRNA-Seq as outlined in Figure 1 . Processing our initial analysis identified a group of 61 cells whose expression levels indicated degradation and apoptosis ( Figure 2 ). These 61 cells were removed from the data and the expression data from the remaining 1,174 healthy cells was re-normalised and analysed.
Clustering of these 1,174 cells identified three distinct subpopulations consisting of 531, 355 and 288 cells ( Figure 3A ). We performed differential expression analysis and subsequently pathway enrichment to characterise the molecular functions of these subpopulations ( Figure 3B ).
The 531 cells from subpopulation one were upregulated for genes associated with neural cell adhesion molecule signaling for neuronal outgrowth and Hedgehog pathway, which plays various roles in patterning of the central nervous system.
Interestingly, genes implicated in collagen biosynthesis, extracellular matrix proteoglycans and axon guidance were downregulated ( Table S1 ). This pattern of gene expression suggests a progenitor or an early differentiation state. Cells from subpopulation two contained upregulated genes associated with control of the Notch protein expression, which is a crucial member of the Notch signaling pathway implicated in the neuronal function and development, and DNA repair ( Table S2 ).
Collectively, this pattern of gene expression is indicative of a more differentiated RGC phenotype than the cells in cluster one. The 288 cells identified as subpopulation three contained upregulated genes involved in axon guidance, together with semaphorin interactions, cell-extracellular matrix interactions and extracellular matrix proteoglycans. Furthermore, we observed significant downregulation of multiple genes associated with cell cycle ( Table S3 ). Taken together this indicates that this subpopulation three represents a more mature neuronal phenotype compared to cells in the other two subpopulations. Importantly, one cell within subpopulation one was found to express OPN4 a gene known to be expressed in intrinsically-photosensitive RGCs. Thus, these data indicate different levels of maturity of ESC derived RGC, with this conclusion supported by observed pathway enrichment ( Table S4-6 ).

Usage Notes
Our (c) Sequence data is processed using bioinformatic pipelines, and analysis conducted on the resulting expression matrix.