Single cell RNA sequencing of stem cell-derived retinal ganglion cells

We used single cell sequencing technology to characterize the transcriptomes of 1,174 human embryonic stem cell-derived retinal ganglion cells (RGCs) at the single cell level. The human embryonic stem cell line BRN3B-mCherry (A81-H7), was differentiated to RGCs using a guided differentiation approach. Cells were harvested at day 36 and prepared for single cell RNA sequencing. Our data indicates the presence of three distinct subpopulations of cells, with various degrees of maturity. One cluster of 288 cells showed increased expression of genes involved in axon guidance together with semaphorin interactions, cell-extracellular matrix interactions and ECM proteoglycans, suggestive of a more mature RGC phenotype.


Background & Summary
Since the isolation of embryonic stem cells (ESCs) [1][2][3][4] and generation of induced pluripotent stem cells (iPSCs) 5,6 , pluripotent stem cells (PSCs) have made a tremendous contribution towards improving our understanding of mechanisms involved in development and disease. PSCs have the ability to self-renew and differentiate into all cell types of the body, thereby providing great potential for regenerative medicine and cell replacement therapies. Further, PSC-derived progeny allow the investigation of diseaseaffected cell types that are not readily accessible due to their anatomical location, such as retinal cells [7][8][9] . Utilising such disease-affected cells will also significantly improve the drug development pipeline through efficacy profiling and side effect or toxicity assessment 10 .
The development of RNA sequencing (RNA-seq) technology has allowed for the rapid quantification of individual gene transcripts. Integrating this high-throughput data with computational and statistical methods provides a toolbox to study the molecular functions of human tissues. Critically, to date, the majority of RNA-seq studies have been conducted on 'bulk' samples, consisting of millions of individual cells-the result of which is that transcript quantification represents the average signal across the cell population being studied. Recent developments to isolate single cells, and barcode their expressed transcripts has enabled the transcriptomes of single cells to be sequenced (scRNA-seq) in a highthroughput manner. By sequencing large number of single cells from an individual 'sample' it is now possible to dissect the cellular composition of apparently homogenous tissues or cell cultures [11][12][13] . sc-RNAseq also opens the possibility of examining rare cell populations that could not otherwise be resolved using bulk RNA-seq, and further characterising well-known cell types, for example oligodendrocytes 14 or sensory neurons 15 . Moreover, scRNA-seq may also be used for tracking cell lineage during differentiation, as movement between different cell types is associated with changes in gene expression. Thus, stages across a cell lineage can be distinguished by their unique transcriptional signature 16 . This technology has also been used in cell culture, in particular with PSCs, their differentiated progeny and organoids, including of the nervous system 17,18 , as a way to distinctively characterize cellular subpopulations. Results of such analyses can discern determinants of cell fates, and this information can then applied to in vitro differentiation experiments to increase efficiency of generating the tissue of interest 19 .
Retinal ganglion cells (RGCs) transmit visual information from the retina to the midbrain through the optic nerve. Many diseases, such as primary open angle glaucoma, Leber hereditary optic neuropathy and autosomal dominant optic atrophy manifest by degeneration or loss of RGCs and culminate in irreversible loss of sight. It is estimated that there are more than 30 subtypes of RGCs in the mammalian retina 20 , however, the molecular profiling of RGCs in human normal and disease tissue has proven difficult. Currently, studying optic neuropathies is hindered by the lack of non-invasive means for obtaining RGCs from living donors. This can now be circumvented by use of PSCs as a source of RGCs 8 . We recently described a protocol for the differentiation of human PSCs into functional RGCs 7 . RGCs generated through this method are functional, as exemplified by the presence of sodium and potassium currents, mature axon potentials and the expression of RGC-specific markers, including BRN3B, ISL1 and PRPH 7 . Moreover, whole transcriptome analysis through bulk RNA-seq of our PSC-derived RGCs demonstrated close resemblance to sensory neurons, and cells from the ganglion cell layer 7 . Herein we present a dataset of scRNA-seq to characterize the transcriptome of RGCs derived from human ESCs (hESCs) at a single cell level.

Ethical approval
All experimental work performed in this study was approved by the Human Research Ethics committees of the University of Melbourne (0605017) with the requirements of the National Health & Medical Research Council of Australia (NHMRC) and conformed with the Declarations of Helsinki.

Fluorescence-activated cell sorting (FACS)
On day 36 of differentiation, cells were washed with phosphate-buffered saline (PBS) and incubated with Accutase (Sigma, 37°C, 5 min). Cells were then incubated in RGC differentiation medium supplemented with the ROCK inhibitor Y27632 (10 μM, Selleckchem, RGC+RI) and gently dissociated using a P1000 pipette, filtered using a 100 μm nylon strainer (BD Falcon) and centrifuged (300 g, 10 min). The cell pellet was resuspended in RGC+RI medium and incubated with THY1 antibody ( centrifuged (300 g, 3 min). Two modifications to our original protocol were performed. Firstly, selection of RGCs using THY1 was performed by FACS instead of the magnetic sorting we originally reported. Secondly, cells were prepared for sequencing immediately following THY1 selection and were not allowed to rest prior to being further processed. A cell pellet was resuspended in 500 μl of RGC+RI prior to sorting with a BD FACSAria III cell sorter (Becton, Dickinson). Both THY1-positive (+ve) and THY1negative (-ve) fractions were collected in 5 ml conical tubes (BD Falcon).

Single-cell preparation
Both THY1-positive (+ve) and THY1-negative (-ve) fractions were subjected to library preparation using the Single Cell 3′ Reagent Kit (10X Genomics) as per the manufacturer's instruction. This step was performed within 60 min of the FACS. Briefly, cell suspension was mixed using a wide-bore tip to determine cell concentration using a Countess ® Automated Cell Counter (Life Technologies). Cells were centrifuged for 5 min at 300 g and the cell pellet was resuspended in PBS with 0.04% BSA. The cell suspension was passed through a cell strainer to remove any remaining cell debris and large clumps and the cell concentration was determined again.

Generation of single cell GEMs and sequencing libraries
Single cell suspensions were loaded onto 10X Genomics Single Cell 3′ Chips along with the reverse transcription (RT) master mix as per the manufacturer's protocol for the Chromium Single Cell 3′ v2 Library (10X Genomics; PN-120233), to generate single cell gel beads in emulsion (GEMs). Sequencing libraries were generated with unique sample indices (SI) for each sample. The resulting libraries were assessed by gel electrophoresis (Agilent D1000 ScreenTape Assay) and quantified with qPCR (Illumina KAPA Library Quantification Kit). Following normalization to 2 nM, libraries were denatured and diluted to 17pM of cluster generation using the Illumina cBot (HiSeq PE Cluster Kit v4). Libraries for the two samples were multiplexed respectively, and sequenced on an Illumina HiSeq 2500 (control software v2.

Mapping of reads to transcripts and cells
The sequencing data was processed into transcript count tables with the Cell Ranger Single Cell Software Suite 1.3.1 by 10X Genomics (http://10xgenomics.com/). Raw base call files from the HiSeq2500 sequencer were demultiplexed with the cellranger mkfastq pipeline into library-specific FASTQ files. As the libraries were sequenced using non-standard settings, cellranger mkfastq was run with the following parameters: --use-bases-mask = "Y26n*,I8n*,n*,Y98n*" --ignore-dual-index. The FASTQ files for each library were then processed independently with the cellranger count pipeline. This pipeline used STAR 21 to align cDNA reads to the Homo sapiens transcriptome (Sequence: GRCh38, Annotation: Gencode v25). Once aligned, barcodes associated with these readscell identifiers and Unique Molecular Identifiers (UMIs), underwent filtering and correction. Reads associated with retained barcodes were quantified and used to build a transcript count table. Resulting data for each sample were then aggregated using the cellranger aggr pipeline, which performed a between-sample normalization step and concatenated the two transcript count tables. Post-aggregation, the mapped data was processed and analyzed as described below.

Preprocessing
To preprocess the mapped data, we constructed a cell quality matrix based on the following data types: library size (total mapped reads), total number of genes detected, percent of reads mapped to mitochondrial genes, and percent of reads mapped to ribosomal genes. Cells that had any of the 4 parameter measurements lower than 3x median absolute deviation (MAD) of all cells were considered outliers and removed from subsequent analysis ( Table 1) 22 . In addition, we applied two thresholds to remove cells with mitochondrial reads above 20% or ribosomal reads above 50% (Table 1). To exclude genes that were potentially detected from random noise, we removed genes that were detected in fewer than 1% of all cells. The expression data was normalised on two levels to reduce possible systematic bias between samples and between cells. The first level of normalisation -between samples, was performed prior to data aggregation using the cellranger aggr depth equalisation method 23 . This method reduces potential confounding effects caused by differences in sequencing depths between samples by subsampling mapped reads from higher-depth libraries until the number of mapped reads per library were equal. The second level of normalisation -between cells, was performed after filtering using the deconvolution approach by Lun et al. 24 . This level of normalisation reduces bias possibly caused by technical variation such as cDNA synthesis, PCR amplification efficiency and sequencing depth for each cell. The deconvolution approach was chosen as it accounts for the sparse nature of expression data by pooling expression counts from groups of cells. As the sizes of the groups were linear (40, 60, 80, 100), the group-specific normalised size factors could be deconvolved into cell-specific size factors that were then used to scale the counts of individual cells. After normalisation, abundantly expressed ribosomal protein genes and mitochondrial genes were discarded. We have made available both the raw and normalised data (Data Citation 1).

Identification of residual low-quality cells via clustering
We identified and removed a small group of cells with low-quality sequence data. These cells were not detected by initial filtering; instead, they were identified via clustering and enrichment of differentially expressed genes. The transcript count table underwent dimensionality reduction using Principal Component Analysis (PCA). This procedure was applied to the top 1,500 most variable genes using the prcomp() function in R 25 . The first 20 PCs were retained and a cell-PCA eigenvector matrix was used for clustering.
We applied an unsupervised clustering method that does not take into account any predetermined parameters to objectively identify single cell subpopulations 26 . This method is less biased compared to top-down clustering approaches, such as k-means. Briefly, to achieve high-resolution clustering capable of detecting small subpopulations and outliers, we applied bottom-up agglomerative hierarchical clustering to construct a dendrogram tree, where the highest resolution is one cell = one bottom branch. We used the reduced dataset containing the top 20 PCs described above to calculate an Euclidean distance matrix between cells, and organized cells into the dendrogram using the Ward's minimum distance so that similar cells are joint into larger groups of branches. To identify subpopulations, we applied an unsupervised, objective approach to merge the branches into a high-resolution and stable clustering result. The approach divided the dendrogram tree into 40 height-windows, ranging from 0.025 (from the bottom of the tree) to 1 (from the top). By iteratively and dynamically merging cells in each of the 40 height-windows, we generated 40 independent clustering results with varying resolutions. Clustering results were then compared quantitatively using adjusted Rand indices, which score pairs of cells that are the same or different between two clustering results 27 . The optimal clustering result was the most stable result across a range of consecutive tree-height values.  To characterise the identified clusters, we performed pairwise differential expression analysis by fitting a general linear model and using a negative binomial test as described in the DESeq package 28 . Network analysis was then performed on significant differentially-expressed genes using Reactome functional interaction analysis 28,29 .

Code availability
All code and usage notes are available at: https://github.com/IMB-Computational-Genomics-Lab/ RetinaGanglionCells. This includes: computational bioinformatic pipelines that process sequence data in BCL format through to a mapped UMI expression matrix; scripts for quality-control, normalisation, clustering, differential expression and visualization.

Data Records
Data is available at ArrayExpress under accession number: E-MTAB-6108. Files consist of raw FASTQ files as well as a tab separated matrix of Transcripts Per Million for each cell passing quality control filtering. BAM files can be generated by using the supplied repository to process the FASTQ files via Cell Ranger.

Technical Validation
The hESC reporter line BRN3B-mCherry A81-H7 was differentiated to RGCs following our established protocol 7 , changing culture medium every second day. After 36 days, selection of RGCs using THY1 was performed by FACS. Both positive and negative THY fractions were harvested, and single cells harvested for library preparation and scRNA-Seq as outlined in Fig. 1. Processing our initial analysis identified a group of 61 cells whose expression levels indicated degradation and apoptosis ( Fig. 2a and b). These 61 cells were removed from the data and the expression data from the remaining 1,174 healthy cells was renormalised and analysed. Clustering of these 1,174 cells identified three distinct subpopulations consisting of 531, 355 and 288 cells (Fig. 3a). We performed differential expression analysis and subsequently pathway enrichment to characterise the molecular functions of these subpopulations (Fig.  3b-g). The proportion of reads mapped to mitochondrial and ribosomal genes are displayed in Fig. 4.
The 531 cells from subpopulation one were upregulated for genes associated with neural cell adhesion molecule signalling for neuronal outgrowth and Hedgehog pathway, which plays various roles in patterning of the central nervous system. Interestingly, genes implicated in collagen biosynthesis,     extracellular matrix proteoglycans and axon guidance were downregulated (Supplementary Table S1, Data Citation 2). This pattern of gene expression suggests a progenitor or an early differentiation state. Cells from subpopulation two contained upregulated genes associated with control of the Notch protein expression, implicated in the neuronal function and development, and DNA repair (Supplementary Table  S2, Data Citation 2). Collectively, this pattern of gene expression is indicative of a more differentiated RGC phenotype than the cells in cluster one. The 288 cells identified as subpopulation three contained upregulated genes involved in axon guidance, together with semaphorin interactions, cell-extracellular matrix interactions and extracellular matrix proteoglycans. Furthermore, we observed significant downregulation of multiple genes associated with cell cycle (Supplementary Table S3, Data Citation 2). Taken together this indicates that this subpopulation three represents a more mature neuronal phenotype compared to cells in the other two subpopulations. Of note, one cell within subpopulation one was found to express OPN4 a gene known to be expressed in intrinsically-photosensitive RGCs. These data indicate different levels of maturity of ESC derived RGC, with this conclusion supported by observed pathway enrichment (Supplementary Table S4-6, Data Citation 2). We have also conducted differential expression and pathway enrichment analysis to explore expression of genes associated with different RGC subtypes. In our analysis, we identified a number of genes that were previously shown to be expressed in various RGC subpopulations (Fig. 5). These genes could mark at least 9 separate RGC subtypes that grouped into three clusters 30 . This discrepancy may be explained by the fact that the number of RGC subtypes was estimated based on their morphological features but also expression of the molecular markers, including cell surface protein THY1, transcription factors from the Brn (Pou4F) family and RNA-binding protein RBPMS 30 . In our experimental design, we included the RGC enrichment step by sorting differentiated cells for THY1 7 . Limitation of this approach is exclusion of cells that could have the RGC identity without THY1 expression; however, we chose this marker as it is the cell surface protein and thus allows the maintenance of live cells in culture prior to sequencing or further characterisation.

Usage Notes
Our experiment was designed to assess the different subpopulations of RGCs post differentiation from hESCs. hESC-derived RGCs obtained in a 36-day guided differentiation clustered into distinct subpopulations of neurons. Our initial analysis identified a group of 61 cells that showed a strong enrichment of stress and apoptosis pathways. This is possibly due to the FACS procedure itself, which can be stressful on cells. All post quality-control cells express genes relevant to RGC structure and functions. Altogether, our data provides strong support of an RGC identity of the cells in all clusters.