An optimised tissue disaggregation and data processing pipeline for characterising fibroblast phenotypes using single-cell RNA sequencing

Single-cell RNA sequencing (scRNA-Seq) provides a valuable platform for characterising multicellular ecosystems. Fibroblasts are a heterogeneous cell type involved in many physiological and pathological processes, but remain poorly-characterised. Analysis of fibroblasts is challenging: these cells are difficult to isolate from tissues, and are therefore commonly under-represented in scRNA-seq datasets. Here, we describe an optimised approach for fibroblast isolation from human lung tissues. We demonstrate the potential for this procedure in characterising stromal cell phenotypes using scRNA-Seq, analyse the effect of tissue disaggregation on gene expression, and optimise data processing to improve clustering quality. We also assess the impact of in vitro culture conditions on stromal cell gene expression and proliferation, showing that altering these conditions can skew phenotypes.


Single-cell RNA sequencing analysis
Single-cell RNA sequencing (scRNA-seq) was performed using a custom microfluidic platform 2 . Drop-seq experiments were performed as per Macosko et al. 3 with the following adjustments: 15 PCR cycles, 500ng from either 2000 (granulomatous inflammation), 300 (tissue disaggregated for 60 minutes) or 100 (tissue disaggregated for 15 minutes or in vitro) cells were used for Nextera XT (Illumina) library preparation. Sample libraries were loaded to a NextSeq 500 sequencer (Illumina) for paired-end read sequencing. Raw sequencing reads were aligned and collapsed onto cell barcodes (corresponding to individual mRNA capture beads) using the Drop-seq toolkit 4 . Digital mRNA transcript counting was performed using the DigitalExpression program 3 , creating a digital gene expression (DGE) matrix for downstream analysis.
The DGE was pre-processed to remove low-quality encapsulation events and apoptotic cells as follows: a novel random forest classifier 5 was used to identify low-quality events for removal from further analysis (detailed in the results section). Potentially apoptotic cells were identified as those with between the median and 2.5 MAD above the median values for percent mito, and were excluded from downstream analysis. Likely doublet encapsulation events were identified using a plot showing the total number of genes against the counts per cell: outliers for number of genes (cells with greater than 3000 genes) on this plot were excluded.
Bioinformatic analysis was performed using the Seurat package in R (version 2.3.1, Mac OS) 6 . The filtered DGE matrix (2013 cells, 31587 genes) was transformed by log((counts + 1)/10000), henceforth termed "expression level" for simplicity. Principle component analysis (PCA) dimensionality reduction was performed using the most highly variable genes (identified as outliers on a mean variability plot). Clusters were identified by the Seurat FindClusters function, which uses a shared nearest neighbour-based algorithm. Further dimensionality reduction using t-distributed Stochastic Neighbour Embedding (tSNE) was then carried out to enable visualisation of the clustering results in 2D. Analysis of clustering quality was performed in R using the silhouette analysis function included in the Cluster package (version 2.0.7-1, Mas OS) 7 .
Following identification of cell clusters, genes differentially expressed by each cluster were identified using a bimodality test (significance taken as a false discovery rate adjusted p <0.001) 8 . The differentially expressed genes were compared to gene sets from the Immunological Genome 9 and LungGENS 10 projects using the ToppFun gene set enrichment tool 11 , to assign a cell type identity to each cluster. The single-cell mRNA-seq data from this publication are available from the Gene Expression Omnibus as GSE126111.

Selecting fibroblast markers for FACS analysis
Fibroblasts are known to be heterogeneous with respect to surface marker expression: no single marker will define all fibroblasts. Platelet-derived growth factor alpha (PDGFR-) has been described as a robust marker of fibroblasts 12 , and is expressed by up to 90% of fibroblasts in solid tumours 13 . CD90 (Thy-1) is a glycosylphosphatidyl-inositol-linked cell surface glycoprotein that is expressed by the majority of normal lung fibroblasts 14 .
We used FACS to investigate the impact of varying disaggregation enzyme and incubation time on the isolation of different cell types from lung tissues. Immune, epithelial and endothelial cells were identified using the well-described markers CD45 15 , EpCAM 16 and CD31 17 , respectively. As stated previously, there is currently no single marker that will consistently identify all fibroblasts. Thus, these cells are commonly defined by their lack of expression of markers of other cell lineages. In order to compare accurately the isolation of fibroblasts from tissues, we assessed the suitability of using PDGFR-, - and CD90 as positive surface markers of fibroblasts in two primary foetal fibroblast cell lines (skin; HFFF2 and lung; IMR-90; Fig. S1). This showed CD90 to be a highly sensitive marker for IMR-90 cells (99.2% positive, 1175-fold increase in median fluorescence intensity; MFI). CD90 out-performed PDGFR-α and -β (9.2% of cells positive, with a 19-fold increase in MFI), and we therefore used CD90 positivity to identify fibroblasts from tissue samples. It is of note, however, that CD90 was expressed by only 47% of HFFF2 cells: it is not ubiquitously expressed by all cells. It is therefore likely that some fibroblasts will remain unstained using this FACS panel. Following single-cell transcriptome capture with the Drop-seq platform, a tagmented library molarity in excess of 3 pmol/l is required for subsequent sequencing 3 . Use of the described optimised disaggregation protocol generates in excess of this minimum threshold (Fig. S4), as determined using the BioAnalyzer High Sensitivity Chip (Agilent).