Introduction

Single cell RNA sequencing (scRNA-seq) has vastly improved our ability to determine gene expression and transcript isoform diversity at a genome-wide scale in different populations of cells. scRNA-seq is becoming a powerful technology for the analysis of heterogeneous immune cells subsets1,2 and studying how cell-to-cell variations affect biological processes3,4. Despite its potential, scRNA-seq data are often noisy, which are caused by a combination of experimental factors, such as the limited efficiency in RNA capture from single cells, and also by analytical factors, such as the challenges in separating true variation from technical noise5,6,7. The quality of scRNA-seq data depends on mRNA capture efficiency8, the protocol utilised to obtain libraries, as well as sequence coverage and length3,4. Bioinformatics tools for the analyses of scRNA-seq data have been rapidly evolving, whereby various algorithms have been proposed to resolve the issues related to scRNA-seq compared to classical bulk transcriptomic analysis9,10,11. However, the lack of a consensus in the data analyses further contributes to difficulties in assessing the quality of the data analysed so far.

One important consideration in designing scRNA-seq experiments is to decide on the desired sequencing depth (i.e., the expected number of reads per cell) and read length3,6. These are two important experimental parameters that can be controlled, and which need to be often predetermined before sequencing. For bulk RNA-seq data, sequencing depth and read length are known to affect the quality of the analysis12. For scRNA-seq it has been shown that half a million reads per cell are sufficient to detect most of the genes expressed, and that one million reads are sufficient to estimate the mean and variance of gene expression13. Low coverage scRNA-seq has also been utilised to show that 50,000 reads per cell are sufficient to classify a cell type in a sample of 301 cells14. Nevertheless, this may not be sufficient when more homogenous populations are involved, for example T cell subsets, such as central memory and effector memory cells. In these scenarios, deep sequencing of single cell library may be required for improving detection of genes with low expression3,6. Indeed, an important issue for scRNA-seq data is the very large number of genes with no detectable expression in a cell6. This overrepresentation of zeros in scRNA-seq datasets makes it difficult to distinguish technical dropout of transcripts from true biological variation between cells3 and statistical methods have been developed to at least control this issue (e.g.15).

Nonetheless, there has not been any systematic evaluation of the effect of sequencing depth and read length on scRNA-seq data analysis. In designing an scRNA-seq experiment it is optimal to generate data by maximising sequencing depth and utilising the longest read length. This approach would improve the quality of the reads alignment and also maximise the chance of detecting low abundant transcripts. In reality, we are often constrained by the cost of sequencing. Therefore a more practical question is to ask what is the minimum sequencing depth and read length that allows users to obtain adequate information for their desired downstream analyses.

To answer these questions, we have focussed on assessing the quality of available scRNA-seq data from T cells, which form a highly heterogeneous population of lymphocytes that play a vital role in mounting successful adaptive immune responses against intracellular pathogens and tumours16. T cells are also characterised by a highly diverse repertoire of T cell receptors (TCRs), which identify the specific recognition of the cognate antigen. TCRs are heterodimer proteins composed of two chains, α and β, and a subset of those expressing the γδ chains, which result from genetic recombination of the V(D)J genes. The diversity of TCRαβ repertoire has been associated with successful control of many pathogens17, and more recently with outcome of checkpoint inhibitor immunotherapy for patients with metastatic melanoma18. The third complementarity-determining region (CDR3) of the TCR α and β chains forms loops that engage amino acid residues of peptides in complex with MHC. Detection of the CDR3 region is a crucial step to accurately identify the clonality of a T cell repertoire, for instance responding to a viral infection. The highly polymorphic nature of the TCR genes has made their identification very difficult in bulk population sequencing datasets. In the last decade, deep sequencing approaches of bulk TCRs focussed on either α or β chains19. The advent of scRNA-seq allowed the identification of the full length TCR of both α and β chains (referred to hereafter as TCRαβ) from T cells20,21. This has now led to the capacity to simultaneously detect TCRαβ and full gene expression profiles in one experiment, thereby allowing direct study of TCR diversity and its interaction with the T cell functions reflected in gene expression profiles.

In this study we performed a comprehensive analysis of the impact of sequencing depth and read length on the detection of full length TCRαβ sequences, as well as estimation of gene expression and its effect on cell-type identification. Our study aims to fill this gap through performing a re-analysis of eight published scRNA-seq data that have a wide range of read length and sequencing depth, and analysis of simulated datasets that were subsampled from a deeply sequenced human T-cell scRNA-seq dataset. The analysis suggests important precautionary steps for researchers seeking to maximise throughput of single cell experiments without compromising the quality of the results.

Results

To assess the effects of sequencing depth and read length on accurate reconstruction of full length TCRαβ and gene expression profile from scRNA-seq data, we manually reviewed NCBI’s Gene Expression Omnibus22 and ArrayExpress23 to identify relevant T-cell scRNA-seq data published prior to April 2016. Eight datasets were identified with accessible data, collectively profiling 1,305 single cells (Table 1). The datasets were generated from mouse21,23,24,25,26 and human-derived cells20, utilising one of the available versions of the Smart-Seq protocol27, and had a wide range of sequencing depth (1.2–8.4 million paired-end (PE) reads per cell) and read length (25–215 bp) (Table 1). The mean number of expressed genes in each data set ranged between 2,354 and 6,795 (Table 1).

The effect of sequencing depth and read length on reconstruction of full-length T-cell receptors

We analysed whether sequencing depth and read length affect the detection and reconstruction of TCRαβ. Two recently developed bioinformatics methods for reconstruction of full-length TCRαβ from scRNA-seq data were used, TraCeR21 and VDJPuzzle20. The analysis performed with VDJPuzzle revealed successful TCRαβ reconstruction in 1027 cells (79%) (Table 2). This result was consistent with the results from TraCeR, with successful TCRαβ reconstruction from 953 cells (73%) (Table S1). Six of the eight datasets had a success rate >80% in detection of TCRαβ, and up to 100% for the scRNA-seq dataset with an average read length of 215 bp. The two datasets with lowest detection rate of TCRαβ had 25 and 32 bp long reads, where only 0% and 1.89% of the cells successfully generated TCRαβ sequences, respectively (Table 2 and Fig. 1A). In terms of sequencing depth, in datasets with less than 1.5 million PE reads TCRαβ were successfully detected in less than 1% of the cells, and this success increased rapidly to >80% for depths >0.25 million PE reads (Fig. 1B).

To further assess the quality of the reconstruction of TCRαβ sequence, we analysed the distribution of CDR3 amino acid sequences across both α and β chains, and the distribution of single cells carrying double α chains. The average CDR3 length of the reconstructed TCRαβ sequences with VDJPuzzle was 14 amino acids for both α and β chains (Fig. 1C and D), with similar results using TraCeR (Fig. S1). This result showed a distribution of CDR3 lengths consistent with those previously estimated with other methods, such as 5′-Race for single cell TCR analysis28.

One of the major advantages of using scRNA-seq to reconstruct TCR sequences is the possibility to detect double α chains within a single T cell. Overall, 30% (n = 395) of the cells analysed here presented more than one α but not double β. In a single study (datasets 3 and 4 in Table 1), 43% (n = 333) of the cells sequenced presented more than one α, and 44% (n = 337) had more than one β sequence detected. Notably, 29% (n = 225) of these cells had both more than two unique α and two unique β chain sequences, thus suggesting that in this study multiple cell could have been sorted in a single well. In support of this conclusion, the plot of the number of unique genes identified in those cells with two or more than unique α and two unique β chain sequences showed a significantly higher number of gene counts when compared to the remaining cells (Fig. S2). By filtering out cells with more than one α and one β, a total of 309 unique TCRαβ sequences were identified across all datasets. There was no clonotype (defined as cells bearing identical TCRαβ) overlapping between datasets.

Effect of sequencing depth and read length on the TCRαβ detection using simulated datasets

To systematically investigate the effect of sequencing depth and read length, we generated simulated datasets with different sequencing depth and read length to assess the success rate of TCRαβ reconstruction. Simulated datasets were all derived from the original datasets 1, which had deep coverage (~8.4 million PE reads per cell), and long read length (150 bp) (Table 1). The original datasets consisted of a total of 54 single cells originated from HCV specific CD8+ T cells from a single subject that previously cleared HCV. Of these cells, 18 were directly sorted from peripheral blood mononuclear cells (PBMC-derived T cells) and the remaining 36 were sorted after in vitro expansion following stimulation with cognate antigen. Of these 36, 18 were sorted after a second antigen restimulation 24 hours prior to sorting20). From each of the original single cell data (n = 54), we generated 16 randomly subsampled scRNA-seq datasets with all combinations of four different sequencing depths (0.05, 0.25, 0.625 and 1.25 million PE reads) and four different read lengths (25, 50, 100 and 150 bp) (Fig. 2A). For each of the 16 subsampled datasets, the TCRαβ sequence was reconstructed using VDJPuzzle20, and the success rate was calculated (Figs 2B and S3). Only TCRαβ sequences with a complete CDR3 recognised by the international ImMunoGeneTics information system (IMGT,29) were considered as an exact TCRαβ reconstruction.

Success rate of paired α and β was above 80% for datasets which had a minimum read length of 50 bp and a depth of at least 0.25 million reads. This rate was substantially diminished up to 0% for datasets with a number of PE reads per cell below 0.25 million PE reads (Fig. 2B). Finally, the proportion of cells with double α detected was also proportional to both read length and sequencing depth, with the highest success rate corresponding to a depth of 1.25 million PE reads and a read length above 100 bp (Fig. S4). The relationship between the success rate of TCRαβ reconstruction and both sequencing depth and read length was fitted with a sigmoidal function (Fig. S3). The success rate in TCRαβ reconstruction from the experimental datasets (the real dataset) closely followed this specific relationship (r = 0.97, p < 0.01).

The effect of read length and sequencing depth on the quantification of the gene expression profile

To assess the effect of this trend on the quantification of genes, fragments per kilo base per million (FPKM) were calculated allowing only one alignment per read, hence eliminating a potential confounding factor of multiple alignments. We found that the number of detectable expressed genes (those with FPKM >1) was positively correlated with sequencing depth (Pearson correlation = 0.89) but negatively correlated with read length (Pearson correlation = −0.93). The number of genes that were expressed in at least 10% of the cells showed a similar correlation with sequencing depth and read length (Table 3, Fig. 4A). Notably, there was a positive relationship between number of genes expressed among cells within the same dataset and read length for sequencing depth smaller than 0.625 million PE reads, while there was no variation at higher sequencing depths (Fig. 4B).

In order to quantify the reliability of the gene expression profile as a function of read length and sequencing depth, two simulated datasets with a sequencing depth of 0.05 million PE reads were generated, with read length of 25 bp and 150 bp, respectively. Two replicates for each dataset were simulated. This analysis showed a significantly higher correlation between the gene expression profiles of paired cells from the two replicates with read length 150 bp when compared to the two replicates with read length 25 bp (Fig. 4C). This result suggested that gene expression profiles from short read length dataset have higher levels of technical noise.

To further assess how the technical variation generated by shorter read length and lower sequencing depth affects the identification of the three cell sub-populations available from the experimental scRNA-seq data of HCV specific T cells20, a clustering algorithm was applied on all the simulated datasets. A newly developed bioinformatics tool CIDR30 was used to perform dimensionality reduction, Principal Coordinates Analysis (PCoA) and hierarchical clustering on the scRNA-seq gene expression profiles. When cutting the hierarchical clustering to form three clusters based on the experimentally validated cell subsets in the original data (i.e., the ground truth defined by the known cell types in the data set20), CIDR achieved the tightest clustering when the dataset has > = 100 bp long PE reads (Figs 5A,B and S5). This was evident by the higher misclassification rate calculated from the clustering analysis with shorter read length: 28% (15/54 of cells were misclassified for read length 25 and 50 bp, and 9% (5/54 for read length 100 and 150 bp (Figs 5C and S5). Sequencing depth did not affect the misclassification rate. To investigate whether the ‘tightness’ of the clustering is affected by sequencing depth and read length, the within-cluster-sum-of-squares of each cell type was computed. Consistent with the misclassification analysis, longer reads led to tighter clusters, reflected by a substantial decrease in within-class-sum-of-squares for PBMC derived Ag CD8+ T cells (Figs 5D and S5). The effect of read length was less pronounced for the other two in vitro expanded subpopulations, as these are biologically more close to each others when compared to the blood derived original population.

To analyse the effect of read length and sequencing depth on specific gene categories, the distribution of gene expression levels (in terms of log(FPKM)) was analysed for highly expressed genes (average FPKM >100), lowly expressed genes (average FPKM <100), housekeeping genes, and transcription factors in all the subsampled simulated datasets. Independent of the gene category, there was a reduction in the number of genes identified with an expression level below 100 FPKM in datasets with a low sequencing depth (<0.05 PE reads x million, Fig. S6). This effect was more evident among the transcription factors, where a combination or short read length and low depth led to a complete loss of lowly expressed genes. There was an increase in the frequency of highly abundant genes with the decrease of read length. To illustrate these trends, six individual genes were considered: three housekeeping genes (GAPDH, RPL7A, and RPL34), two genes constitutively expressed in CD8+ T cells (CD8B and TRAC), and one transcription factor (GAS5, which is associated with T cell proliferation31). The analysis showed that, contrary to the expectation, the gene expression profile of the selected housekeeping genes varied significantly for low depth and short reads (Fig. 6). Notably, the housekeeping gene RPL34 did not vary as much as GAPDH and RPL7A for low depth and short reads (Fig. 6). GAPDH and CD8B expressions were positively correlated with the read length, while a significant variability was detected for GAS5, independent of sequencing depth and read length. TRAC did not show any substantial variation.

Discussion

Here we have showed that clustering analysis with scRNA-seq data characterised by 0.25 million PE reads and length >50 bp can accurately distinguish three rare and relatively homogenous cell subsets of antigen specific T cells, thus suggesting that despite technical limitations current scRNA-seq data can successfully be applied to identify differences between rare T cell subsets, such as effector and memory subtypes during an immune response against a viral infection. On the other hand, this may not be sufficient when more homogenous populations are involved, such as central memory and effector memory T cells from the same antigen specific repertoire. Deep sequencing of single cell library may still be required to improve detection of low abundant transcripts. Indeed, an important issue for scRNA-seq data is the very large amount of genes with zero expression6. This observation results from real zero expression genes that a single cell may have at the time of RNA extraction, as well as dropout events, which are due to inefficient mRNA capture and library processing.

Full length TCRαβ can be accurately estimated and linked to the gene expression profile of the same cell. The analysis also showed multiple instances of single cells with at least two α and two β sequences detected. These findings are likely explained by the presence of multiple cells per well being sequenced, and the higher detection rate of double chains in the Th17 dataset (datasets 3 and 4) is likely due to the larger sample size compared to the other studies. The high success rate obtained with both available software programs further support the high quality of the scRNA-seq data, which significantly improve the quality of TCR reconstruction with more classical approaches such as bulk sequences and Sanger sequencing. Along with TCRαβ full-length data, the entire transcriptome can be interrogated to identify specific gene profiles associated to T cell subsets, along with the relationship with the TCRαβ clonotypes. Single cell approaches are therefore likely to increase further the accurate identification of novel markers, which could be utilised for detecting novel subpopulations of cells, for instance using flow cytometry. Another improvement is to introduce bar coding of the cell, with approaches such as MARS-seq34. These approaches however still lacks the incorporation of full-length transcriptome sequencing, hence affecting the accurate detection of full length TCRαβ. In conclusion, this study showed that future analyses should consider the quality of sequencing output to ensure reliable and accurate single cell transcriptomic profiling.

This study focussed only on T cells and scRNA-seq protocols available that allow full-length sequencing of transcripts. Barcoded methods, such as those with microfluidics based technologies and unique molecular identifiers (UMIs) are restricted to sequencing of a small fragment of the mRNA transcript, thus limiting the detection of full TCR sequences. At the time of this analysis there were no studies that performed UMI-based protocols for T cell subsets. A recent study has utilised a cell line to benchmark four UMI-based protocols (CEL-seq. 2, Drop-seq, MARS-seq, SCRB- seq) against Smart-seq. 1, and Smart-seq. 235. The analysis tested the sensitivity of UMI and Smart-seq methods on the same samples by assessing the number of genes detected as a function of sequencing depth. This analysis showed that Smart-seq. 2 had the highest sensitivity of detection, and 0.625 Million PE reads were sufficient to obtain an optimum number of genes, which is in line with our simulations. UMI-methods however quantified mRNA levels with less amplification noise due to the use of UMI. A comprehensive analysis of 15 experimental protocols utilizing ERCC spike-ins on the effect of sequencing depth and on the limit of detection (lower molecular-detection limit, for a given sequencing depth), confirmed high sensitivity of Smart-seq protocols and comparable but variable accuracy when compared to UMI-based methods36. Consistent with our findings this analysis also showed that sensitivity is highly dependent on sequencing depth, experimentally confirming our value for the optimal sequencing depth (>0.625 million PE reads, i.e. 1.25 million reads in total) whereby the increase in read depth from 1 million reads to 4.5 million reads per sample results in marginally increased sensitivity. Finally, UMI-based methods can be modified to obtained VDJ region of the TCR at the single cell level. For instance 10x Genomics has recently released a new protocol to sequence VDJ sequences from bar coded T cells (https://www.10xgenomics.com/vdj/). This approach however does not provide simultaneous analysis of gene expression and TCR repertoire from the same cell.

In conclusion, our results based on T cell data are consistent with benchmark studies on other cells, showing that the full-length scRNA-seq methods provide good accuracy, high sensitivity and larger potential for applications in T cell biology. Future development of barcoded technologies that allow full-length transcriptomic sequencing will allow the application of these technologies to a larger number of cells, enabling comprehensive study of the role of T cells in experimental models and human disease states.

Material and Methods

ScRNA-seq data

Raw data were downloaded from NCBI’s Gene Expression Omnibus and ArrayExpress (Table 1).

Generation of simulated datasets

Simulated data were obtained by generating subset of reads from Dataset 1 in Table 1 by randomly reducing read length and sequencing depth using an in-house python scripts. Original dataset consisted of 54 scRNA-seq data all with read length of 145 bp and sequencing depth of about 8.4 million PE reads. Sixteen combinations of read length and sequencing depth were considered: read length of 25, 50, 100 and 150 bp; and sequencing depth of 0.1, 0.5, 1.25 and 2.5 million. A new set of paired fastq file for each combination was then generated. For each dataset, reduced depth was obtained by randomly subsampling the original set of PE reads while the shorter read length was obtained by randomly cropping original PE reads.

Gene expression quantification

PE reads were analysed for quality control using FastQC, and reads were trimmed using Trimmomatic37. For Trimmomatic we used the following parameters: Nextera adapters, leading = 3, trailing = 3, window length = 4, window quality = 15, average quality = 20, minimum length was chosen according to the read length of the dataset. Alignment of PE reads was performed with TopHat2. For the alignment, the default option was used, (https://ccb.jhu.edu/software/tophat/manual.shtml).

Gene expression was estimated with the pipeline available in Cufflinks 2.2.1, utilising CuffQuant with parameter–max-frag-multihits equal to 1, which allows maximum one alignment per fragment. Gene expression quantification (in FPKM) was normalised with CuffNorm using default parameters. Resulting FPKM values were manipulated in R using the package Monocle (version 2)38, using the detectGenes function to count and filter genes by FPKM value. Downstream analysis, which included Pearson’s correlation analysis, number of genes expressed, and gene expression analysis by gene categories, was performed with an in-house R script. Transcription factors and housekeeping genes have been selected from available list in the literature39,40.

Dropout rate and clustering analysis

Dropout analysis, principal coordinates analysis and clustering were performed using CIDR30, which requires raw read counts as input data. The tool featureCounts was used to obtain the read counts, with the–primary option to allow only primary alignments. The dropoutCandidates Boolean matrix output by the determinDropoutoutCandidates method of CIDR is used to calculate the figures in Table 3 and Fig. 5 – a gene is considered ‘expressed’ in a sample if the corresponding entry in the dropoutCandidates matrix has a value of FALSE.

For clustering, the CIDR parameters nCluster and wThreshold were set to be 3 and 6 respectively, while the other CIDR parameters were left as defaults. Within each cluster, the first two CIDR principal coordinates were used to calculate Euclidean distances between all pairs of samples, the squares of which sum to the within-class sum of squares.

Misclassification rate was used to evaluate the accuracy of clustering, which is defined as the number of misclassified cells divided by the total number of cells. To define misclassified cells, each CIDR cluster is associated with the ground truth cluster, which gives the biggest intersection, and those cells that are not in the intersection are counted as misclassified cells.

Reconstruction of TCRαβ

TCRαβ of the downloaded dataset were reconstructed using VDJPuzzle41. A second method was used to validate the VDJPuzzle result using the program TraCeR21 (see Supplementary Text). VDJPuzzle algorithm is briefly outlined in the following steps (see41 for more details): for each single cell and for each chain: i) it aligns the reads to the full reference genome; ii) it extracts the reads that align to one known VDJ gene or constant region and assemble these reads using a de novo assembly algorithm (Trinity42); iii) all reconstructed sequences with a match to the IMGT database are collected in a preliminary TCR sequence identification; iv) the original reads are re-aligned (using bowtie 243) against the putative TCR sequences, to include additional reads that could have been lost in the first alignment;.v) resulting aligned reads aligning to the preliminary repertoire are assembled with Trinity and interrogated to IMGT with MigMap, a smart wrapper for IgBlast (https://github.com/mikessh/migmap).

Successful reconstruction of a TCR from a single cell was defined as at least one complete and in-frame TCR sequence (α, β, or both) identified in the IMGT database. The exact procedure was performed as previously reported20.

The fit of the proportion of cells with successful TCRαβ reconstruction as a function of read length and sequencing depth was performed using a two-dimensional sigmoidal function implemented in the scipy package in python (the “curve_fit”)

$$S{R}_{\alpha ,\beta }=\,\frac{q1}{1+{e}^{-a(x-{x}_{0})}}\,\times \,\frac{q2}{1+{e}^{-b(y-{y}_{0})}}$$
(1)

Where x represents the read length and y represents the sequencing depth. The obtained fitting values are $$q1=0.94,\,q2=0.95,\,a=0.52,\,b=98.67,\,{x}_{0}=24.2,\,{y}_{0}=8.85$$.

Availability of data and materials

The scRNA-seq data analysed are freely available and accession numbers are provided in Table 1.