Impact of sequencing depth and read length on single cell RNA sequencing data: lessons from T cells

Single cell RNA sequencing (scRNA-seq) has shown great potential in measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq allowed the characterisation of transcript sequence diversity of functionally relevant sub-populations of T cells, and notably the identification of the full length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, such as RNA library capture, cell quality, and sequencing output have been suggested to affect the quality of scRNA-seq data, but these factors have not been systematically examined. We studied the effect of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 publically available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. TCRαβ were detected in 1,027 cells (79%), with a success rate between 81% and 100% for datasets with at least 250,000 (PE) reads of length >50 bp. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.


INTRODUCTION
Single cell RNA sequencing (scRNA-seq) has vastly improved our ability to determine gene expression and transcript isoform diversity at a genome-wide scale in different populations of cells. scRNA-seq is becoming a powerful technology for the analysis of heterogeneous immune cells subsets 1,2 and studying how cell-to-cell variations affect biological processes 3,4 . Despite its potential, scRNA-seq data are often noisy, which are caused by a combination of experimental factors, such as the limited efficiency in RNA capture from single cells, and also by analytical factors, such as the challenges in separating true variation from technical noise 5-7 . The quality of scRNA-seq data depends on mRNA capture efficiency 8 , the protocol utilised to obtain libraries, as well as sequence coverage and length 3,4 . Bioinformatics tools for the analyses of scRNA-seq data have been rapidly evolving, whereby various algorithms have been proposed to resolve the issues related to scRNA-seq compared to classical bulk transcriptomic analysis 9,10 . However, the lack of a consensus in the data analyses further contributes to difficulties in assessing the quality of the data analysed so far.
One important consideration in designing scRNA-seq experiments is to decide on the desired sequencing depth (i.e., the expected number of reads per cell) and read length 3,6 . These are two important experimental parameters that can be controlled, and which need to be often predetermined before sequencing. For bulk RNA-seq data, sequencing depth and read length are known to affect the quality of the analysis 11 . For scRNA-seq it has been shown that half a million reads per cell are sufficient to detect most of the genes expressed, and that one million reads are sufficient to estimate the mean and variance of gene expression 6 . Low coverage scRNA-seq has also been utilised to show that 50,000 reads per cell are sufficient to classify a cell type in a sample of 301 cells 12 . Nevertheless, this may not be sufficient when more homogenous populations are involved, for example T cell subsets, such as central memory and effector memory cells. In these scenarios, deep sequencing of single cell library may be required for improving detection of genes with low expression 3,6 . Indeed, an important issue for scRNAseq data is the very large number of genes with no detectable expression in a cell 6 . This overrepresentation of zeros in scRNA-seq datasets makes it difficult to distinguish technical dropout of transcripts from true biological variation between cells 3 .
Nonetheless, there has not been any systematic evaluation of the effect of sequencing depth and read length on scRNA-seq data analysis. In designing a scRNA-seq experiment it is optimal to generate data by maximising sequencing depth and utilising the longest read length. This approach would improve the quality of the reads alignment and also maximise the chance of detecting low abundant transcripts. In reality, we are often constrained by the cost of sequencing.
Therefore a more practical question is to ask what is the minimum sequencing depth and read length that allows users to obtain adequate information for their desired downstream analyses.
To answer these questions, we have focussed on assessing the quality of available scRNA-seq data from T cells, which form a highly heterogeneous population of lymphocytes that play a vital role in mounting successful adaptive immune responses against intracellular pathogens and tumours 13 . T cells are also characterised by a highly diverse repertoire of TCRs, which identify the specific recognition of the cognate antigen. TCRs are heterodimer proteins composed of two chains, α and β , and a subset of those expressing the γ δ chains, which result from genetic recombination of the V(D)J genes. The diversity of TCRαβ repertoire has been associated with successful control of many pathogens 14 , and more recently with outcome of checkpoint inhibitor immunotherapy for patients with metastatic melanoma 15 . The highly polymorphic nature of the TCR genes has made their identification very difficult in bulk population sequencing datasets. In the last decade, deep sequencing approaches of bulk TCRs focusing on either α or β chains 16 .
The advent of scRNA-seq allowed the identification of the full length TCR of both α and β chains (referred to hereafter as TCRαβ) from T cells 17,18 . This has now led to the capacity to simultaneously detect TCRαβ and full gene expression profiles in one experiment, thereby allowing direct study of TCR diversity and its interaction with the T cell functions reflected in gene expression profiles.
In this study we performed a comprehensive analysis of the impact of sequencing depth and read length on the detection of full length TCRαβ sequences, as well as estimation of gene expression and its effect on cell-type identification. Our study aims to fill this gap through performing a reanalysis of eight published scRNA-seq data that have a wide range of read length and sequencing depth, and analysis of simulated datasets that were subsampled from a deeply sequenced human T-cell scRNA-seq dataset. The analysis suggests important precautionary steps for researchers seeking to maximise throughput of single cell experiments without compromising the quality of the results.

RESULTS
To assess the effects of sequencing depth and read length on accurate reconstruction of full  Fig. 1A). In terms of sequencing depth, datasets with less than 0.25 million PE reads resulted in detection of TCRαβ in less than 1% of the cells, and this increased rapidly to >80% for depths >0.25 million PE reads (Fig. 1B).
To further assess the quality of the reconstruction of TCRαβ sequence, we analysed the distribution of CDR3 amino acid sequences across both α and β chains, and the distribution of cells carrying double α chains. The average CDR3 length of the reconstructed TCRαβ sequences with VDJPuzzle was 14 amino acids for both α and β chains ( Fig. 1C and D), with similar results using TraCeR (Fig. S1). This result shows a distribution of CDR3 lengths consistent with those previously estimated with other methods. In addition, we assessed the distribution of single cell carrying double α chains. One of the major advantages of using scRNA-seq to reconstruct TCR sequences is the possibility to detect double α chains within a single T cell. Overall, 30% (n=395) of the cells analysed here presented more than one α but not double β . In a single study (datasets 3 and 4 in Table 1), 43% (n=333) of the cells sequenced presented more than one α , and 44% (n=337) had more than one β sequence detected. Notably, 29% (n=225) of these cells had both more than two unique α and two unique β chain sequences, thus suggesting that in this study multiple cell could have been sorted in a single well. By filtering out cells with more than one α and one β , a total of 309 unique TCRαβ sequences were identified across all datasets.
There was no clonotype (defined as cells bearing identical TCRαβ) overlapping between datasets.

Effect of sequencing depth and read length on the TCRαβ detection using simulated datasets
To systematically investigate the effect of sequencing depth and read length, we generated simulated datasets with different sequencing depth and read length to assess the success rate of TCRαβ reconstruction. Simulated datasets were all derived from the original datasets 1 and 2, which had the deepest coverage (~8.4 million PE reads per cell) and longest read length ( Table   1). The original dataset consisted of a total of 54 single cells originated from HCV specific CD8+ T cells from a single subject that previously cleared HCV. Of these cells, 18 were directly sorted from peripheral blood mononuclear cells (PBMC-derived T cells) and the remaining 36 were sorted after in vitro expansion following stimulation with cognate antigen. Of these 36, 18 were sorted after a second antigen restimulation 24 hours prior to sorting 17 ). From each cell, we Only TCRαβ sequences with a complete CDR3 recognised by the international ImMunoGeneTics information system (IMGT, 25 ) were considered as an exact TCRαβ reconstruction.
Success rate of paired α and β was above 80% for datasets which had a minimum read length of 50 bp and a depth of at least 0.25 million reads. This rate was substantially diminished up to 0% for datasets with a number of PE reads per cell below 0.25 million PE reads ( Fig 2B). Finally, the proportion of cells with double α detected was also proportional to both read length and sequencing depth, with the highest success rate corresponding to a depth of 0.25 million PE reads and a read length above 100 bp (Fig. S3). The relationship between the success rate of TCRαβ reconstruction and both sequencing depth and read length was fitted with a sigmoidal function (Fig. S2). The success rate in TCRαβ reconstruction from the experimental datasets (the real dataset) closely followed this specific relationship (r=0.97, p <0.01).

The effect of read length and sequencing depth on the quantification of the gene expression profile
Next, we used the 16 subsampled scRNA-seq datasets to investigate the effect of sequencing depth and read length on read alignment and gene expression quantification. Surprisingly, we observed a slight increase in the total number of aligned PE reads in datasets with shorter read length, especially when the read length was below 100 bp (Fig. 3). This higher level of total read alignment at short read length can be attributed to an increased proportion of reads with multiple alignments, and more discordant alignment of PE reads (Fig. 3). Notably, this relationship with read length was also observed for the proportion of concordant pairs aligned, but with a lower proportion for reads of 25 bp long compared to 50 bp. We did not observe any effect of the sequencing depth on read alignment.
To assess the effect of this trend on the quantification of genes, fragments per kilo base per million (FPKM) were calculated allowing only one alignment per read, hence eliminating a potential confounding factor of multiple alignments. . We found that the number of detectable expressed genes (those with FPKM>1) was positively correlated with sequencing depth (Pearson correlation = 0.89) but negatively correlated with read length (Pearson correlation = -0.93). The number of genes that were expressed in at least 10% of the cells showed a similar correlation with sequencing depth and read length (Table 3, Fig. 4A). Notably, there was a positive relationship between number of genes expressed among cells within the same dataset and read length for sequencing depth smaller than 0.625 million PE reads, while there was no variation at higher sequencing depths (Fig. 4B).
In order to quantify the reliability of the gene expression profile as a function of read length and sequencing depth, two simulated datasets with a sequencing depth of 0.05 million PE reads were generated, with read length of 25 bp and 150 bp, respectively. Two replicates for each dataset were simulated. This analysis showed a significantly higher correlation between the gene expression profiles of paired cells from the two replicates with read length 150 bp when compared to the two replicates with read length 25 bp (Fig. 4C). This result suggested that gene expression profiles from short read length dataset have higher levels of technical noise.
To further assess how the technical variation generated by shorter read length and lower sequencing depth affects the identification of the three cell sub-populations available from the experimental scRNA-seq data of HCV specific T cells 17 , a clustering algorithm was applied on all the simulated datasets. A newly developed bioinformatics tool CIDR 26 was used to perform dimensionality reduction, Principal Coordinates Analysis (PCoA) and clustering on the scRNAseq gene expression profiles. When forced to identify three clusters (since there are three celltypes in the dataset), CIDR achieved the best clustering when the dataset has >= 100 bp long ( Fig. 5A, B, Fig. S4). A higher misclassification rate with shorter read length was observed: 28% for read length 25 and 50 bp, and 9% for read length 100 and 150 bp (Fig 5C, Fig. S4).
1 1 of the clustering is affected by sequencing depth and read length, the within-cluster-sum-ofsquares of each cell type was computed. Consistent with the misclassification analysis, longer reads led to tighter clusters, reflected by a substantial decrease in within-class-sum-of-squares for PBMC derived Ag CD8+ T cells (Fig 5D, Fig. S4). The effect of read length was less pronounced for the other two in vitro expanded subpopulations, as these are biologically more close to each others when compared to the blood derived original population.
To analyse the effect of read length and sequencing depth on specific gene categories, the distribution of gene expression levels (in terms of log(FPKM)) was analysed for highly expressed genes (average FPKM > 100), lowly expressed genes (average FPKM < 100), housekeeping genes, and transcription factors in all the subsampled simulated datasets.
Independent of the gene category, there was a reduction in the number of genes identified with an expression level below 100 FPKM in datasets with a low sequencing depth (< 0.05 PE reads x million, Fig. S5). This effect was more evident among the transcription factors, where a combination or short read length and low depth led to a complete loss of lowly expressed genes.
There was an increase in the frequency of highly abundant genes with the decrease of read length. To illustrate these trends, six individual genes were considered: two housekeeping genes (GAPDH, RPL7A, and RPL34), two genes constitutively expressed in CD8+ T cells (CD8B and TRAC), and one transcription factor (GAS5, which is associated with T cell proliferation 27 ). The analysis showed that, contrary to the expectation, the gene expression profile of the selected housekeeping genes varied significantly for low depth and short reads (Fig. 6). GAPDH and CD8B expressions were positively correlated with the read length, while a significant variability was detected for GAS5, independent of sequencing depth and read length. TRAC did not show any substantial variation.

DISCUSSION
This study aimed to explore how sequencing depth and read length of scRNA-seq dataset affect various downstream analyses, such as transcript reconstruction, gene expression estimation and cell-type identification. The overall messages of this study can be summarised with two major

ScRNA-seq data
Raw data were downloaded from NCBI's Gene Expression Omnibus and ArrayExpress (Table   1).

Generation of simulated datasets
Simulated data were obtained by generating subset of reads from Dataset 1 in Table 1

Gene expression quantification
PE reads were analysed for quality control using FastQC, and reads were trimmed using Trimmomatic 29 .
Alignment of PE reads was performed with TopHat2. For the alignment, the default option was used, (https://ccb.jhu.edu/software/tophat/manual.shtml) which corresponded to allow for only one alignment per read. In case of multiple alignments of the same read, the primary alignment was considered.
Gene expression was estimated with the pipeline available in Cufflinks 2. For clustering, the CIDR parameters nCluster and wThreshold were set to be 3 and 6 respectively, while the other CIDR parameters were left as defaults. Within each cluster, the first two CIDR principal coordinates were used to calculate the distances between all pairs of samples, the squares of which sum to the within-class sum of squares.
Misclassification rate was used to evaluate the accuracy of clustering, which is defined as the number of misclassified cells divided by the total number of cells. To define misclassified cells, each CIDR cluster is associated with the ground truth cluster, which gives the biggest intersection, and those cells that are not in the intersection are counted as misclassified cells.

Reconstruction of TCRαβ
TCRαβ of the downloaded dataset were reconstructed using VDJPuzzle 17 and TraCeR 18 . The exact procedure was followed as previously reported 17,18 .
The fit of the proportion of cells with successfull TCRαβ reconstruction as a function of read length and sequencing depth was performed using a two-dimensional sigmoid function implemented in the scipy package in python (the "curve_fit")           (Table 1).