Paired rRNA-depleted and polyA-selected RNA sequencing data and supporting multi-omics data from human T cells

Both poly(A) enrichment and ribosomal RNA depletion are commonly used for RNA sequencing. Either has its advantages and disadvantages that may lead to biases in the downstream analyses. To better access these effects, we carried out both ribosomal RNA-depleted and poly(A)-selected RNA-seq for CD4+ T naive cells isolated from 40 healthy individuals from the Blueprint Project. For these 40 individuals, the genomic and epigenetic data were also available. This dataset offers a unique opportunity to understand how library construction influences differential gene expression, alternative splicing and molecular QTL (quantitative loci) analyses for human primary cells.


Background & Summary
RNA sequencing (RNA-seq) that utilises next-generation sequencing (NGS) is a powerful tool to understand transcriptional diversity and regulation at bulk and single-cell level. Using RNA-seq, we not only can perform differential gene expression analysis with better resolution, but also comprehensively study alternative splicing, RNA editing and allele-specific expression, and all of which can be extended to investigate molecular quantitative trait loci (QTL) when genotypes are available at a population level.
In a eukaryotic cell, 80% of the total RNAs are ribosomal RNA (rRNA) 1,2 , whereas the other 5% is polyadenylated positive (poly(A)+) mRNA 3,4 . The two most commonly used selection methods, poly(A)-selected and rRNA-depleted (ribo-minus), selectively removes a distinct set of RNAs: poly(A) negative RNAs and rRNA, respectively. The poly(A)-selected protocol enriches poly(A) + transcripts including mRNAs and many non-coding RNAs 5,6 , and also reduces the amounts of pre-mRNAs. It has become a widely used RNA-seq protocol thanks to its low-noise rate. In contrast, rRNA-depleted removes cytoplasmic and mitochondrial rRNA and thus includes poly(A) + mRNA, as well as non-coding RNAs or protein-coding mRNAs that are not polyadenylated 7 . These two library construction protocols, sequencing different fractions of the transcriptome, may lead to complex technical bias. Indeed, it is reported that different RNA sample preparations may result in significant variations in the quantification of gene expression 5 . Furthermore, the influences introduced by these two protocols on splicing quantification, and molecular QTL analysis (such as expression QTL) are largely unknown. Currently, large population genetic studies, such as GTEx 8 , lymphoblastoid cell lines (LCL) from the 1000 genome and HapMap project 9 mostly used poly(A)-selected protocol, therefore lacking the information of non-poly(A)-transcripts. In this study, we constructed both rRNA-depleted and poly(A)-selected RNA-seq libraries for CD4 +T cells from 40 donors in the Blueprint project 10 . This dataset informs understanding of how these two library construction methods affect downstream analyses, and helps develop bioinformatic tools to avoid or reduce artefacts that were introduced during experimental procedures.

Methods
Human subjects and sample collection. These methods are expanded versions of descriptions in our previously published studies 10 . As described previously, blood was obtained from donors who were members of the NIHR Cambridge BioResource (http://www.cambridgebioresource.org.uk/) with informed consent (REC 12/ EE/0040) at the NHS Blood and Transplant, Cambridge. The schematic for sample collection and processes were shown in Fig. 1. A unit of whole blood (475 mL) was collected in 3.2% Sodium Citrate. An aliquot of this sample was collected in EDTA for genomic DNA purification. A full blood count (FBC) for all donors was obtained from the EDTA blood sample, collected in parallel with the whole-blood unit, using a Sysmex Haematological analyser. The level of C-reactive protein (CRP), an inflammatory marker, was also measured in the sera of all individuals. All donors recruited in this study had FBC and CRP parameters within the normal healthy range. Blood was processed within 4 hr of collection.
CD4 + t cell enrichment. Whole blood was diluted 1:1 in a buffer of Dulbecco's Phosphate Buffered Saline (PBS, Sigma) containing 13 mM sodium citrate tribasic dehydrate (Sigma) and 0.2% human serum albumin (HSA, PAA) and separated using an isotonic Percoll gradient of 1.078 g/ml (Fisher Scientific). Peripheral blood mononuclear cells (PBMCs) were collected and washed twice with buffer, diluted to 25 million cells/mL and separated into two layers, a monocyte rich layer and a lymphocyte rich layer, using a Percoll gradient of 1.066 g/ml. Cells from each layer were washed in sampleemented PBS (13 mM sodium citrate and 0.2% HSA) and the subsets were purified using an antibody/magnetic bead strategy. CD4 + naive T cells were negatively selected using an EasySep Human Naive CD4 + T Cell Enrichment Kit (StemCell) according to the manufacturer's instructions. The purity of each cell preparation was assessed by multicolor FACS using conjugated antibodies for CD4 (RPA-T4, BD) and CD45RA (HI100, BD) for naive CD4 + T cells. The purity of the cells was provided in Table 1 www.nature.com/scientificdata www.nature.com/scientificdata/ Aligned reads distribution. In RSeQC package 11 , the geneBody_coverage.py script was used for calculating the gene body coverage of the mapped reads; the read_distribution.py script was used to calculate how mapped reads were distributed over genome feature; the tin.py script was used to evaluate RNA integrity at the transcript level.
Gene expression quantification. The gene expression and "ExonOnly" expression level which only includes exonic reads were quantified using the HTSeq (v0.11.2) 14 , respectively. The parameter of "-s reverse" was set for the strand-specific rRNA-depleted samples, whereas "-s no" for the non-strand specific poly(A)-selected samples. The raw read counts were then normalized by their library size factors and were regularized-logarithm (rlog) transformed to stabilize the variance across the samples using DESeq2 (v1.28.1) 15 . Pearson's correlation coefficients were calculated for each gene between the paired RNA-seq samples from the same individual. For unsupervised clustering analysis, we required that a gene has more than 10 reads in 20% of the samples. These quantification files were stored in public repository 16 .
Batch effect correction. The gene expression quantifications were corrected for batches using ComBat from the sva R package (v3.36.0) 17 .

Data Records
The raw fastq files and aligned BAM files for the paired RNA-seq in naive CD4 + T cells from 40 individuals were deposited at Synapse 18 . The multi-omics processed files, including the sample information, the quantifications of gene expression, isoform, exonOnly (before and after ComBat 17 ), splicing junction from both protocols, the genotype from WGS, the quantification of DNA methylation and Chip-seq of two histone markers (H3K4me1 and H3K27ac) of the corresponding individuals, were uploaded in figshare 16 . The other multi-omics files in three major human immune cell types (CD14+ monocytes, CD16+ neutrophils, and naive CD4+ T cells) from up to 197 individuals were published previously 10   www.nature.com/scientificdata www.nature.com/scientificdata/ CD4+ T cells and genotypes from WGS at European Genome-phenome Archive (EGA) under accession numbers EGAD00001002671 19 and EGAD00001002663 20 , respectively.

technical Validation
Cell purity, RNA integrity and sequencing quality. We used EasySep Human Naive CD4+ T Cell Enrichment Kit for cell enrichment and achieved an average purity of 96% as assessed by multicolor FACS (Table 1). All RNA samples used for library construction had RNA integrity (RIN) values over 8.6 (Table 1). Moreover, the RNA integrity at transcript level was further evaluated using the Transcript Integrity Number (TIN) algorithm, calculated using the tin.py script from the RSeQC package 11 . TIN calculates a score ranging from 0 to 100 for each expressed transcript, and the medTIN (median TIN score across all the transcripts) can be used to measure the RNA integrity at the sample level. All the poly(A)-selected samples have TIN scores above 77, with a mean = 78.58 and SD = 0.64 (Table 1). We constructed both ribosomal RNA depleted and poly(A)-selected RNA-seq libraries for 40 CD4 + naïve T cells ( Fig. 1 and Table 1).
The average of sequencing depth in poly(A)-selected and rRNA-depleted libraries was 38.8 M (SD = 15.93) and 60.3 M (SD = 16.7), respectively. The quality of sequencing per base was assessed using FastQC, and the Q30 is over 81.3%. There is no significant difference in the distribution of average quality score per base between the poly(A)-selected and rRNA-depleted libraries (Fig. 2a). The reads generated using both methods were distributed uniformly across the gene body (Fig. 2b), indicating the high integrity of RNA samples and no obvious 3' bias. The uniquely mapped reads account for over 92% of all reads in all samples, indicating the high quality of sequencing ( Table 1). The quality, sequencing depth and alignment statistics of all RNA samples are shown in Table 1. We further compared the gene regions that the reads of the two protocols mapped to, and found that the poly(A)-selected RNA-seq has a significantly higher level of reads aligned to coding and exonic regions, but has a much lower level in intronic regions (Fig. 2c), consistent with the observation of the previous reports 1,5 .

Gene expression quantification.
To assess the gene expression profile from poly(A)-selected and rRNA-depleted RNA-seq data, we quantified gene expression based on both STAR 12 and HISAT2 13 aligners, respectively. For these 40 paired samples, the average correlation coefficient is over 0.94, with a relatively higher correlation using STAR aligner (mean correlation coefficient increases by 0.006, Wilcoxon test, p = 0.0049). When only concerned exonic reads (ExonOnly) using STAR aligner, the correlation coefficients were further improved (Wilcoxon test, p = 0.034) (Fig. 3a), indicating the intronic reads affect the correlation between the paired samples. The correlation was also illustrated in a scatter plot showing paired poly(A)-selected and rRNA-depleted www.nature.com/scientificdata www.nature.com/scientificdata/ RNA-seq from one of 40 pairs (Fig. 3b), suggesting gene expression between two library protocols are highly correlated with some library-specific expressed genes.
Next, we compared the number and biotypes of the detected gene from both sequencing libraries after filtering the lowly expressed genes (log10 normalized expression less than 1). While on average 78.21% of genes were shared, 12.5% and 9.3% of genes were specific in poly(A)-selected and rRNA-depleted library, respectively. We found that these percentages of shared and library-protocol-specific genes were with little variances (SD ranging from 0.55% to 0.67%) in the difference of sequencing depth (Fig. 3c,d), indicating that our sequencing depths may reach saturation of quantification of the genes expressed in human CD4+ T cells in both library protocols. As expected, the poly(A)-selected protocol tends to identify more protein-coding genes, whereas the rRNA-depleted protocol, given it can identify non-poly(A) genes and has sequence strand information, identify more genes that are misc-RNA, snoRNA and antisense (Fig. 3e).
Unsupervised clustering analysis. Principal component analysis (PCA) of the data profiles in the three cell types (monocyte (n = 196), neutrophil (n = 197), CD4 + (rRNA-depleted n = 40, poly(A)-selected n = 40)) of the Blueprint project revealed that samples from the same cell type clustered closely, regardless of library construction methods (Fig. 4a). Next, we clustered the expression levels of all genes of the 40 paired CD4+ T cells using ExonOnly quantification based on STAR aligner (Fig. 4b). The samples from the two library methods were gathered together, reflecting the batch effects due to the library construction. After applying ComBat 17 , the 40 paired samples from different library construction methods were all clustered together (Fig. 4b), indicating that the improved expression quantification method and an effective batch correction can minimize the biases introduced in library construction.

Usage Notes
In general, the rRNA-depleted protocol captures different RNA species and is more efficient in quantifying linear non-poly(A) transcripts and circular RNAs. Therefore when the researchers are interested in analysing them, rRNA-depleted protocol is the one to use. However, if the protein-coding genes are the primary research targets, the poly(A)-selected protocol can identify more genes using the same sequencing depth and yield much less intronic reads. Nevertheless, our results showed that the shared genes between two protocols are highly correlated after applying ExonOnly quantification and batch correction.
Using these 40 paired poly(A)-selected and rRNA-depleted RNA-seq data from naive CD4+ T cell, the effects of library construction on the quantification of gene expression, alternative splicing and RNA editing can be assessed. One can also use this dataset to test the quantification or batch correction methods which can minimize the biases caused by library protocols. Furthermore, the 40 individuals enrolled in Blueprint Project 10 have comprehensive information including WGS, DNA methylation and Chip-seq of two histone markers (H3K4me1 and H3K27ac). This enables further investigation of the effect of library construction on multi-omics integration analysis and population genetics such as molecular QTLs of expression, splicing and RNA editing.