Short-read and long-read RNA sequencing of mouse hematopoietic stem cells at bulk and single-cell levels

Hematopoietic stem cells (HSCs) lie at the top of the differentiation hierarchy. Although HSC and their immediate downstream, multipotent progenitors (MPP) have full multilineage differentiation capacity, only long-term (LT-) HSC has the capacity of long-term self-renewal. The heterogeneity within the HSC population is gradually acknowledged with the development of single-cell RNA sequencing and lineage tracing technologies. Transcriptional and post-transcriptional regulations play important roles in controlling the differentiation and self-renewal capacity within HSC population. Here we report a dataset comprising short- and long-read RNA sequencing for mouse long- and short-term HSC and MPP at bulk and single-cell levels. We demonstrate that integrating short- and long-read sequencing can facilitate the identification and quantification of known and unannotated isoforms. Thus, this dataset provides a groundwork for comprehensive and comparative studies on transcriptional diversity and heterogeneity within different HSC cell types. Measurement(s) short term hematopoietic stem cell • long term hematopoietic stem cell • multipotent stem cell Technology Type(s) RNA sequencing Factor Type(s) read length Sample Characteristic - Organism Mus musculus Measurement(s) short term hematopoietic stem cell • long term hematopoietic stem cell • multipotent stem cell Technology Type(s) RNA sequencing Factor Type(s) read length Sample Characteristic - Organism Mus musculus Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.16759069


Background & Summary
Hematopoiesis starts with a population of self-renewing hematopoietic stem cells (HSCs), which produce a series of increasingly more lineage-committed progenitor cells, ultimately giving rise to all types of mature blood cells. In the conventional model, long-term (LT) HCSs differentiates into short-term (ST) HSCs and subsequently multipotent progenitors (MPPs). Although all three populations have full multilineage differentiation capacity, they gradually lose self-renewal ability. Heterogeneity is presented in both HSC and MPP populations with distinct lineage bias 1 .
Both transcriptional and post-transcriptional regulations are pivotal in balancing the constitutive and low-level HSC turnover and downstream differentiation and hematopoietic reconstitution. In multicellular organisms, alternative splicing is a key post-transcriptional regulation mechanism that expands transcript diversity. Accumulating studies have revealed that alternative splicing patterns are essential during hematopoiesis. For instance, alternative splicing events specific in blood progenitors 2 or megakaryocyte and erythrocyte lineage commitment 3 were identified. Alternative splicing patterns in key hematopoietic regulators, such as HMGA2 was found to impact HSC molecular identity 4 . Moreover, aberrant AS is a hallmark of various types of cancer 5 , including leukemia 6 .
RNA sequencing (RNA-seq) that utilises short-read next-generation sequencing (NGS) or long-read sequencing (such as PacBio and Oxford Nanopore Technologies sequencing) is a powerful tool in understanding transcriptional diversity and regulation during various biological processes, including hematopoiesis 2 . While NGS is more reliable in expression quantification, the short reads can only provide limited information in AS events across the splicing joints. In contrast, the long-read sequencing methods offer a unique opportunity to examine alternative splicing isoforms with unprecedented accuracy in providing full-length information.
To purify LT/ST-HSC and MPP, cells were sorted on a BD FACSAria SORP according to the following markers: For single-cell RNA-sequencing (scRNA-seq), cells were individually sorted into 8-Strip PCR tubes containing the lysis buffer. For bulk RNA-seq, 100 cells (P100) were sorted into one PCR tube as a biological replicates. The batch information for cell sorting was included in Online-only Tables 1, 2.
Library construction and sequencing. The cDNA libraries were generated following the Smart-seq2 protocol 7 and the Illumina Nextera XT DNA preparation kit was used. The batch information of cDNA generation and sequencing was included in Online-only Tables 1-3.
Short-read sequencing. cDNA was then sheared randomly by Bioruptor Pico sonication device (Diagenode) for Illumina library preparation protocol including DNA fragmentation, end repairing, 3′ ends A-tailing, adapter ligation, PCR amplification and library validation. After library preparation, PerkinElmer LabChip ® GX Touch and Step OnePlus ™ Real-Time PCR System were introduced for library quality inspection. Qualified single-cell or P100 libraries were then loaded on Illumina Hiseq X Ten platform for PE150 sequencing. The average sequencing depth was 23.3 million reads (SD = 3.41) for P100 and 4.2 million reads (SD = 1.1) for single cells. Sequencing library construction and sequencing were done in Annoroad Gene Technology (Beijing, China).
PacBio sequencing. The full-length cDNA libraries were generated as described above, 5 µg of full-length cDNA were used for size selection using the BluePippin ™ Size Selection System (Sage Science, Beverly, MA, USA). SMRTbell library was constructed using 1 μg size-selected (above 4 kb) cDNA with the Pacific Biosciences SMRTbell template prep kit. The binding of SMRTbell templates to polymerases was conducted using the Sequel II Binding Kit, and then primer annealing was performed. Sequencing was carried out on the Pacific Bioscience (PacBio) Sequel II platform. Sequencing library construction and sequencing were done in Annoroad Gene Technology (Beijing, China).
Oxford Nanopore Technologies cDNA sequencing. Similarly, the full-length cDNA was generated using the Smart-seq2 protocol as described above. About 250 ng cDNA were subjected to end-repair and dA-tailing ligation using NEBNext FFPE DNA Repair Mix (NEB, Cat No. M6630L) and NEBNext Ultra II End Repair/ dA-Tailing Module (NEB, Cat No. E7645). PromethION library preparation was performed according to the manufacturer's instructions (Oxford Nanopore Technologies, SQK-LSK109 Kit, EXP-NBD104 and EXP-NBD114). Sequencing was done on the Promethon p48 platform. Sequencing library construction and sequencing were done in Biomarker Technologies Co., Ltd (Beijing, China).
Sequencing data analysis. All programs used default parameters unless stated otherwise in the corresponding section.
Single-cell short-read sequencing. Single-cell analysis was carried out using the Seurat R package (v3.2) 8 . We filtered cells using the following criteria: (a) feature counts below 500 or above 8000, (b) UMI counts below 70,000 or over 5,000,000, (c) identified as doublets using DoubletFinder (v2.0.3) 9 with the parameter of "PCs = 1:6, pN = 0.25, pK = 0.09, nExp = 24". A total of 109 LT-HSCs, 98 ST-HSCs and 107 MPPs were kept for further analyses. The quality statistics were provided in Online-only Table 1. We used principal component analysis (PCA) with variable genes as input and identified the top 6 significant PCs that were used as input for tSNE (t-distributed stochastic neighbor embedding). Cell markers from previously published studies 10,11 were used to verify the cell types. Differentially expressed genes in the cell types were determined using the FindAllMarkers function in Seurat R package (v3.2) 8 .
Gene body coverage, distribution of aligned reads over genome feature and RNA integrity at transcript-level were calculated using RSeQC 12 . The RNA integrity at the transcript-level was evaluated using the Transcript Integrity Number (TIN) algorithm 14 . These quality statistics were provided in Online-only Table 2. Gene expression was quantified using function summarizeOverlaps from GenomicAligments (v1.24.0) 15 with parameters of www.nature.com/scientificdata www.nature.com/scientificdata/ "mode = "Union", singleEnd = FALSE, ignore.strand = TRUE, fragments = TRUE". The read counts were then normalized by function rlog with the parameter of "blind = FALSE" in DESeq2 (v1.28.1) 16 . Principal component analysis (PCA) and sample distances were calculated using FactoMineR (v2.4) 17  www.nature.com/scientificdata www.nature.com/scientificdata/ respectively. The batch effect (provided in Online-only Table 2) was corrected using ComBat from the sva package (v3.36.0) 18 .
Bulk long-read sequencing. The alignment was performed using Minimap2 (v2.17-r974-dirty) 19 with the parameter of "-ax splice". The number of reads, read length and quality were analyzed using NanoComp software (v1.33.1) 20 with the parameter of "-raw-store-tsv_stats" and visualized using R package ggplot2 (v3.3.2) 21 after filtering the reads with length below 200 or over 150,000 bp. These quality statistics were provided in Online-only Table 3.
Stringtie2 (v2.1.7) was used to assemble and quantify both short and long reads with or without a reference guiding. The parameter of the assembly process was "-p 20 -e -G -A", and "-L" for long reads. The read counts of genes were extracted with a python script (prepDE.py3) with "-l 600" and normalized by function rlog with the parameter of "blind = FALSE" in DESeq2 (v1.28.1) 16 . The batch effects of gene expression from bulk Illumina, PacBio and Oxford Nanopore Technologies sequencing were corrected using ComBat from the sva package (v3.36.0) 18 . Principal component analysis (PCA) was calculated using plotPCA function in DESeq2 (v1.28.1) 16 . Correlation was calculated by function cor from R package stats (v4.0.2) with the parameter of "method = pearson". GffCompare (v0.12.1) 22 was used to compare and evaluate the accuracy of Stingtie2 23 transcript assembly. SUPPA2 (v2.3) 24 was employed to classify the AS events with the parameter of "-f ioe -e SE SS MX RI FL". Seven AS types were identified, including skipping exon (SE), retained intron (RI), alternative 5′ splice site (A5), alternative 3′ splice site (A3), mutually exclusive exons (MX), alternative first exon (AF), in which alternative first-exon use results in mRNA isoforms with distinct 5′ UTRs, and alternative last exon (AL), in which alternative use of multiple polyadenylation sites results in distinct terminal exons.

Data Records
The raw data were deposited at NCBI BioProject under accession number PRJNA706066 28  www.nature.com/scientificdata www.nature.com/scientificdata/

technical Validation
Quality control for short-read illumina sequencing data. Read quality was assessed using FastQC, the number of total and clean reads, percentage of unique mapped reads, the mapped read length, Q30 and TIN (Transcript Integrity Number) statistics were provided in Online-only Tables 1-2. TIN indicates the RNA integrity at transcript-level 14 . The bulk RNA-seq samples generally have higher TIN scores than the single-cell samples. All single-cell samples had median TIN scores ranging from 24.8 to 51.8, with an average value of 40.7. Similar ranges of TIN were observed for all cell types (Online-only Table 1). Whereas the bulk samples had median TIN scores ranging from 49.4 to 66.1, with an average value of 56.9 (Online-only Table 2).
There was no significant difference in the distribution of average quality score per base among the samples of different cell types for both single-cell (Fig. 2a) and bulk (Fig. 3a) samples, and the reads were distributed approximately uniformly across the gene body for both datasets (Figs. 2b, 3b), indicating the high integrity of the RNA. We further checked the gene regions where the reads were mapped to and found that all samples have significantly more reads mapped to the exonic regions, while less were mapped to the intronic regions ( Fig. 2c and Fig. 3c), in line with the observation of the previous reports 30 .
For single-cell sequencing data, we also examined the proportion of reads mapped to mitochondrial and ribosomal genes (Fig. 2d). The median percentage of mitochondrial genes per cell was 0.29 and the median www.nature.com/scientificdata www.nature.com/scientificdata/ percentage of ribosomal genes per cell was 3.04. MPP had the highest average number of detected genes (Fig. 2e), significantly higher than LT-HSC (p < 0.001, T-test), while the number of UMI counts per cell were comparable amongst three cell types (Fig. 2f). UMAP graph also showed that ST-HSC lied in between LT-HSC and MPP (Fig. 2g). The differentially expressed genes were analyzed between the cell types. There were 62, 63 and 266 differentially expressed genes (FDR < 0.05, log2(fold change) > = 1.5) in LT-HSC, ST-HSC and MPP, respectively. Furthermore, several known HSC signatures, including Mpl, c-Myc, Mllt3, Gata2, showed significantly higher expression in LT-HSC (Fig. 2h).
For bulk RNA sequencing, principal component analysis showed the three cell types separate on PC1 that accounts for 53.7% variation (Fig. 3d) after batch correction. Similar to the single-cell data, ST-HSC lied in between LT-HSC and MPP, corresponding to their biological status. Taken together, these results showed that the cells were correctly sorted and both the bulk and single-cell short-read RNA-seq data were of high quality.
Quality control and concordance for long-read sequencing data. The number of reads, read length, percentage of mapped reads and quality statistics were provided in Online-only Table 3. Nanopore sequencing data had a mean length of 1024 bp (Fig. 4a). Whereas PacBio sequencing data had the mean length of 946 bp (Fig. 4a). The quality score was higher for PacBio sequencing than nanopore sequencing (Fig. 4b), with an average of 47.57 and 10.53, respectively. The quality scores were comparable amongst all three cell types for both sequencing methods. Next, we compared the short-and long-read sequencing in terms of the precision of identifying exon and transcripts with or without reference. We found that long-read sequencing can provide relatively complete chains of exon including novel exons at transcript-level with or without reference (Fig. 4c,d), whereas short-read sequencing provides higher accuracy in defining exon junctions when a reference is available (Fig. 4d).
To assess the concordance between the replicates, we calculated the correlations of gene quantification between the short-read and long-read sequencing. The correlation coefficients were all over 0.93 (Fig. 4e), indicating high concordances among replicates. Moreover, the PCA illustrated that the samples were clustered by cell type between short-read and long-read sequencing replicates after batch correction (Fig. 4f). These results suggested that long-read sequencing data were of high quality, and the biological replicates are of high concordance. Moreover, the long-read sequencing enables the de novo identification and quantification of novel exon and transcripts. overall alternative splicing patterns. To investigate the overall alternative splicing patterns using this long-read dataset, we first identified the alternative splicing events and types using SUPPA2 24 . Interestingly, the most prevalent type of alternative splicing was retained intron (RI) in all the cell types, followed by skipped exon (SE) and alternative 3′ or 5′ splice sites (Fig. 5a). Next, we revealed that more than 21,762 alternative splicing events were cell-type specific (Fig. 5b). The SE was the most-shared alternative splicing type among cell types (Fig. 5c), followed by RI and alternative 3′ or 5′ splice sites. These results suggested that the long-read sequencing facilitated the identification of large numbers of cell-specific or shared alternative splicing events that may be of potential function in hematopoiesis.   www.nature.com/scientificdata www.nature.com/scientificdata/ Alternative splicing isoforms identification and quantification. To further confirm the quality of long-reads in identifying alternative splicing isoforms, we visualized the transcripts of known LT-HSC maker, c-Myc and Gata2 (Fig. 2h) in the three cell types using nanopore and PacBio sequencing data. We screened all the reads mapped to the genes c-Myc and Gata2 and their annotated transcripts. For c-Myc, we found 2915 reads, and LT-HSC had the highest number of reads (Fig. 6a). We visualized reads of full-length transcripts both in nanopore and PacBio sequencing and found that all annotated isoforms were identified (Fig. 6c). A 5′ alternative start site was found in the first exon of c-Myc. We used the short-read sequencing data to quantify the percentage of spliced in (PSI) for this locus and found that the longer isoform has a similar PSI (percent of splice-in) in all three cell types with ST-HSC had a slightly higher proportion of reads containing the longer isoform of the first exon (Fig. 6e). For Gata2, we found 595 reads, and similarly, LT-HSC had nearly 20 times more reads than the other two cell types (Fig. 6b). By comparing the full-length reads and annotated transcripts, we found an unannotated in Ensembl transcript with intron retention in LT-HSC (Fig. 6d). We further verified this intron retention using short-read sequencing data and quantified PSI for this intron with the highest PSI in LT-HSC (Fig. 6f). Next, we showed an example of integrating long-read sequencing with the single-cell RNA-seq based on Smart-seq2. We identified an alternative 5′ splicing site (A5) with a 24-bp alternative-spliced region in Mpl from our long-read sequencing data. Using our Smart-seq2 data, we identified that this event is differentially spliced with a decreasing PSI from LT-HSC (0.47), ST-HSC (0.45) to MPP (0.36) (Fig. 6g-j). The heterogeneity among single cells can be visualized using the Smart-seq2 data (Fig. 6h), and the involved long-and short-splicing junction (SJ) were significantly down-regulated during the hematopoiesis (Fig. 6i,j). These results indicated that integrating short-and long-read sequencing facilitates the identification of differentially regulated isoforms.

Usage Notes
The bulk short-read RNA-seq can be used to quantify gene expression and alternative exon usage with high accuracy in various tissues or cell types. Single-cell RNA sequencing is powerful in revealing heterogeneity in gene expression within cell types. Full-length RNA-seq protocols such as Smart-seq2 also enable measurements of heterogenicity in alternative splicing. However, the assembly of transcripts using short-read sequencing remains difficult.
Our dataset provides a unique opportunity in revealing transcript diversities in the HSC population. By integrating the short-and long-read bulk sequencing datasets, one can better identify and quantify (novel) alternative splicing isoforms as shown above. The scRNA-seq data can further provide information on how these transcripts varied within different HSC cell types. Furthermore, this dataset is useful in developing statistical models to reconstruct the isoforms and enables further investigation of the largely unexplored post-transcriptional regulations such as alternative splicing and RNA editing at single-cell levels.
Gene expression and alternative splicing are regulated in gender-, age-and strain-specific manners, we generated data from female adult C57BL/6 J mice between 8-9 weeks in this study. Thus, users should take into consideration the gender, development stage and mouse strain when concluding. We expect that the accumulation of data will provide a more complete transcriptional and post-transcriptional landscape of hematopoiesis.

Code availability
The codes used in this article were deposited in https://github.com/LuChenLab/hemato.