Background & Summary

Hematopoiesis starts with a population of self-renewing hematopoietic stem cells (HSCs), which produce a series of increasingly more lineage-committed progenitor cells, ultimately giving rise to all types of mature blood cells. In the conventional model, long-term (LT) HCSs differentiates into short-term (ST) HSCs and subsequently multipotent progenitors (MPPs). Although all three populations have full multilineage differentiation capacity, they gradually lose self-renewal ability. Heterogeneity is presented in both HSC and MPP populations with distinct lineage bias1.

Both transcriptional and post-transcriptional regulations are pivotal in balancing the constitutive and low-level HSC turnover and downstream differentiation and hematopoietic reconstitution. In multicellular organisms, alternative splicing is a key post-transcriptional regulation mechanism that expands transcript diversity. Accumulating studies have revealed that alternative splicing patterns are essential during hematopoiesis. For instance, alternative splicing events specific in blood progenitors2 or megakaryocyte and erythrocyte lineage commitment3 were identified. Alternative splicing patterns in key hematopoietic regulators, such as HMGA2 was found to impact HSC molecular identity4. Moreover, aberrant AS is a hallmark of various types of cancer5, including leukemia6.

RNA sequencing (RNA-seq) that utilises short-read next-generation sequencing (NGS) or long-read sequencing (such as PacBio and Oxford Nanopore Technologies sequencing) is a powerful tool in understanding transcriptional diversity and regulation during various biological processes, including hematopoiesis2. While NGS is more reliable in expression quantification, the short reads can only provide limited information in AS events across the splicing joints. In contrast, the long-read sequencing methods offer a unique opportunity to examine alternative splicing isoforms with unprecedented accuracy in providing full-length information. Here we present comprehensive transcriptomic profiling of mouse HSC and MPP using both short-read and long-read RNA sequencing at bulk and single-cell levels.

Methods

Mice

C57BL/6 J mice were purchased from Beijing HFK Bioscience Co. Ltd (Beijing, China), housed and bred under SPF condition (Specific Pathogen Free) at Laboratory Animal Center of West China Second University Hospital and were allowed access to diet and water ad libitum. All animal experiments were carried out following the protocols approved by the Institutional Animal Care and Use Committee of West China Second University Hospital [(2018) Animal Ethics Approval No.004].

Sample collection

The sample preparation and the overall workflow are summarized in Fig. 1. Female adult mice between 8-9 weeks were anesthetized and sacrificed by cervical dislocation. Bone marrow cells were isolated from the femur and tibia through a 70 μm cell strainer (BD Falcon). Hematopoietic stem and progenitor cells (HSPCs) were first enriched using EasySep™ Mouse Hematopoietic Cell Isolation Kit (STEMCELL, Cat No. 19856) with a lineage cocktail (biotinylated-CD11b, B220, Gr-1, TER-119 and CD3e) to remove mature cells, according to the manufacturer’s instructions.

Fig. 1
figure 1

Sample collection and workflow. (a) Sample collection and sequencing methods. Bone marrow cells (BMC) were isolated, and lineage negative (LIN) cells were enriched using magnetic beads. Long-term (LT) and short-term (ST) hematopoietic stem cells (HSC) and multipotent progenitors (MPP) were sorted according to their surface makers: LT-HSC (Sca-1+c-Kit+CD34 CD135), ST-HSC (Sca-1+ c-Kit+CD34+ CD135) and MPP (Sca-1+ c-Kit+CD34+ CD135+). Single-cell or 100 cells (bulk P100) were sorted for library construction following the Smart-seq2 protocol. The cDNA libraries were used for short-read (Illumina Hiseq) or long-read (Nanopore or PacBio) sequencing. (b) Gating strategy for cell sorting. The main population was gated via FSC-A (forward scatter) and SSC-A (side scatter), and single cells were gated via FSC-A and FSC-H. FSC, forward scatter; SSC, side scatter; A, area; H, height.

To purify LT/ST-HSC and MPP, cells were sorted on a BD FACSAria SORP according to the following markers: LT-HSC [lineage (LIN)Sca-1+c-Kit+CD34CD135], ST-HSC (LINSca-1+c-Kit+CD34+CD135) and MPP (LINSca-1+c-Kit+CD34+CD135+). The following antibodies were used for staining (all purchased from BD): PerCP/Cyanine5.5 anti-CD11b Antibody (Clone M1/70, Cat No. 550993), PerCP/Cyanine5.5 anti-CD3e (Clone 145-2C11, Cat No. 551163), PerCP/Cyanine5.5 anti-TER119 (Clone TER119, Cat No. 560512), PerCP/Cyanine5.5 anti-Gr-1 (Clone RB6-8C5, Cat No. 552093), PerCP/Cyanine5.5 anti-B220 (Clone RA3-6B2, Cat No. 552771), BV421 anti-CD135 (Clone A2F10.1, Cat No. 562898), PE/Cyanine7 anti-Sca-1 (Clone D7, Cat No. 558162), APC/Cyanine7 anti-CD117 (c-Kit) (Clone 2B8, Cat No. 560185), FITC anti-CD34 (Clone RAM34, Cat No. 553733), Fixable Viability Stain 510 (BD Pharmingen, Cat No. 564406). Data were analyzed using FlowJoV10.7.1 software.

For single-cell RNA-sequencing (scRNA-seq), cells were individually sorted into 8-Strip PCR tubes containing the lysis buffer. For bulk RNA-seq, 100 cells (P100) were sorted into one PCR tube as a biological replicates. The batch information for cell sorting was included in Online-only Tables 1, 2.

Library construction and sequencing

The cDNA libraries were generated following the Smart-seq2 protocol7 and the Illumina Nextera XT DNA preparation kit was used. The batch information of cDNA generation and sequencing was included in Online-only Tables 1-3.

Short-read sequencing

cDNA was then sheared randomly by Bioruptor Pico sonication device (Diagenode) for Illumina library preparation protocol including DNA fragmentation, end repairing, 3′ ends A-tailing, adapter ligation, PCR amplification and library validation. After library preparation, PerkinElmer LabChip® GX Touch and Step OnePlus™ Real-Time PCR System were introduced for library quality inspection. Qualified single-cell or P100 libraries were then loaded on Illumina Hiseq X Ten platform for PE150 sequencing. The average sequencing depth was 23.3 million reads (SD = 3.41) for P100 and 4.2 million reads (SD = 1.1) for single cells. Sequencing library construction and sequencing were done in Annoroad Gene Technology (Beijing, China).

PacBio sequencing

The full-length cDNA libraries were generated as described above, 5 µg of full-length cDNA were used for size selection using the BluePippin™ Size Selection System (Sage Science, Beverly, MA, USA). SMRTbell library was constructed using 1 μg size-selected (above 4 kb) cDNA with the Pacific Biosciences SMRTbell template prep kit. The binding of SMRTbell templates to polymerases was conducted using the Sequel II Binding Kit, and then primer annealing was performed. Sequencing was carried out on the Pacific Bioscience (PacBio) Sequel II platform. Sequencing library construction and sequencing were done in Annoroad Gene Technology (Beijing, China).

Oxford Nanopore Technologies cDNA sequencing

Similarly, the full-length cDNA was generated using the Smart-seq2 protocol as described above. About 250 ng cDNA were subjected to end-repair and dA-tailing ligation using NEBNext FFPE DNA Repair Mix (NEB, Cat No. M6630L) and NEBNext Ultra II End Repair/dA-Tailing Module (NEB, Cat No. E7645). PromethION library preparation was performed according to the manufacturer’s instructions (Oxford Nanopore Technologies, SQK-LSK109 Kit, EXP-NBD104 and EXP-NBD114). Sequencing was done on the Promethon p48 platform. Sequencing library construction and sequencing were done in Biomarker Technologies Co., Ltd (Beijing, China).

Sequencing data analysis

All programs used default parameters unless stated otherwise in the corresponding section.

Single-cell short-read sequencing

Single-cell analysis was carried out using the Seurat R package (v3.2)8. We filtered cells using the following criteria: (a) feature counts below 500 or above 8000, (b) UMI counts below 70,000 or over 5,000,000, (c) identified as doublets using DoubletFinder (v2.0.3)9 with the parameter of “PCs = 1:6, pN = 0.25, pK = 0.09, nExp = 24”. A total of 109 LT-HSCs, 98 ST-HSCs and 107 MPPs were kept for further analyses. The quality statistics were provided in Online-only Table 1. We used principal component analysis (PCA) with variable genes as input and identified the top 6 significant PCs that were used as input for tSNE (t-distributed stochastic neighbor embedding). Cell markers from previously published studies10,11 were used to verify the cell types. Differentially expressed genes in the cell types were determined using the FindAllMarkers function in Seurat R package (v3.2)8.

Bulk short-read sequencing

The quality of the raw sequence data was analyzed using FastQC software (v0.11.8) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and RSeQC software12 (v3.0.1) (http://rseqc.sourceforge.net/). The paired-end short-reads were aligned to the mouse reference genome GRCm38 with annotation from ENSEMBL release 93 using STAR (v2.7.1a)13 with the parameters of “—outSJfilterOverhangMin 12 12 12 12 —alignSJoverhangMin 3 —alignSJDBoverhangMin 3 —chimSegmentMin 12 —chimScoreMin 2 —chimScoreSeparation 10 —chimJunctionOverhangMin 12 —outFilterMultimapNmax 1 —chimOutType Junctions SeparateSAMold”.

Gene body coverage, distribution of aligned reads over genome feature and RNA integrity at transcript-level were calculated using RSeQC12. The RNA integrity at the transcript-level was evaluated using the Transcript Integrity Number (TIN) algorithm14. These quality statistics were provided in Online-only Table 2. Gene expression was quantified using function summarizeOverlaps from GenomicAligments (v1.24.0)15 with parameters of “mode = “Union”, singleEnd = FALSE, ignore.strand = TRUE, fragments = TRUE”. The read counts were then normalized by function rlog with the parameter of “blind = FALSE” in DESeq2 (v1.28.1)16. Principal component analysis (PCA) and sample distances were calculated using FactoMineR (v2.4)17 and DESeq2 (v1.28.1)16, respectively. The batch effect (provided in Online-only Table 2) was corrected using ComBat from the sva package (v3.36.0)18.

Bulk long-read sequencing

The alignment was performed using Minimap2 (v2.17-r974-dirty)19 with the parameter of “-ax splice”. The number of reads, read length and quality were analyzed using NanoComp software (v1.33.1)20 with the parameter of “–raw–store–tsv_stats” and visualized using R package ggplot2 (v3.3.2)21 after filtering the reads with length below 200 or over 150,000 bp. These quality statistics were provided in Online-only Table 3.

Stringtie2 (v2.1.7) was used to assemble and quantify both short and long reads with or without a reference guiding. The parameter of the assembly process was “-p 20 -e -G -A”, and “-L” for long reads. The read counts of genes were extracted with a python script (prepDE.py3) with “-l 600” and normalized by function rlog with the parameter of “blind = FALSE” in DESeq2 (v1.28.1)16. The batch effects of gene expression from bulk Illumina, PacBio and Oxford Nanopore Technologies sequencing were corrected using ComBat from the sva package (v3.36.0)18. Principal component analysis (PCA) was calculated using plotPCA function in DESeq2 (v1.28.1)16. Correlation was calculated by function cor from R package stats (v4.0.2) with the parameter of “method = pearson”. GffCompare (v0.12.1)22 was used to compare and evaluate the accuracy of Stingtie223 transcript assembly.

SUPPA2 (v2.3)24 was employed to classify the AS events with the parameter of “-f ioe -e SE SS MX RI FL”. Seven AS types were identified, including skipping exon (SE), retained intron (RI), alternative 5′ splice site (A5), alternative 3′ splice site (A3), mutually exclusive exons (MX), alternative first exon (AF), in which alternative first-exon use results in mRNA isoforms with distinct 5′ UTRs, and alternative last exon (AL), in which alternative use of multiple polyadenylation sites results in distinct terminal exons.

To visualize long-reads for transcripts, the gene regions were extracted using Samtools (v1.10.2)25. Bedtools (v2.30.0)26 was used to convert all bam files to GTF format. Visualization was performed using R package ggbio (v1.36.0)27. Sashimi plots of short-read sequencing data were plotted using pysashimi (https://github.com/ygidtu/pysashimi).

Data Records

The raw data were deposited at NCBI BioProject under accession number PRJNA70606628. The accession number for individual samples were summarized in Online-only Tables 1-3. The processed files, including the quantifications of gene expression, isoform, splicing junction from both protocols were uploaded in figshare29.

Technical Validation

Quality control for short-read Illumina sequencing data

Read quality was assessed using FastQC, the number of total and clean reads, percentage of unique mapped reads, the mapped read length, Q30 and TIN (Transcript Integrity Number) statistics were provided in Online-only Tables 1-2. TIN indicates the RNA integrity at transcript-level14. The bulk RNA-seq samples generally have higher TIN scores than the single-cell samples. All single-cell samples had median TIN scores ranging from 24.8 to 51.8, with an average value of 40.7. Similar ranges of TIN were observed for all cell types (Online-only Table 1). Whereas the bulk samples had median TIN scores ranging from 49.4 to 66.1, with an average value of 56.9 (Online-only Table 2).

There was no significant difference in the distribution of average quality score per base among the samples of different cell types for both single-cell (Fig. 2a) and bulk (Fig. 3a) samples, and the reads were distributed approximately uniformly across the gene body for both datasets (Figs. 2b, 3b), indicating the high integrity of the RNA. We further checked the gene regions where the reads were mapped to and found that all samples have significantly more reads mapped to the exonic regions, while less were mapped to the intronic regions (Fig. 2c and Fig. 3c), in line with the observation of the previous reports30.

Fig. 2
figure 2

Quality check of single-cell Illumina short-read sequencing. (a) Boxplot showing average sequence quality per base for LT-HSC (red), ST-HSC (green) and MPP (blue) single-cell samples. (b) Gene body coverage of all detected genes. (c) Frequency of counts mapped in various gene regions. Bar chart showing mean and standard deviation. CDS, coding sequence; TES, transcription end site; TSS, transcription start site. (d) Violin plots showing percentage of reads that mapped to mitochondrial (orange) or ribosomal (green) transcripts. Violin plots showing the number of detected genes per cell (e) or the number of counts per barcode (f). (g) tSNE of three different cell types. (h) Heatmap showing the top 30 differentially expressed genes for each cell type (FDR < 0.05, log2(fold change) > = 1.5). The color scale indicates the expressing level.

Fig. 3
figure 3

Quality check of bulk Illumina short-read sequencing. (a) Boxplot showing average sequence quality per base for LT-HSC (red), ST-HSC (green) and MPP (blue), 5 replicates per cell type. (b) Gene body coverage of all detected genes. (c) Frequency of counts mapped in various gene regions. Bar chart showing mean and standard deviation. CDS, coding sequence; TES, transcription end site; TSS, transcription start site. (d) Principal Component Analysis (PCA) of all samples after batch correction.

For single-cell sequencing data, we also examined the proportion of reads mapped to mitochondrial and ribosomal genes (Fig. 2d). The median percentage of mitochondrial genes per cell was 0.29 and the median percentage of ribosomal genes per cell was 3.04. MPP had the highest average number of detected genes (Fig. 2e), significantly higher than LT-HSC (p < 0.001, T-test), while the number of UMI counts per cell were comparable amongst three cell types (Fig. 2f). UMAP graph also showed that ST-HSC lied in between LT-HSC and MPP (Fig. 2g). The differentially expressed genes were analyzed between the cell types. There were 62, 63 and 266 differentially expressed genes (FDR < 0.05, log2(fold change) > = 1.5) in LT-HSC, ST-HSC and MPP, respectively. Furthermore, several known HSC signatures, including Mpl, c-Myc, Mllt3, Gata2, showed significantly higher expression in LT-HSC (Fig. 2h).

For bulk RNA sequencing, principal component analysis showed the three cell types separate on PC1 that accounts for 53.7% variation (Fig. 3d) after batch correction. Similar to the single-cell data, ST-HSC lied in between LT-HSC and MPP, corresponding to their biological status. Taken together, these results showed that the cells were correctly sorted and both the bulk and single-cell short-read RNA-seq data were of high quality.

Quality control and concordance for long-read sequencing data

The number of reads, read length, percentage of mapped reads and quality statistics were provided in Online-only Table 3. Nanopore sequencing data had a mean length of 1024 bp (Fig. 4a). Whereas PacBio sequencing data had the mean length of 946 bp (Fig. 4a). The quality score was higher for PacBio sequencing than nanopore sequencing (Fig. 4b), with an average of 47.57 and 10.53, respectively. The quality scores were comparable amongst all three cell types for both sequencing methods. Next, we compared the short- and long-read sequencing in terms of the precision of identifying exon and transcripts with or without reference. We found that long-read sequencing can provide relatively complete chains of exon including novel exons at transcript-level with or without reference (Fig. 4c,d), whereas short-read sequencing provides higher accuracy in defining exon junctions when a reference is available (Fig. 4d).

Fig. 4
figure 4

Quality check of bulk long-read sequencing. (a) Boxplots showing the length of reads per sample for nanopore (left) or PacBio (right) sequencing. LT-HSC (red), ST-HSC (green) and MPP (blue). (b) Violin plots showing read quality per sample for nanopore (left) or PacBio (right) sequencing. The white dots represent the means and the dashed line indicates the quality value of 7. (c) Precision of assembly accuracy at exon and transcript levels of Stingtie2 without a reference. (d) Precision of assembly accuracy at exon and novel exons levels of Stingtie2 with a reference. (e) Correlation coefficients of gene expression among the three sequencing methods. (f) Principal Component Analysis (PCA) of all samples from bulk Illumina, PacBio and Nanopore sequencing methods after batch correction.

To assess the concordance between the replicates, we calculated the correlations of gene quantification between the short-read and long-read sequencing. The correlation coefficients were all over 0.93 (Fig. 4e), indicating high concordances among replicates. Moreover, the PCA illustrated that the samples were clustered by cell type between short-read and long-read sequencing replicates after batch correction (Fig. 4f). These results suggested that long-read sequencing data were of high quality, and the biological replicates are of high concordance. Moreover, the long-read sequencing enables the de novo identification and quantification of novel exon and transcripts.

Overall alternative splicing patterns

To investigate the overall alternative splicing patterns using this long-read dataset, we first identified the alternative splicing events and types using SUPPA224. Interestingly, the most prevalent type of alternative splicing was retained intron (RI) in all the cell types, followed by skipped exon (SE) and alternative 3′ or 5′ splice sites (Fig. 5a). Next, we revealed that more than 21,762 alternative splicing events were cell-type specific (Fig. 5b). The SE was the most-shared alternative splicing type among cell types (Fig. 5c), followed by RI and alternative 3′ or 5′ splice sites. These results suggested that the long-read sequencing facilitated the identification of large numbers of cell-specific or shared alternative splicing events that may be of potential function in hematopoiesis.

Fig. 5
figure 5

Overall patterns of alternative splicing and cell type-specific/shared events of bulk long-read sequencing. (a) Bar plot of the percentages of alternative splicing types in three cell types based on SUPPA2. (b) Bar plot showing the numbers of specific and shared alternative splicing events in three cell types. (c) Percentage of alternative splicing types for the shared alternative splicing events among cell types. SE, skipping exon; RI, retained intron; A5, alternative 5′ splice site; A3, alternative 3′ splice site; MX, mutually exclusive exons; AF, alternative first exon; AL alternative last exon.

Alternative splicing isoforms identification and quantification

To further confirm the quality of long-reads in identifying alternative splicing isoforms, we visualized the transcripts of known LT-HSC maker, c-Myc and Gata2 (Fig. 2h) in the three cell types using nanopore and PacBio sequencing data. We screened all the reads mapped to the genes c-Myc and Gata2 and their annotated transcripts. For c-Myc, we found 2915 reads, and LT-HSC had the highest number of reads (Fig. 6a). We visualized reads of full-length transcripts both in nanopore and PacBio sequencing and found that all annotated isoforms were identified (Fig. 6c). A 5′ alternative start site was found in the first exon of c-Myc. We used the short-read sequencing data to quantify the percentage of spliced in (PSI) for this locus and found that the longer isoform has a similar PSI (percent of splice-in) in all three cell types with ST-HSC had a slightly higher proportion of reads containing the longer isoform of the first exon (Fig. 6e). For Gata2, we found 595 reads, and similarly, LT-HSC had nearly 20 times more reads than the other two cell types (Fig. 6b). By comparing the full-length reads and annotated transcripts, we found an unannotated in Ensembl transcript with intron retention in LT-HSC (Fig. 6d). We further verified this intron retention using short-read sequencing data and quantified PSI for this intron with the highest PSI in LT-HSC (Fig. 6f). Next, we showed an example of integrating long-read sequencing with the single-cell RNA-seq based on Smart-seq2. We identified an alternative 5′ splicing site (A5) with a 24-bp alternative-spliced region in Mpl from our long-read sequencing data. Using our Smart-seq2 data, we identified that this event is differentially spliced with a decreasing PSI from LT-HSC (0.47), ST-HSC (0.45) to MPP (0.36) (Fig. 6g–j). The heterogeneity among single cells can be visualized using the Smart-seq2 data (Fig. 6h), and the involved long- and short-splicing junction (SJ) were significantly down-regulated during the hematopoiesis (Fig. 6i,j). These results indicated that integrating short- and long-read sequencing facilitates the identification of differentially regulated isoforms.

Fig. 6
figure 6

Alternative splicing isoforms identified using the long-read sequencing and quantified using the short-read sequencing. Sashimi plots showing long-reads mapped to c-Myc (a) or Gata2 (b) in LT-HSC (red), ST-HSC (green) and MPP (blue) by Nanopore or PacBio sequencing. The identified nearly full-length transcripts (upper panel) and annotated transcripts in Ensembl (lower panel) for c-Myc (c) and Gata2 (d) were shown. Sashimi plots showing Illumina generated short reads that are mapped to the indicated (shaded by grey) region for c-Myc (e) or Gata2 (f ). (g) Sashimi plots of Mpl gene showing the decreased PSI of an alternative splicing event in LT-HSC (red), ST-HSC (green) and MPP (blue) by summing up the reads from all the single cells of each type. The percentage of spliced-in (PSI) for the indicated regions was calculated for each cell type. (h) Heatmap illustrates that the heterogeneity of alternative splicing of Mpl among single cells within each cell type. (i,j) Violin plots showing the normalized splicing junction counts of the short junction at chr4:118455281–118455644:- (i) and the long junction at chr4:118455281–118455668:- (j) in LT-HSC (red), ST-HSC (green) and MPP (blue).

Usage Notes

The bulk short-read RNA-seq can be used to quantify gene expression and alternative exon usage with high accuracy in various tissues or cell types. Single-cell RNA sequencing is powerful in revealing heterogeneity in gene expression within cell types. Full-length RNA-seq protocols such as Smart-seq2 also enable measurements of heterogenicity in alternative splicing. However, the assembly of transcripts using short-read sequencing remains difficult.

Our dataset provides a unique opportunity in revealing transcript diversities in the HSC population. By integrating the short- and long-read bulk sequencing datasets, one can better identify and quantify (novel) alternative splicing isoforms as shown above. The scRNA-seq data can further provide information on how these transcripts varied within different HSC cell types. Furthermore, this dataset is useful in developing statistical models to reconstruct the isoforms and enables further investigation of the largely unexplored post-transcriptional regulations such as alternative splicing and RNA editing at single-cell levels.

Gene expression and alternative splicing are regulated in gender-, age- and strain-specific manners, we generated data from female adult C57BL/6 J mice between 8-9 weeks in this study. Thus, users should take into consideration the gender, development stage and mouse strain when concluding. We expect that the accumulation of data will provide a more complete transcriptional and post-transcriptional landscape of hematopoiesis.