Introduction

Quantification and comparison of transcript expression are essential to understanding the role of RNA in different physiological conditions or developmental stages. Such experiments and analyses are widely used in the studies of molecular biology. Over the past decades, several biological technologies have been developed to quantify the abundance of transcripts, such as expression microarray1 and high-throughput RNA sequencing (RNA-Seq)2. For organisms with sufficient genomic information, the design of microarray provides a high throughput and cost-effective solution to examine transcript expression. On the other hand, RNA-Seq is superior in delivering lower background signals and larger dynamic ranges3. Despite the fact that many genome sequencing projects have been carried out, such as Genome10K4, 5000 arthropod genomes initiative (i5K)5 and Bird10K6, whole genome studies are still demanding efforts for many research groups. For non-model organisms, the expression microarray needs to rely on cross-species hybridization7. On the contrary, RNA-Seq is more suitable owing to its capability of detecting novel transcripts without additional genomic information3.

When the reference genome and transcriptome are not available, RNA-Seq reads are first used to reconstruct the transcriptome8,9. Nowadays, many programs have been developed for de novo transcriptome assembly, such as Oases10, rnaSPAdes11, SOAPdenovo-Trans12, Trans-ABySS13 and Trinity14. After transcriptome sequences are reconstructed, quantification methods including BitSeq15, Kallisto16, RSEM17 and Salmon18 can be applied. These methods are able to inference the abundance of expression without the need of genomic sequences, using the number of RNA-Seq reads that overlap with the assembled contigs9. Nevertheless, quantification is much more challenging without reliable reference sequences because of the erroneous contigs produced by the assemblers, which often result from sequencing errors, insufficient sequencing depth and biological variability19. To address these problems, a great number of comparative studies have been published recently. While many studies evaluated transcriptome assembly19,20,21 or quantification22,23 programs independently, few have discussed how transcriptome assembly influences the downstream quantification analysis. In 2013, Vijay, N., et al. performed an in silico assessment of RNA-Seq experiments. That study examined the impact of various aspects of sequencing reads on transcriptome assembly and differentially expressed genes (DEG) analysis7, but the effect of redundant contigs and multiple-mapping reads on quantification was not well discussed. Another study conducted by Wang, S. and M. Gribskov evaluated the quality of assembled contigs and their effects on DEG analysis24. However, their study mainly focused on the evaluation of entire workflow from assembly, quantification, to DEG analysis, which makes it obscure to unravel how the erroneous contigs affect the authenticity of downstream analysis. In addition, some studies investigated the reliability of quantification algorithms by utilizing the information regarding splicing junctions. For example, in 2019, Soneson, C., et al. devised a junction coverage compatibility (JCC) score, which compares the observed and predicted counts of junction spanning reads to quantify the reliability of transcript quantification25. Afterwards, Cong Ma, et al.26 used the deviation of the observed read counts from the expectation of quantification model on the whole transcripts to identify anomalies and adjust the estimation of abundance accordingly. It should be noticed that the implementation of these ideas requires proper annotation of the genome, such as the splicing junctions and coordinates of untranslated regions (UTR) for the transcripts. In this regard, it remains challenging to analyze the quantification reliability of the assembled contigs generated from de novo assembly without proper annotation of the genome.

In this study, we used both in silico simulated and experimental RNA-Seq data from three species (yeast, dog, and mouse). The reads were assembled using three state-of-the-art assemblers, namely rnaSPAdes, Trinity and Trans-ABySS. After that, the assembled contigs were evaluated based on TransRate19 scores, which were previously proposed to assess the quality of de novo transcriptome assemblies using the alignments of sequencing reads to the assembled sequences. After de novo assembly, the reference transcripts were assigned to assembled contigs according to the BLASTn27 alignments. Each transcript-contig alignment was then categorized based on accuracy, recovery and sequence ambiguity. Subsequently, we thoroughly examined the impact of erroneous contigs on the quantifiers Kallisto, RSEM and Salmon. By exploring the interplay between each stage in RNA-Seq analysis workflow, this study provides valuable insights into conducting RNA-Seq analysis and we anticipate these discoveries would be useful in the future development of assembly or quantification algorithms.

Materials and Methods

Datasets

Three experimental and three simulated RNA-Seq datasets were used in this study. Both experimental and simulated data included three species: yeast (Saccharomyces cerevisiae), dog (Canis lupus familiaris) and mouse (Mus musculus). The experimental datasets were collected from the Sequence Read Archive (SRA). The yeast dataset (SRR453566) was from the study of Nookaew et al.28, comprising 5.5 million non-stranded paired-end reads cultivated under the batch condition. The dog dataset (SRR882109) was produced by Liu, et al.29, comprising 20.8 million non-stranded paired-end reads sampled from normal mammary gland tissues of domestic dogs. Finally, the mouse dataset (SRR203276) was collected from the study of Grabherr, et al.14, containing 43.4 million stranded paired-end reads extracted from dendritic cells. For the simulated datasets, Flux Simulator (ver. 1.2.1)30 was adopted to synthesize RNA reads for yeast, dog and mouse, respectively, based on the genomic sequences and annotations from the Ensembl database31. To facilitate the analysis, only the transcripts annotated as messenger RNA (mRNA) and with over 500 nucleotides in length were extracted. The parameters used for the simulation are shown in Supplementary File 1: Table S1. In total, 81.7 million non-stranded paired-end reads were generated for the simulated dataset. The quality of both experimental and simulated datasets was examined using FastQC (ver. 0.11.5)32 and the low-quality subsequences of the reads were trimmed using Trimmomatic (ver. 0.36)33 with parameters SLIDINGWINDOW:4:20 MINLEN:30 (this setting increased the threshold of sequencing quality and retained more RNA reads when compared to the default parameters). The resultant RNA reads that were unable to maintain the paired relation were discarded. The detailed information of the processed RNA reads is provided in Supplementary File 1: Table S2 (The mean and standard deviation of the insert sizes were estimated based on the alignments produced by Burrows-Wheeler Aligner34).

For some projects that were originally designed to first assemble a qualified reference transcriptome, the sequencing depth is usually higher than that adopted in transcriptome quantification projects. To ensure the conclusions drawn in this study are consistent across different sequencing depths, we created additional datasets with a higher sequencing depth. For yeast, we adopted two additional datasets SRR453567 and SRR453568, which are the biological replicates of the yeast data we used (SRR453566), to create a new dataset with a high-sequencing depth, denoted as the experimental (H) yeast dataset. As for the experimental (H) dog dataset, we adopted another dataset SRR882105, which has a higher sequencing depth in the same research of the experimental dog dataset (SRR882109). Similarly, we used another parameter profiles (Supplementary File 1: Table S1) to generate synthetic RNA-Seq datasets with a higher sequencing depth, denoted as simulated (H) datasets, which consist of 185.7 million non-stranded paired-end reads. The detailed information of the simulated (H) and experimental (H) datasets is provided in Supplementary File 1: Table S2.

Expression metrics

In order to evaluate the performance of transcript quantification, the ground truth of expression abundance for each transcript must be first determined. For simulated datasets, the number of the generated RNA reads for each transcript was recorded during the simulation process. Since transcriptome assemblers sometimes generate duplicated, incomplete or over-extended contigs, the metrics we use for quantifying expression must consider the normalization with respect to both sequence length and the number of total nucleotides. In this regard, the number of generated RNA reads was transformed into a simplified version of Transcripts per Million (TPM)35 using the Eq. (1):

$${\rm{Ground}}\,{\rm{Truth}}\,{{\rm{TPM}}}_{i}=\frac{{f}_{i}/{l}_{i}}{{\sum }_{k=1}^{n}{f}_{k}/{l}_{k}}\times {{\rm{10}}}^{6},$$
(1)

where n is the total number of transcripts, fi is the number of RNA reads generated from transcript i and li is the effective length36 of transcript i. In contrast, because the ground truth abundance for each RNA molecule is unknown for experimental datasets, we calculated the average TPM inferred by Kallisto (ver. 0.43.0), RSEM (ver. 1.2.31) (default parameters) and Salmon (ver. 0.8.2) for the reference transcript as the ground truth expression when evaluating the abundance of an assembled contig. Although the estimated expression might not perfectly reflect the real number of RNA molecules in a biological sample, it still provides valuable information when comparing the performance of quantification before and after de novo transcriptome assembly.

de novo transcriptome assembly and quantification

The processed RNA-Seq reads were assembled into contigs using the following three programs: (1) rnaSPAdes (ver. 3.11.1), (2) Trans-ABySS (ver. 1.5.5) along with ABySS (ver. 1.5.2)37 and (3) Trinity (ver. 2.4.0) with default parameters. To minimize the effect of fragmented contigs, only the contigs with over 500 nucleotides in length were kept for the quantification analysis. The assemblies were evaluated based on the length of contigs, the number of recovered transcripts, the number of erroneous contigs and the evaluation scores provided by TransRate. The TransRate scores that we used in this study are the score of bases covered, score of good mapping, score of not segmented and overall score. The score of bases covered represents the proportion of nucleotide bases in a contig that are covered by reads. The score of good mapping represents the proportion of read pairs of which both reads are aligned in the correct orientation on a single contig. The score of not segmented represents the proportion of contigs that might be a chimera of multiple transcripts. Subsequently, the expression abundance for each contig was estimated using one alignment-based and two alignment-free quantifiers, namely (1) Bowtie2 (ver. 2.3.0)38 (–dpad 0–gbar 99999999–mp 1,1–np 1–score-min L,0,-0,1 -k 200–sensitive–no-mixed–no-discordant) followed by RSEM (ver. 1.2.31) (default parameters); (2) Kallisto (ver. 0.43.0) (indexing with -k 31 and quantifying with default parameters); and (3) Salmon (ver. 0.8.2) (indexing with -k 31 and quantifying with default parameters).

Transcript assignment

For the purpose of comparing the estimated abundance of contigs with the ground truth expression from the corresponding transcripts, we assigned the reference transcripts (cDNA) to assembled contigs based on BLASTn (2.5.0+)27 alignments. Here, only the high scoring pairs (HSPs) with identity over 70% and E-value smaller than 1E−5 were considered. We integrated the remained HSPs onto the coordinates of both transcript and contig to obtain the global alignment. Similar to a previous study7, we calculated the recovery and accuracy for each global alignment, which refer to the proportion of matched nucleotides on the transcript and the proportion of correctly matched nucleotides on the contig respectively (Supplementary File 2: Fig. S1). Furthermore, we defined the overall alignment score as \(\sqrt{{recovery}\,\times \,{accuracy}}\). A transcript is assigned to a contig if either accuracy or recovery of the global alignment between them is above 90%. In this manner, we were able to identify all the corresponding transcripts for each contig. Note that it is possible that a contig can be associated with multiple transcripts, and a transcript can assign to multiple contigs as well. We considered multiple assignments here in order to understand the impact of redundant sequences on the quantification. Once the transcripts have been assigned to the contigs, we used Eq. (2) to calculate the relative error of expression, in order to evaluate the quality of transcript quantification for each transcript-contig pair:

$${\rm{Relative}}\,{{\rm{Error}}}_{(i{,}{j})}=\frac{{TP}{{M}}_{i}^{{est}}-{TP}{{M}}_{{j}}^{{g}{.t}}}{{TP}{{M}}_{{i}}^{{est}}+{TP}{{M}}_{{j}}^{{g}{.t}}}\times \mathrm{100} \% ,$$
(2)

where the \({TP}{{M}}_{i}^{{est}}\) is the expression estimated by quantifiers for contigi, and the \({TP}{{M}}_{j}^{{g}{.t}}\) is the ground truth expression for transcriptj given (contigi, transcriptj) is a valid global alignment (either accuracy or recovery of the global alignment between them is above 90%).

Sequence ambiguity

To determine the origin of the RNA-Seq reads that can be mapped to multiple transcripts is an important issue for the development of quantification algorithms. In this regard, it is of interest to understand the impact of sequence ambiguity on transcript quantification. We performed pairwise sequence alignment on both transcripts and contigs using BLASTn, respectively. Here, only the HSPs with identity over 70% and E-value smaller than 1E−5 were considered as potential ambiguity. In addition, to better explicate the relation between sequences that share similar subsequences, we build a connected component graph, where two sequences were grouped into the same connected component if the proportion of identical nucleotides between them is over 90% of the either sequence (Fig. 1). The size of a connected component is defined as the number of sequence members inside. We call the sequences in a connected component which containing only one sequence as unique sequence. Furthermore, we used the read proportion of estimated abundance (RPEA) of a contig in a connected component to investigate the behavior of quantifier while ambiguous sequences are presented. Given n contigs \({c}_{1},{c}_{2},\ldots {c}_{i}\ldots {c}_{n}\) in the same connected component C, the RPEA score for contig i is defined as follow:

$${{\rm{RPEA}}}_{i}=\frac{{\rm{Allocated}}\,{\rm{Read}}\,{\rm{Count}}\,{\rm{for}}\,{\rm{Contig}}\,{c}_{i}}{{\sum }_{j=1}^{n}{\rm{Allocated}}\,{\rm{Read}}\,{\rm{Count}}\,{\rm{for}}\,{\rm{Contig}}\,{c}_{j}}.$$
(3)
Figure 1
figure 1

Construction of Ambiguity Network. The diagram illustrates how pairwise alignments in a contig set are employed to construct ambiguity networks. The ambiguity network is first initialized by given the contig sequences, creating a single cluster for each sequence. By analyzing the global alignments between contigs (the blue dot lines), the cluster in the network expands by joining two contig clusters at a time if the alignment length between the sequence is over 90% of the length for either sequence. In this study, the ambiguity network can be constructed for both contig and transcript sets. For the purpose of simplicity, we only illustrated the scenario for contigs in this figure.

If the highest RPEA in the connected component is close to 1, it suggests that the quantifier allocates all the reads in the connected component to one specific contig. In contrast, if the highest RPEA in the connected component is close to 1/n, then it suggests that the quantifier tends to allocate the reads evenly in the connected component.

Contig categories

The assembled contigs are categorized into five particular categories in this study: (1) full-length, (2) incompleteness, (3) over-extension, (4) family-collapse and (5) duplication (Fig. 2). The contigs identified as incompleteness, over-extension, family-collapse and duplication are called erroneous contigs throughout this study. The analysis of the first three categories were not affected by the factor of sequence ambiguity, allowing us to investigate the impact of assembly completeness on quantification independently. Given the length of contig lc and the length of the corresponding transcript lt, the assembly completeness of a contig was examined through the difference in length:

$${\rm{Difference}}\,{\rm{in}}\,{{\rm{Length}}}_{(c,t)}=\frac{{l}_{c}-{l}_{t}}{{l}_{c}+{l}_{t}}\times \mathrm{100} \% {\rm{.}}$$
(4)
Figure 2
figure 2

Examples of Contig Categories. The diagram gives examples for each contig category we analyzed throughout this study. The middle column shows an example for each category, while the right column portrays the relation of contigs and transcripts in network representation. The sequence nodes are connected together in solid line if they are in the same ambiguity cluster. On the other hand, a blue dot arrow represents the transcript assignment for the contigs. The analysis of contigs labeled as full-length, incompleteness and over-extended exclude the factor of sequence ambiguity. In contrast, family-collapse and duplication remove the potential effect of assembly completeness, focusing only on the impact of redundant or duplicated sequences.

In contrast, family-collapse and duplication focused on the contigs that completely recovered the transcripts (recovery ≥ 90) but considered the influence of sequence ambiguity. To be more specific, family-collapse represents contigs which are assigned with multiple transcripts and duplication stands for the multiple contigs assigned by a single transcript. By examining these contigs, the problems caused by the assemblers that fail to distinguish similar transcripts from each other or generate a large number of redundant contigs were investigated. The detailed definitions for contig categories are provided in Supplementary File 1: Table S3.

Release package

We proposed an open-source Python based package QuantEval that builds connected components for the assembled contigs based on sequence similarity and evaluates the quantification results for each connected component. The package can be downloaded from https://github.com/dn070017/QuantEval.

Results

de novo transcriptome assembly

Based on pairwise BLASTn, yeast has the simplest transcriptome, with 94.11% of the transcripts sharing no similar subsequences with others. We call these sequences unique transcripts in the transcriptome. On the other hand, 66.29% of the dog transcripts are unique, while only 28.45% of the mouse transcripts are unique (Supplementary File 1: Table S4). First, we performed transcriptome assembly on the RNA reads of the three species. We adopted three assemblers, rnaSPAdes, Trans-ABySS and Trinity, to construct contigs for both experimental and simulated RNA-Seq reads. Next, we built connected components for the assembled contigs. The statistics of sequence length and sequence ambiguity for the assembled contigs are shown in Supplementary File 1: Table S5, while the numbers of contigs that were categorized in each contig category are shown in Supplementary File 1: Table S6. The proportion of recovered transcripts (recovery ≥ 90) and accurate contigs (accuracy ≥ 90) is shown in Supplementary File 2: Fig. S2. Here, we conducted InterPro gene family enrichment analysis on the two groups of transcripts for both simulated and experimental datasets: “recovered (recovery ≥ 90%)” and “not recovered (recovery < 90%)”, using DAVID functional analysis tool (Fisher’s Exact test)39. The results can be found in Supplementary Materials 3 (18 data sheets). The enriched gene families that are recovered are not exactly the same across different datasets. There is no enriched gene family in the recovered yeast dataset. The Trinity and Trans-ABySS assemblies for the dog dataset reported “IPR012677:Nucleotide-binding, alpha-beta plait” and “IPR000504:RNA recognition motif domain” for high recovery transcripts. However, the rnaSPAdes reported multiple gene families related to WD40 repeat for the dog dataset. For the mouse dataset, three assemblers reported similar gene families enriched in the gene set with high recovery. For instance, IPR000504:RNA recognition motif domain, IPR000719:Protein kinase, catalytic domain, IPR001650:Helicase, C-terminal, IPR001680:WD40 repeat, IPR001683:Phox homologous domain, IPR001841:Zinc finger, RING-type and etc. This suggested that the assemblers performed similarly when reconstructing transcripts in these gene families. Finally, the TransRate scores for each assembly are shown in Supplementary File 2: Fig. S3.

In general, rnaSPAdes constructed the least amount of contigs with the highest overall TransRate score for most of the datasets. As shown in Supplementary File 1: Table S5, rnaSPAdes also delivered the lowest average size of the connected components across species, suggesting that the contigs generated by rnaSPAdes are less redundant when compared to those from the other two assemblers. Trinity outperformed other assemblers in terms of N50 and the proportion of the recovered transcripts in all the simulated datasets. Despite the fact that Trinity generated longer contigs, the overall TransRate scores and the proportions of accurate contigs from Trinity assembly are marginally lower than the assembly constructed by rnaSPAdes. Trans-ABySS constructed the contigs with comparatively high accuracy, with 66.37% of the contigs aligned with at least one transcript that show accuracy higher than 0.90. Nonetheless, Trans-ABySS constructed smaller numbers of unique and long contigs relatively. In other words, Trans-ABySS generated shorter contigs with more redundancy. We also found that Trans-ABySS generated more redundant sequences and the TransRate scores dropped significantly in the datasets with high sequencing depth. The summary of the assemblies also demonstrates that the proportion of recovered transcripts are significantly higher in the datasets of yeast than that of dog or mouse. With a lower number of unique transcript sequences, it appears to become more difficult for the assemblers to properly reconstruct the transcriptome. For the estimated abundance of assembled contigs, the estimation made by quantifiers RSEM, Kallisto and Salmon shows considerably high consistency (Supplementary File 2: Fig. S4), with both Pearson’s and Spearman’s correlation coefficients higher than 0.95 between any of the two quantifiers. The coefficient of variations of the estimated expression inferenced from these quantifiers for simulated and experimental datasets are shown in Supplementary File 1: Table S7, which also show high consistency across different quantifiers.

Impact of assembly completeness

In this study, the influence of de novo transcriptome assembly on expression quantification is mainly discussed with respect to two major issues: assembly completeness and sequence ambiguity. We would like to primarily look into the impact of assembly completeness on quantification in this section. In order to reduce the possible effect of sequence ambiguity generated by the assemblers, only the unique contigs (contigs in a connected component containing only itself) that are assigned with a single transcript were examined here. The unique contigs were further categorized into full-length, incompleteness and over-extension (see Methods for detailed definitions). The reliability of quantification was examined based on the relative error for contigs with different extent of assembly completeness (Fig. 3, Fig. S5).

Figure 3
figure 3

Quantification Errors for Unique Sequences. The box plots illustrate the relative quantification errors for unique contigs on the simulated datasets. The estimation of contig abundance is made by Kallisto based on Trinity assembly. The contigs are grouped by the extent of assembly completeness, and the numbers on the X-axis represent the lower bound of differences in length. For instance, the contigs located on −10 means that the percentage of difference in length is in the range of \([\,-\,10,\,0)\). The data is color-coded based on the contig categories. The box plots suggest that the estimation made on full-length contigs yield the smallest relative errors, while the incomplete contigs show over-estimation and over-extended contigs show under-estimation on quantification.

In summary, full-length contigs show the lowest relative error of quantification, with the medians of error smaller than ±10%. The scatter plots and correlation coefficients also suggest that the estimated abundance of full-length contigs is highly reliable, with Pearson’s and Spearman’s correlation coefficients between the estimated and ground truth abundance both larger than 0.97 in all the datasets (Figs 4, 5 and S6). The incomplete contigs yield slight over-estimation on the expression abundance. Overall, the quantification errors gradually increased as the assembly completeness decreased. This phenomenon can be observed more obviously on the dog and mouse datasets. When compared with the full-length contigs, the correlation coefficients are comparatively lower, ranging from 0.70 to 0.94 (Fig. 5). Lastly, for the category of over-extended contigs, the quantifiers underestimated the expression abundance and the correlation coefficients slightly dropped (Fig. 4). Nevertheless, the number of over-extended contigs is much fewer than those of other categories (Supplementary File 1: Table S6), which indicates that the assemblers did not overly extend the assembled contigs in most of the cases. In other words, only a limited number of contigs in the quantification will be affected in the practical RNA-Seq analysis. Although the TPM metrics has already been normalized for sequence length and total nucleotides, researchers might still need to be aware of the length bias caused by incomplete or over-extended contigs while using TPM as the metrics to estimate the expression of contigs. The results can be observed from normal datasets and datasets with higher sequencing depth.

Figure 4
figure 4

Scatter plots of Estimated Abundance and Ground Truth Expression for Unique Sequences. The scatter plots illustrate the estimated and ground truth abundance for contigs categorized as full-length, incompleteness and over-extension of the simulated dog datasets. The estimation of contig abundance is made by Kallisto based on Trinity assembly. The metrics are recorded in \(lo{g}_{2}(TPM+1)\). The data points are color-coded based on the relative quantification errors, with blue represents under-estimation and orange for over-estimation. In general, the estimation on expression for full-length contigs is highly reliable. There are some incomplete contigs with over-estimated abundance. Moreover, the correlation coefficients for the estimation of incomplete contigs are also relatively lower than that of full-length contigs. As for over-extended contigs, a marginal under-estimation in quantification can be observed.

Figure 5
figure 5

Correlation Coefficients between Estimated Abundance and Ground Truth Expression for Unique Sequences. The figures illustrate the Pearson’s and Spearman’s correlation coefficients between estimated abundance and ground truth expression. In general, the estimation based on full-length contigs have considerably high correlation with the ground truth expression of corresponding transcripts. In contrast, the incomplete and over-extended contigs show relatively lower correlation coefficients. There are significantly low correlation coefficients in the rnaSPAdes assembly based on simulated yeast data; however, due to a small number of data (n = 11), it should be careful to draw such conclusion based on this dataset.

Impact of sequence ambiguity

In this section, the impact of sequence ambiguity on quantification was thoroughly discussed. Similarly, to reduce the compound effect from assembly completeness, only the accurately assembled contigs (accuracy ≥ 90) were examined here. In the first part of this sub-section, we looked through the reliability of quantification when the assemblers report only one contig for many similar transcripts, denoted as family-collapse. In the second part, we examined the impact of contigs with similar sequence content which are assigned with the same transcript, denoted as duplication (see Methods for detailed definition). By using these contigs, we examined the behavior of the quantification algorithms while sequence ambiguity is present in the assembly.

For the contigs categorized as family-collapse, it is much more difficult to analyze the accuracy of quantification because multiple transcripts being assigned to a contig. Based on our observation, there are in average 2 to 3.16 transcripts being assigned to a family-collapse contig across the six datasets. Since there is only one contig that was assigned by multiple transcripts, we would like to find out of which transcript expression delivers the estimated abundance of the contig actually reflects. To our surprise, the estimated abundance is closer to the transcript with the highest expression rather than the one with highest alignment score. However, this might be due to the fact that the assemblers failed to differentiate the family-collapse transcripts; therefore, the quantifiers tend to allocate all the reads to the corresponding contig and estimate the contig abundance close to the sum of all family-collapse transcripts (Fig. 6, Supplementary File 2: Fig. S7). This conclusion can be drawn from both datasets with normal sequencing depth and datasets with higher sequencing depth.

Figure 6
figure 6

Scatter plots of Estimated Abundance and Ground Truth Expression for Family-Collapse Sequences. The scatter plots illustrate the estimated and ground truth abundance for contigs categorized as family-collapse of the simulated dog dataset. The estimation of contig abundance is made by Kallisto based on Trinity assembly. The metrics are recorded in \(lo{g}_{2}(TPM+1)\). The data points are color-coded based on the relative quantification errors, with blue represents under-estimation and orange for over-estimation. Since there are more than one transcript correspond to one contig, we categorized the expression of corresponding transcript into (1) transcript with the maximum alignment score with respect to the contig, (2) transcript with the highest expression in the family, and (3) total expression of connected component. In general, the estimated abundance of the contig actually reflect to the total expression of the connected component of corresponding transcripts.

In contrast with family-collapse, duplication represents the redundant contigs which are clustered into the same connected component and are assigned with a single transcript. Here, we use the maximum estimated abundance in the connected component to investigate the behavior of quantifiers (see Methods for detailed definition). We observed that quantification algorithms tend to allocate most of the RNA reads to a single contig within the connected component in most of the cases (15 among 18 of the datasets with normal sequencing depth and 11 among 15 of the (H) datasets show that over 50% of the connected component has one contig with the proportion of estimated abundance over 75%) (Supplementary File 2: Fig. S8). Furthermore, we would like to understand which estimated abundance of the contig in the connected component can accurately reflect the expression of corresponding transcript. Here, we used three approach to select the estimated abundance of the contig in the connected component: (1) the contigs with the highest alignment score, (2) the contigs with the highest RPEA and (3) the total expression of connected component of the contigs. Consequently, we found that the estimated abundance of contigs that were allocated with the most amount of RNA reads in the connected component show significantly low quantification error with the transcript expression. In the cases when the quantifiers distribute the RNA reads evenly to the duplicated contigs, the ground truth expression for the transcripts cannot be accurately represented by the estimated abundance of contigs (Fig. 7). To address this problem, it is advisable to use the total expression of the connected component for duplicated contigs to measure the expression of corresponding transcripts.

Figure 7
figure 7

Box Plots for the Relative Errors of Duplicated Contigs. The box plots illustrate the relative quantification errors for duplicated contigs of the experimental mouse dataset. The contigs are grouped by the maximum RPEA in the connected component, and the numbers on the X-axis represent the lower bound of the proportion. For instance, the contigs located on 0.45 means that the maximum RPEA of the connected component is in the range of \([0.45,\,0.50)\). Since there are more than one contigs that are assigned by the same transcript, we would like to find out which contigs’ estimated abundance can accurately reflect the expression of the transcript. Here, we categorized the quantification errors into three categories: (1) transcript is assigned to the contig with the highest alignment score, (2) transcript is assigned to the contig that are allocated with the most RNA reads and (3) transcript expression adopts the total expression of the connected component of the associated contigs. The box plots suggest that contig with the highest alignment score or the highest estimation made within the connected component have considerably lower quantification errors if most of the reads are assigned to one specific contig (higher maximum RPEA). However, when the quantifiers allocate the RNA reads evenly to the contigs within the connected component, it is advisable to use the total expression of the connected component instead in order to get the accurate estimation for the expression of transcripts.

Discussions

Based on the observations discussed in the previous section, we first found that the incomplete and over-extended contigs has unreliable estimation of transcript abundance. This observation is similar to the research conducted by Runxuan Zhang et al.40, which suggested that variations in the length of 5′ and/or 3′ UTR considered in transcript quantification often affect accuracy of abundance estimation. In addition, we discovered that once the assemblers failed to distinguish the RNA reads generated from similar transcripts and reported a single merged contig, the estimated abundance of family-collapse contig often reflect the total expression of the collapsed transcripts, and thus is usually close to the transcript generating the most amount of RNA reads. In contrast, to estimate the expression of the transcript associated with duplicated contigs, researchers are suggested to use the total abundance of the contigs in a connected component in order to get accurate estimation. Nevertheless, in most of the practical RNA-Seq analysis, the information of the transcriptome sequence is not available for non-model organisms and the researchers have to rely on de novo prediction of the functional annotation to compensate the lack of information. Moreover, this approach is often biased by the annotation content in the database. Therefore, it is challenging to detect whether family-collapse contigs or duplicated contigs emerge when performing contig annotation (transcript assignment) after assembly. Since it is difficult to identify the correct annotation of these erroneous contigs, we suggest that the researchers should use the sum of estimated abundance of contigs in the same connected component to estimate the gene-level expression instead of looking into sequence level expression (Fig. 8). Despite the fact that this strategy abandons quantifying individual transcripts, many advantages emerge such as higher accuracy on the quantification, reliable read inference, robust statistical performance and clear interpretation of the data41. More importantly, this strategy does not have to deal with the problem resulting from multiple transcript assignment in contig annotation. Most of the observations on the effect of de novo transcriptome assembly on quantification in this study are consistent across datasets with different sequencing depths (Supplementary Materials 2: Fig. S9). In addition, the analysis conducted in this study is based on non-polypoid species (yeast, dog and mouse). For research conducted on polypoid species, one should be aware that the increased complexity of genome might affect the performance of both assembly and quantification algorithms.

Figure 8
figure 8

Scatter plots of Estimated Abundance and Ground Truth Expression for Component level quantification. The scatter plots illustrate the estimated and ground truth abundance for all the contigs of the simulated dog dataset. The estimation of contig abundance is made by Kallisto based on Trinity assembly. The metrics are recorded in \(lo{g}_{2}(TPM+1)\). The data points are color-coded based on the relative quantification errors, with blue represents under-estimation and orange for over-estimation. Here, we compared the performance of component level quantification and sequence level quantification for all the valid assignment of transcripts for each contig. In general, the estimated abundance for the component level quantification yields a more accurate estimation.

We argue that the difficulties discussed in this research emerge mainly because the goals for each step of the analysis is not specifically designed for de novo RNA-Seq analysis. The researchers should be aware that these programs could be biased by the problems that these programs were designed to solve. For instance, most of the assemblers are optimized to reconstruct the whole transcriptome, which sometimes leads to many false predictions on SNPs or isoforms. These artificial or incorrect contigs thereafter deteriorate the accuracy of quantifiers because most of the quantification algorithms infer the expression based on the overlapping relation between RNA-Seq reads and the given sequences (transcripts or contigs). Furthermore, the annotation step is performed based on sequence alignment without considering the abundance of expression. This observation demonstrates that the selection of the programs has a strong impact on the de novo RNA-Seq analysis, especially for the selection of transcriptome assemblers. For instance, to minimize the effect of assembly completeness on quantification, assemblers that construct the sequences with appropriate length are preferable. On the other hand, the assemblers that report fewer false predictions for SNPs or isoforms might reduce the problem of duplicated contigs in contig annotation resulting from sequence ambiguity. The performance for the quantifiers we analyzed throughout this study demonstrates high consistency in terms of accuracy of quantification. Therefore, we recommended to use quantifiers such as Kallisto and Salmon for significantly lower computational time16,18. The selection of programs for contig annotation is relatively irrelevant because the bottleneck mainly lies in the insufficient transcript information. Since there are no reference transcripts available in the practical de novo RNA-Seq analysis, most of the research utilizes the protein sequences from closely related species because of comparatively higher conservation in protein sequences. However, to align the contigs with proteins of other species might reduce the precision and accuracy of sequence alignment, which makes it more difficult to find the correct annotation for the assembled contigs.

There are many other factors that might as well influence the quantification quality in the practical de novo RNA-Seq analysis. For instance, read length, fragment size, strand specificity, sequence specific bias and positional bias. The read length or the fragment size directly determine the maximum length of the nucleotides that overlap with the reference sequences for each read pair. Therefore, the number of RNA-Seq reads that can be aligned to multiple origins of the transcripts reduces when longer reads are adopted, which mitigates the problem of sequence ambiguity on the inference of the origin of the reads35,36. The strand specificity provides the information for the strand of the RNA-Seq reads, which is expected to improve the precision of sequence alignment and quantification42. Last but not least, the sequence-specific and positional bias derived from library construction might lead to RNA-Seq reads that over- or under-represent the number of transcripts in the molecules. Therefore, it is important to model the fragment bias in the process of quantification in the practical RNA-Seq analysis43.

Lastly, we would like to highlight a sequencing technology that might provide another new perspective and mitigate the problems in RNA-Seq analysis: long-read RNA-Seq using the third-generation sequencing technology. Long-read RNA sequencing generates a single read for each mRNA molecule in real-time, which results in considerably longer RNA-Seq data that allows the full-length reconstruction for transcripts without the need of assembly9. Although the demanding cost for sequencing in higher coverage makes it hard to be considered for quantification at this moment, this technology still provides an extraordinary breakthrough for identifying the transcriptome in non-model organisms44. If the research expenditure is sufficient, we recommend using the long-read RNA-Seq reads for the identification of the novel transcriptome.

Conclusion

While most of the related studies focused on optimizing the quantification or assembly algorithms independently, few studies have discussed how the erroneous contigs generated by the assemblers affect the downstream analysis of RNA-Seq. In this study, we examined the impact of assembly completeness and sequence ambiguity. We comparatively evaluated the performance of rnaSPAdes, Trans-ABySS and Trinity for de novo transcriptome assembly under three transcriptomes with different complexities. All of the selected assemblers showed a lower proportion of the fully-reconstructed transcripts as the number of unique sequences in the transcriptome decreases. In general, rnaSPAdes constructed the least number of contigs with the highest TransRate score, Trinity produced longer contigs, and Trans-ABySS generated the contigs with higher accuracy. As for quantification, we measured the reliability of RSEM, Kallisto and Salmon. The estimation made by three algorithms shows marginal differences. For each erroneous contig, the incomplete or over-extended contigs lead to unreliable estimation of the abundance of contigs. Moreover, we have found that if the redundant contigs are presented in the assembly, the quantifiers tended to allocate the RNA-Seq reads to one of the duplicated contig. However, in rare cases, the quantifiers distributed the reads evenly to the contigs that share similar sequence content. On the contrary, the quantifiers tended to over-estimate the contigs that were assigned with multiple transcripts since the assemblers failed to distinguish the difference of these transcripts and reported only a single contig. To circumvent these issues, it is advisable to estimate the abundance on component-level rather than for individual transcript. By exploring how these factors deteriorate the reliability of de novo RNA-Seq analysis, we provided valuable insights for the interplay between transcriptome assembly, quantification and sequence annotation. We anticipated these discoveries will be useful in the future development of assembly or quantification programs.