Abstract
RNA sequencing (RNA-seq) can be used to assemble spliced isoforms, quantify expressed genes and provide a global profile of the transcriptome. However, the size and diversity of the transcriptome, the wide dynamic range in gene expression and inherent technical biases confound RNA-seq analysis. We have developed a set of spike-in RNA standards, termed 'sequins' (sequencing spike-ins), that represent full-length spliced mRNA isoforms. Sequins have an entirely artificial sequence with no homology to natural reference genomes, but they align to gene loci encoded on an artificial in silico chromosome. The combination of multiple sequins across a range of concentrations emulates alternative splicing and differential gene expression, and it provides scaling factors for normalization between samples. We demonstrate the use of sequins in RNA-seq experiments to measure sample-specific biases and determine the limits of reliable transcript assembly and quantification in accompanying human RNA samples. In addition, we have designed a complementary set of sequins that represent fusion genes arising from rearrangements of the in silico chromosome to aid in cancer diagnosis. RNA sequins provide a qualitative and quantitative reference with which to navigate the complexity of the human transcriptome.
Acknowledgements
The authors would like to thank the following funding sources: Australian National Health and Medical Research Council (NHMRC) Australia Fellowship (1062470 to T.R.M. and 1062606 to W.Y.C.). S.A.H. and I.W.D. are supported by Australian Postgraduate Award scholarships. The contents of the published material are solely the responsibility of the administering institution, a participating institution or individual authors and do not reflect the views of NHMRC. The authors would also like to thank D. Thomson and M. Smith (Garvan Institute of Medical Research) for helpful discussions during manuscript preparation.
Contributions
T.R.M. and J.S.M. conceived the project, designed sequins and in silico chromosome, and conceived experiments. W.Y.C. and S.B.A. performed experimental work. J.B. performed qRT-PCR validation. L.K.N. contributed supervision and manuscript preparation. S.A.H., T.W. and T.R.M. performed bioinformatic analyses. S.A.H., I.W.D. and T.R.M. prepared the manuscript.
Ethics declarations
Competing interests
Garvan Institute of Medical Research has filed a patent application (PCT/AU2015/050797) on some techniques described in this study.
Integrated supplementary information
Supplementary Figure 1 Schematic overview of the use of RNA sequins in an RNAseq experiment.
RNA sequins are added directly to an RNA sample prior to library preparation and sequencing (yellow). Resultant reads are aligned to a co-index comprising both the natural reference genome and in silico chromosome (blue). Due to their distinguishing artificial sequence, RNA sequins can be used for quantitative analysis and normalization (red) without interfering with the accompanying sample. RNA sequins enable an assessment of many steps during the RNAseq workflow, with notable examples indicated (dashed boxes).
Supplementary Figure 2 Design of repeat sequences within the in silico chromosome.
(a) Reference annotations indicate repeat elements encoded within the in silico chromosome (chrIS_R), including mobile repeats based on human analogs (e.g. LINEs, SINEs and LTRs) and also tandem repeat domains (e.g. telomeres and centromeres). The inclusion of repetitive elements emulates the distribution of unique sequences across the chromosome that influences alignment. (b) A cumulative frequency distribution shows the repeat density on chrIS_R (red) and human reference genome (blue). (c) Frequency distribution of k-mer sequences within chrIS_R (red) and human reference genome sequence (blue).
Supplementary Figure 3 Design of artificial gene loci that proportionately represent the human transcriptome.
Cumulative frequency histograms compare genetic features encoded on the in silico chromosome (red) to human genes (blue) according to total gene size (a), transcript length (b), exon count (c), exon size (d), intergenic distance (e), GC content (exons) (f), number of isoforms per gene (g) and intron size (h).
Supplementary Figure 4 Sequins exhibit negligible cross-alignment to natural reference genomes.
(a-d) Bar charts indicate the percentage of unique alignments (orange), ambiguous alignments (blue) and reads that did not align (red). (a) No simulated reads from RNA sequins mapped to the human reference genome (hg38) or to the five other eukaryote reference genomes tested (mm10, Mus musculus (mouse); gg4, Gallus gallus (chicken); dr10, Danio rerio (zebrafish); dm6, Drosophila melanogaster (fruit fly); ce10, Caenorhabditis elegans (roundworm)). (b) Negligible fractions of experimental reads from RNA sequins mapped to hg38 (<0.06% of total reads) or to any of the other five reference genomes. (c) No simulated reads from endogenous human genes (sampled from GENCODE) aligned to chrIS_R. (d) Negligible fractions of experimental reads from endogenous genes (K562 RNAseq library constituting ~152 million reads) aligned to chrIS_R (~0.03% of total reads). (Right panel) Venn diagram indicates the number of experimental reads from the neat mixture that aligned to chrIS_R, the hg38 genome, or both.
Supplementary Figure 5 Validation of sequin transcript assembly using simulated reads.
(a-d) Cumulative distribution plots compare the assembly sensitivity for RNA sequin isoforms (red) and endogenous human genes (blue) with simulated reads. For this simulation, we randomly selected 78 protein-coding genes (comprising 170 isoforms) from GENCODE basic annotation. Sensitivity is measured according to the fraction of correctly assembled nucleotides (a), exons (b), introns (c) and gene loci (d) relative to reference annotations.
Supplementary Figure 6 Validation of sequin transcript assembly using experimental reads.
(a) Schematic diagram (left) showing the assembly of an RNA sequin isoform (R2_81_1; blue) compared to an endogenous gene with matched transcript size and exon count (GNB2L1; red). Reference annotations are shown in color above the observed StringTie assembly (black outline). RNAseq alignment coverage is shown below (grey). Plot (right) illustrates the sensitivity of assembly (base-level) relative to experimental read depth. (b) Individual examples of a further nine RNA sequins (blue) and endogenous genes (red) that had matching length, exon count and sufficient read coverage in accompanying K562 sample.
Supplementary Figure 7 Examples of RNA sequin assembly.
(a) Bar charts indicate the sensitivity (Sn) and specificity (Sp) of isoform assembly using either guided (red) or unguided assembly (blue). (b) Scatter-plot illustrates the sensitivity and specificity of isoform assembly relative to sequence depth (decreasing library depths generated by sub-sampling alignments). (c-e) Genome browser view shows commonly observed assembly flaws including incorrect deconvolution of alternative transcription initiation/termination (c), incorrect prediction of boundaries of terminal exons (d), and failure to assemble micro-exons (e). Reference annotations are shown above (yellow) with unguided assembly shown below (black outline).
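As an illustration of how assembly sensitivity and specificity can be scored against a known annotation, the following is a minimal sketch. It assumes the convention common in gene-structure evaluation, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP), and it matches features by exact coordinates. The function name and the exon coordinates are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(reference, assembled):
    """Score assembled features against reference features by exact match.

    Sn = matched / reference features, Sp = matched / assembled features
    (an assumed convention; the study's exact scoring is not shown here).
    """
    reference, assembled = set(reference), set(assembled)
    true_pos = len(reference & assembled)
    sn = true_pos / len(reference) if reference else 0.0
    sp = true_pos / len(assembled) if assembled else 0.0
    return sn, sp


# Hypothetical exon coordinates (chromosome, start, end) for illustration only.
reference_exons = {
    ("chrIS_R", 100, 200),
    ("chrIS_R", 300, 400),
    ("chrIS_R", 500, 600),
    ("chrIS_R", 700, 800),
}
assembled_exons = {
    ("chrIS_R", 100, 200),
    ("chrIS_R", 300, 400),
    ("chrIS_R", 500, 650),  # terminal exon with a mispredicted boundary
}

sn, sp = sensitivity_specificity(reference_exons, assembled_exons)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

The same function can be applied to intron or transcript identifiers, since it only requires hashable features.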
Supplementary Figure 8 RNA sequins mixture design.
(a,b) Scatter-plots illustrate the expected combination of RNA sequins at staggered concentrations to establish a reference ladder against which to measure gene expression and alternative splicing. Sequin genes are combined at a 2-fold serial dilution, with a minimum of three genes per dilution, to span a 10^6-fold dynamic range in concentration. RNA sequin isoforms are combined at a log2-fold serial dilution to encompass a 32-fold dynamic range in alternative splicing. Scatter-plots indicate the expected dilution of variable (blue) and constant (red) RNA sequins between Mixture A and B. (c) RNA sequin fold changes between Mixture A and B are described by 9 sub-groups at the gene level (left) and 19 sub-groups at the isoform level (right).
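The arithmetic behind such a ladder can be sketched as follows. This is an illustrative calculation only: the top concentration of 1.0 (arbitrary units) and the function name are assumptions, and the actual sequin concentrations are not reproduced here.

```python
import math


def dilution_ladder(top_concentration, fold, n_levels):
    """Return concentrations for a serial dilution, highest first."""
    return [top_concentration / fold**i for i in range(n_levels)]


# Number of 2-fold steps needed to cover roughly a 10^6-fold range.
steps = math.ceil(math.log2(1e6))

# Hypothetical ladder in arbitrary units; steps + 1 levels give `steps` dilutions.
ladder = dilution_ladder(top_concentration=1.0, fold=2, n_levels=steps + 1)
dynamic_range = ladder[0] / ladder[-1]

print(steps, len(ladder), dynamic_range)
```

Under these assumptions, 20 two-fold steps (21 concentration levels) cover a 2^20-fold range, which is slightly more than 10^6. By the same reasoning, a 32-fold range in isoform ratios corresponds to five two-fold steps.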
Supplementary Figure 9 A diversity of alternative splicing events are encoded within the in silico chromosome.
(a-d) RNA sequins encode a diversity of alternative splicing events, including intron retention (a), exon skipping (cassette exons) (b), alternative transcription initiation/termination (c) and alternative donor/acceptor splice sites (d).
Supplementary Figure 10 Comparison of reads derived from RNA sequins and endogenous RNA.
(a-d) FastQC quality score plots for reads derived from RNA sequins (i.e. aligning to chrIS_R; left) and derived from K562 endogenous genes (i.e. aligning to human reference genome; right). FastQC quality scores on a per-nucleotide (a-b) and per-read (c-d) basis showed an indistinguishable error profile. (e-f) Normalized sequence coverage plots across length of RNA sequins (left; blue) or endogenous human genes (right; red). Thick line indicates mean, error bars indicate S.D.
Supplementary Figure 11 Exon-level assessment of sequin assembly and quantification.
(a) Scatter-plot illustrates the expected PSI (percentage spliced in) relative to observed PSI for exons in highly (left, orange) and lowly (right, green) expressed genes. This illustrates the quantitative accuracy with which alternative splicing events are measured and the expression dependency of this measurement. (b) For exon-level analysis, RNA sequin isoforms were first collapsed into 883 exon counting bins, comprising terminal (red) and internal (blue) exons. Genome browser view shows example R2_42 locus, with three alternative isoforms and 12 exon bins. RNAseq coverage is indicated (top grey histogram). (c,d) Left panels indicate correlation between expected exon concentration and measured expression; right panels indicate the sigmoidal relationship between exon assembly and input concentration. Terminal exons (c; red) exhibited poorer quantitative accuracy and assembly relative to internal exons (d; blue). n = 3 replicates; error bars indicate S.D.
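One way to derive an expected PSI from known input concentrations is sketched below. It assumes that the expected PSI of an exon is the summed concentration of the isoforms containing that exon, divided by the summed concentration of all isoforms of the gene. The isoform names, exon labels and concentrations are hypothetical.

```python
def expected_psi(isoform_conc, isoform_exons, exon):
    """Expected percentage spliced in (0-100) for one exon of one gene.

    isoform_conc:  isoform name -> input concentration
    isoform_exons: isoform name -> set of exon labels in that isoform
    """
    total = sum(isoform_conc.values())
    if total == 0:
        raise ValueError("gene has no input concentration")
    included = sum(
        conc for iso, conc in isoform_conc.items() if exon in isoform_exons[iso]
    )
    return 100.0 * included / total


# Hypothetical two-isoform gene mixed at a 3:1 ratio; exon E2 is a cassette exon.
conc = {"iso_1": 3.0, "iso_2": 1.0}
exons = {"iso_1": {"E1", "E2", "E3"}, "iso_2": {"E1", "E3"}}

psi_constitutive = expected_psi(conc, exons, "E1")
psi_cassette = expected_psi(conc, exons, "E2")
print(psi_constitutive, psi_cassette)
```

In this example the constitutive exon has an expected PSI of 100 and the cassette exon an expected PSI of 75, which can then be compared with the PSI observed from read counts.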
Supplementary Figure 12 Validation of RNA sequin abundance using qRT-PCR.
Observed log2-fold change (LFC) is plotted against expected LFC for a subset of RNA sequins across Mixtures A & B. Results obtained using RNAseq are shown in blue, while qRT-PCR results are shown in red. For the qRT-PCR assay, we targeted a subset of 10 RNA sequins with a range of different expected LFCs, and compared their relative abundances across mixtures.
Supplementary Figure 13 Fold changes between RNA sequin mixtures.
(a-b) Scatter-plots illustrate correlation between expected and observed log2-fold change (LFC) of isoforms (a) and exons (b) between Mixture A and B.
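A comparison of expected and observed fold changes of this kind can be sketched with the standard library alone. The sketch assumes that LFC is defined as log2(Mixture B / Mixture A) and uses Pearson correlation as the summary statistic. Both choices, and all values shown, are illustrative assumptions rather than the study's data or method.

```python
import math


def log2_fold_change(conc_a, conc_b):
    """log2 fold change of B relative to A (assumed direction)."""
    return math.log2(conc_b / conc_a)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (Mixture A, Mixture B) input concentrations for three features.
mixture_pairs = [(1.0, 4.0), (2.0, 2.0), (8.0, 1.0)]
expected = [log2_fold_change(a, b) for a, b in mixture_pairs]

# Made-up observed LFCs standing in for measured values.
observed = [1.8, 0.1, -2.7]

r = pearson(expected, observed)
print(expected, round(r, 4))
```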
Supplementary Figure 14 Using RNA sequins to assess the diagnosis of fusion genes.
(a) Comparison of detection sensitivity of endogenous fusion genes (n = 9; blue) and size-matched RNA fusion sequins (n = 9; red) using simulated reads. Error bars indicate S.D. (b) Scatter-plot indicates that split reads (individual reads spanning the fusion breakpoint; blue) correlate better with input concentration than do spanning pairs (read pairs aligning to different genes; red). (c) Schematic design of the mixture describes fusion sequin isoforms at a 64-fold range in abundance relative to their normal gene counterparts, with fusion/normal pairs encompassing a 128-fold range in expression. (d) Scatter-plot illustrates ‘limit of detection’ (LoD) and quantitative accuracy for fusion sequins (red) and their normal parent genes (orange). Two endogenous gene fusions (green and blue) detected in the accompanying K562 transcriptome are plotted on the scale. (e) Scatter-plot indicates correlation between observed and expected log2-fold change between each normal/fusion gene pair. The NUP214/XKR3 fusion (blue) is plotted on the scale. The BCR/ABL1 fusion lies off the scale, as the fusion transcript was more highly expressed than its parent counterpart. (f) Cumulative frequency histogram indicates the relative split read count between true-positives (yellow) and false-positives (magenta) detected on the in silico chromosome (chrIS_R), thereby informing the split read threshold that maximizes sensitivity and specificity of fusion gene detection. Endogenous fusions detected in the accompanying K562 transcriptome are also indicated (light blue).
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–14 (PDF 1806 kb)
About this article
Cite this article
Hardwick, S., Chen, W., Wong, T. et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat Methods 13, 792–798 (2016). https://doi.org/10.1038/nmeth.3958
DOI: https://doi.org/10.1038/nmeth.3958