Spliced synthetic genes as internal controls in RNA sequencing experiments


RNA sequencing (RNA-seq) can be used to assemble spliced isoforms, quantify expressed genes and provide a global profile of the transcriptome. However, the size and diversity of the transcriptome, the wide dynamic range in gene expression and inherent technical biases confound RNA-seq analysis. We have developed a set of spike-in RNA standards, termed 'sequins' (sequencing spike-ins), that represent full-length spliced mRNA isoforms. Sequins have an entirely artificial sequence with no homology to natural reference genomes, but they align to gene loci encoded on an artificial in silico chromosome. The combination of multiple sequins across a range of concentrations emulates alternative splicing and differential gene expression, and it provides scaling factors for normalization between samples. We demonstrate the use of sequins in RNA-seq experiments to measure sample-specific biases and determine the limits of reliable transcript assembly and quantification in accompanying human RNA samples. In addition, we have designed a complementary set of sequins that represent fusion genes arising from rearrangements of the in silico chromosome to aid in cancer diagnosis. RNA sequins provide a qualitative and quantitative reference with which to navigate the complexity of the human transcriptome.
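As a rough illustration of how spike-ins can supply a scaling factor between samples, the sketch below takes the median ratio of sequin abundances measured in two libraries, assuming both received the same sequin mixture in equal amount. The function name, the sequin identifiers and the median-ratio estimator are assumptions made for illustration; the abstract does not state which estimator the authors used.

```python
from statistics import median

def sequin_scaling_factor(sample_a, sample_b):
    """Return a factor s so that s * (sample_b abundance) is comparable to sample_a.

    Each argument maps a sequin ID to its measured abundance (any consistent unit).
    Only sequins detected (abundance > 0) in both samples are used.
    """
    shared = [k for k in sample_a
              if k in sample_b and sample_a[k] > 0 and sample_b[k] > 0]
    if not shared:
        raise ValueError("no sequins detected in both samples")
    return median(sample_a[k] / sample_b[k] for k in shared)

# Hypothetical measurements: library B was sequenced at half the depth of A.
a = {"seq_1": 100.0, "seq_2": 50.0, "seq_3": 8.0}
b = {"seq_1": 50.0, "seq_2": 25.0, "seq_3": 4.0}
factor = sequin_scaling_factor(a, b)
```

Multiplying the endogenous gene abundances of library B by `factor` would then place them on the scale of library A under these assumptions.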

Figure 1: Schematic overview illustrating the design and use of RNA sequins.
Figure 2: Using RNA sequins to assess transcript assembly.
Figure 3: Expression and alternative splicing of RNA sequins.
Figure 4: Differential gene expression of sequin mixtures between samples.
Figure 5: Overview of fusion RNA sequins.

Accession codes

Primary accessions

Gene Expression Omnibus



The authors would like to thank the following funding sources: Australian National Health and Medical Research Council (NHMRC) Australia Fellowship (1062470 to T.R.M. and 1062606 to W.Y.C.). S.A.H. and I.W.D. are supported by Australian Postgraduate Award scholarships. The contents of the published material are solely the responsibility of the administering institution, a participating institution or individual authors and do not reflect the views of NHMRC. The authors would also like to thank D. Thomson and M. Smith (Garvan Institute of Medical Research) for helpful discussions during manuscript preparation.

Author information

T.R.M. and J.S.M. conceived the project, designed sequins and in silico chromosome, and conceived experiments. W.Y.C. and S.B.A. performed experimental work. J.B. performed qRT-PCR validation. L.K.N. contributed supervision and manuscript preparation. S.A.H., T.W. and T.R.M. performed bioinformatic analyses. S.A.H., I.W.D. and T.R.M. prepared the manuscript.

Corresponding author

Correspondence to Tim R Mercer.

Ethics declarations

Competing interests

Garvan Institute of Medical Research has filed a patent application (PCT/AU2015/050797) on some techniques described in this study.

Integrated supplementary information

Supplementary Figure 1 Schematic overview of the use of RNA sequins in an RNAseq experiment.

RNA sequins are added directly to an RNA sample prior to library preparation and sequencing (yellow). Resultant reads are aligned to a co-index comprising both the natural reference genome and in silico chromosome (blue). Due to their distinguishing artificial sequence, RNA sequins can be used for quantitative analysis and normalization (red) without interfering with the accompanying sample. RNA sequins enable an assessment of many steps during the RNAseq workflow, with notable examples indicated (dashed boxes).

Supplementary Figure 2 Design of repeat sequences within the in silico chromosome.

(a) Reference annotations indicate repeat elements encoded within the in silico chromosome (chrIS_R), including mobile repeats based on human analogs (e.g. LINEs, SINEs and LTRs) and also tandem repeat domains (e.g. telomeres and centromeres). The inclusion of repetitive elements emulates the distribution of unique sequences across the chromosome that influences alignment. (b) A cumulative frequency distribution shows the repeat density on chrIS_R (red) and human reference genome (blue). (c) Frequency distribution of k-mer sequences within chrIS_R (red) and human reference genome sequence (blue).

Supplementary Figure 3 Design of artificial gene loci that proportionately represent the human transcriptome.

Cumulative frequency histograms compare genetic features encoded on the in silico chromosome (red) to human genes (blue) according to total gene size (a), transcript length (b), exon count (c), exon size (d), intergenic distance (e), GC content (exons) (f), number of isoforms per gene (g) and intron size (h).

Supplementary Figure 4 Sequins exhibit negligible cross-alignment to natural reference genomes.

(a-d) Bar charts indicate the percentage of unique alignments (orange), ambiguous alignments (blue) and reads that did not align (red). (a) No simulated reads from RNA sequins mapped to the human reference genome (hg38), nor to five other eukaryote reference genomes tested (mm10, Mus musculus (mouse); gg4, Gallus gallus (chicken); dr10, Danio rerio (zebrafish); dm6, Drosophila melanogaster (fruit fly); ce10, Caenorhabditis elegans (roundworm)). (b) Negligible fractions of experimental reads from RNA sequins mapped to hg38 (<0.06% of total reads) or to any of the other five reference genomes. (c) No simulated reads from endogenous human genes (sampled from GENCODE) aligned to chrIS_R. (d) Negligible fractions of experimental reads from endogenous genes (K562 RNAseq library constituting ~152 million reads) aligned to chrIS_R (~0.03% of total reads). (Right panel) Venn diagram indicates the number of experimental reads from neat mixture that aligned to chrIS_R, hg38 genome, or both.

Supplementary Figure 5 Validation of sequin transcript assembly using simulated reads.

(a-d) Cumulative distribution plots compare the assembly sensitivity for RNA sequin isoforms (red) and endogenous human genes (blue) with simulated reads. For this simulation, we randomly selected 78 protein-coding genes (comprising 170 isoforms) from GENCODE basic annotation. Sensitivity is measured according to the fraction of correctly assembled nucleotides (a), exons (b), introns (c) and gene loci (d) relative to reference annotations.
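The nucleotide-level measure described in this caption can be sketched as follows. Sensitivity is the fraction of reference bases recovered by the assembly. The companion value returned here is the fraction of assembled bases that lie within the reference (often called precision); treating that as the matching specificity measure is an assumption, as are the function names and the half-open coordinate convention.

```python
def covered_bases(intervals):
    """Expand half-open (start, end) intervals into a set of base positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases

def base_level_accuracy(reference_exons, assembled_exons):
    """Return (sensitivity, specificity) at the nucleotide level."""
    ref = covered_bases(reference_exons)
    asm = covered_bases(assembled_exons)
    true_positive = len(ref & asm)  # bases present in both annotation and assembly
    sensitivity = true_positive / len(ref) if ref else 0.0
    specificity = true_positive / len(asm) if asm else 0.0
    return sensitivity, specificity

# Hypothetical locus: the second exon is only half assembled, and one
# spurious exon is predicted outside the annotation.
reference = [(100, 200), (300, 400)]
assembled = [(100, 200), (300, 350), (500, 550)]
sn, sp = base_level_accuracy(reference, assembled)
```

The same set-overlap idea extends to exons, introns or whole loci by comparing sets of features instead of sets of bases.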

Supplementary Figure 6 Validation of sequin transcript assembly using experimental reads.

(a) Schematic diagram (left) showing the assembly of an RNA sequin isoform (R2_81_1; blue) compared to an endogenous gene with matched transcript size and exon count (GNB2L1; red). Reference annotations are shown in color above the observed StringTie assembly (black outline). RNAseq alignment coverage is shown below (grey). Plot (right) illustrates the sensitivity of assembly (base-level) relative to experimental read depth. (b) Individual examples of a further nine RNA sequins (blue) and endogenous genes (red) that had matching length, exon count and sufficient read coverage in accompanying K562 sample.

Supplementary Figure 7 Examples of RNA sequin assembly.

(a) Bar charts indicate the sensitivity (Sn) and specificity (Sp) of isoform assembly using either guided (red) or unguided assembly (blue). (b) Scatter-plot illustrates the sensitivity and specificity of isoform assembly relative to sequence depth (decreasing library depths generated by sub-sampling alignments). (c-e) Genome browser view shows commonly observed assembly flaws including incorrect deconvolution of alternative transcription initiation/termination (c), incorrect prediction of boundaries of terminal exons (d), and failure to assemble micro-exons (e). Reference annotations are shown above (yellow) with unguided assembly shown below (black outline).

Supplementary Figure 8 RNA sequins mixture design.

(a,b) Scatter-plots illustrate the expected combination of RNA sequins at staggered concentrations to establish a reference ladder against which to measure gene expression and alternative splicing. Sequin genes are combined at a 2-fold serial dilution, with a minimum of three genes per dilution, to span a 10^6-fold dynamic range in concentration. RNA sequin isoforms are combined at a log2-fold serial dilution to encompass a 32-fold dynamic range in alternative splicing. Scatter-plots indicate the expected dilution of variable (blue) and constant (red) RNA sequins between Mixture A and B. (c) RNA sequin fold changes between Mixture A and B are described by 9 sub-groups at the gene level (left) and 19 sub-groups at the isoform level (right).
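To make the ladder arithmetic concrete, the sketch below builds a 2-fold serial dilution and counts how many levels are needed to cover a given dynamic range. It assumes the gene-level range is 10^6-fold (about 2^20); the concentrations are arbitrary units and the function names are illustrative, not taken from the study.

```python
def levels_for_range(dynamic_range, fold):
    """Smallest number of ladder levels whose top/bottom ratio reaches dynamic_range."""
    levels, span = 1, 1
    while span < dynamic_range:
        span *= fold
        levels += 1
    return levels

def dilution_ladder(top_concentration, fold, levels):
    """Concentrations from the highest level downwards, each `fold` times lower."""
    return [top_concentration / fold ** i for i in range(levels)]

levels = levels_for_range(10 ** 6, 2)           # 2**20 is the first power of 2 >= 10**6
ladder = dilution_ladder(2 ** 20, 2, levels)    # arbitrary units, top level = 2**20
```

Under these assumptions the ladder has 21 levels, running from 2^20 down to 1 in arbitrary units.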

Supplementary Figure 9 A diversity of alternative splicing events are encoded within the in silico chromosome.

(a-d) RNA sequins encode a diversity of alternative splicing events, including intron retention (a), exon skipping (cassette exons) (b), alternative transcription initiation/termination (c) and alternative donor/acceptor splice sites (d).

Supplementary Figure 10 Comparison of reads derived from RNA sequins and endogenous RNA.

(a-d) FastQC quality score plots for reads derived from RNA sequins (i.e. aligning to chrIS_R; left) and derived from K562 endogenous genes (i.e. aligning to human reference genome; right). FastQC quality scores on a per-nucleotide (a-b) and per-read (c-d) basis showed an indistinguishable error profile. (e-f) Normalized sequence coverage plots across length of RNA sequins (left; blue) or endogenous human genes (right; red). Thick line indicates mean, error bars indicate S.D.

Supplementary Figure 11 Exon-level assessment of sequin assembly and quantification.

(a) Scatter-plot illustrates the expected PSI (percentage spliced in) relative to observed PSI for exons in highly (left, orange) and lowly (right, green) expressed genes. This illustrates the quantitative accuracy with which alternative splicing events are measured and the expression dependency of this measurement. (b) For exon-level analysis, RNA sequin isoforms were first collapsed into 883 exon counting bins, comprising terminal (red) and internal (blue) exons. Genome browser view shows example R2_42 locus, with three alternative isoforms and 12 exon bins. RNAseq coverage is indicated (top grey histogram). (c,d) Left panels indicate correlation between expected exon concentration and measured expression; right panels indicate the sigmoidal relationship between exon assembly and input concentration. Terminal exons (c; red) exhibited poorer quantitative accuracy and assembly relative to internal exons (d; blue). n = 3 replicates; error bars indicate S.D.
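A simplified sketch of the PSI calculation follows. The expected PSI of an exon is taken as the share of a gene's total isoform concentration contributed by isoforms that contain the exon, and the observed PSI as the share of inclusion reads among inclusion plus exclusion reads. Both functions ignore length normalization, and the names and example values are hypothetical.

```python
def expected_psi(isoform_conc, isoform_exons, exon):
    """Expected percentage spliced in for `exon`, given known isoform concentrations."""
    total = sum(isoform_conc.values())
    if total == 0:
        raise ValueError("gene has zero total concentration")
    included = sum(conc for iso, conc in isoform_conc.items()
                   if exon in isoform_exons[iso])
    return 100.0 * included / total

def observed_psi(inclusion_reads, exclusion_reads):
    """Observed PSI from read counts supporting inclusion or exclusion of an exon."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no informative reads for this exon")
    return 100.0 * inclusion_reads / total

# Hypothetical two-isoform gene: exon e2 is a cassette exon skipped in iso_2.
conc = {"iso_1": 30.0, "iso_2": 10.0}
exons = {"iso_1": {"e1", "e2", "e3"}, "iso_2": {"e1", "e3"}}
psi_expected = expected_psi(conc, exons, "e2")
psi_observed = observed_psi(inclusion_reads=60, exclusion_reads=40)
```

Comparing `psi_expected` with `psi_observed` across many exons gives the kind of expected-versus-observed relationship plotted in panel (a).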

Supplementary Figure 12 Validation of RNA sequin abundance using qRT-PCR.

Observed log2-fold change (LFC) is plotted against expected LFC for a subset of RNA sequins across Mixtures A & B. Results obtained using RNAseq are shown in blue, while qRT-PCR results are shown in red. For the qRT-PCR assay, we targeted a subset of 10 RNA sequins with a range of different expected LFCs, and compared their relative abundances across mixtures.

Supplementary Figure 13 Fold changes between RNA sequin mixtures.

(a-b) Scatter-plots illustrate correlation between expected and observed log2-fold change (LFC) of isoforms (a) and exons (b) between Mixture A and B.
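The comparison of expected and observed fold changes can be sketched as below. The direction of the ratio (Mixture B over Mixture A) and the use of Pearson correlation are assumptions made for illustration, and the abundance values are invented.

```python
import math

def log2_fold_change(abundance_a, abundance_b):
    """log2 of the B/A abundance ratio (direction is an assumption)."""
    return math.log2(abundance_b / abundance_a)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sequins: expected LFC from the mixture design, and measured
# abundances as (Mixture A, Mixture B) pairs.
expected_lfc = [-2.0, 0.0, 1.0, 3.0]
measured = [(40.0, 10.0), (10.0, 10.0), (5.0, 10.0), (2.0, 16.0)]
observed_lfc = [log2_fold_change(a, b) for a, b in measured]
r = pearson_r(expected_lfc, observed_lfc)
```

In this toy case the observed values match the design exactly, so the correlation is perfect; real data would scatter around the diagonal, typically more so at low concentration.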

Supplementary Figure 14 Using RNA sequins to assess the diagnosis of fusion genes.

(a) Comparison of detection sensitivity of endogenous fusion genes (n = 9; blue) and size-matched RNA fusion sequins (n = 9; red) using simulated reads. Error bars indicate S.D. (b) Scatter-plot indicates that split reads (individual reads spanning the fusion breakpoint; blue) correlate better with input concentration than do spanning pairs (read pairs aligning to different genes; red). (c) Schematic design of mixture describes fusion sequin isoforms at a 64-fold range in abundance relative to their normal gene counterparts, with fusion/normal pairs encompassing a 128-fold range in expression. (d) Scatter-plot illustrates ‘limit of detection’ (LoD) and quantitative accuracy for fusion sequins (red) and their normal parent genes (orange). Two endogenous gene fusions (green and blue) detected in the accompanying K562 transcriptome are plotted on the scale. (e) Scatter-plot indicates correlation between observed and expected log2-fold change between each normal/fusion gene pair. The NUP214/XKR3 fusion (blue) is plotted on the scale. The BCR/ABL1 fusion lies off the scale, as the fusion transcript was more highly expressed than its parent counterpart. (f) Cumulative frequency histogram indicates the relative split read count between true-positives (yellow) and false-positives (magenta) detected on the in silico chromosome (chrIS_R), thereby informing the split read threshold that maximizes sensitivity and specificity of fusion gene detection. Endogenous fusions detected in the accompanying K562 transcriptome are also indicated (light blue).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 (PDF 1806 kb)

About this article

Cite this article

Hardwick, S., Chen, W., Wong, T. et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat Methods 13, 792–798 (2016).
