Over 70% of the human genome can be transcribed, yet never all in the same cell type, according to recent ENCODE project results, and a wealth of long, small, micro and even circular RNAs have been found in nature. To study this remarkable diversity of transcripts, researchers have developed many techniques, but none is as comprehensive and powerful as high-throughput sequencing of RNA (RNA-seq) (News and Views, p. 882). This technology, particularly its use in large-scale collaborative studies, is the subject of this Focus on RNA sequencing quality control (SEQC).

Three of the five studies in the Focus report results from the SEQC project, an effort led by the Microarray Quality Control (MAQC) Consortium of the US Food and Drug Administration to evaluate the comparability of RNA-seq data across laboratories and to assess different sequencing platforms and data analysis approaches and their performance relative to DNA microarrays.

The main results from the project reveal that different groups using RNA-seq to discover splice junctions and analyze differential gene expression (which genes go up in expression and which go down) can obtain robust and reproducible results when the appropriate bioinformatic methods are used, but measurement of absolute levels of gene expression and profiling of the various transcript isoforms remain challenging (Article, p. 903). An independent study by the Association of Biomolecular Resource Facilities (ABRF) performed complementary analyses using MAQC samples but included long-read and semiconductor-based sequencers and also evaluated different RNA-preparation protocols (Article, p. 915). Notably, the ABRF study identifies effective protocols for doing RNA-seq on degraded RNA samples, opening new possibilities for transcriptome analysis of formalin-fixed, paraffin-embedded tissues, which will be important for application of the technology to the clinic.

The concordance between RNA-seq and microarray measurements of gene expression is another important question for the field, especially if legacy microarray data are going to be compared with data generated from RNA-seq. A second paper from the SEQC initiative analyzes gene expression in samples from the livers of rats exposed to chemical toxins and finds that the level of agreement between RNA-seq and microarray data depends on the strength of the perturbation elicited by the chemical agent (Article, p. 926).

Proper comparison of multiple RNA-seq data sets requires normalization to account for biases introduced by different sequencing sites, instruments and protocols. The third manuscript from the SEQC project assesses a collection of normalization approaches and finds some that work well for study designs where samples are distributed among several sequencing sites (Analysis, p. 888). Externally added, spike-in control RNAs mixed into each sample should facilitate normalization, but, in practice, assay biases can affect spike-in RNAs differently from other RNAs in a sample; this necessitated the development of the first algorithmic approach that accounts for these biases in spike-in data (Analysis, p. 896). Collectively, these studies of normalization, along with a new tool for assessment and visualization of data from samples containing spike-ins from the External RNA Control Consortium (ERCC) (Munro et al. Nat. Commun., in the press), should make spike-in controls easier to use.

Although much work remains to be done, the multiplatform, cross-site studies reported here represent initial steps toward RNA-seq analysis of large cohorts for discovery research and clinical use (News and Views, p. 884). The future may bring new developments in robotics and ‘library preparation as a service’ as strategies to increase reproducibility of RNA-seq (News and Views, p. 882) and the need to define new concepts, such as ‘transcripts of unknown significance’, to complement the existing genomic ‘variants of unknown significance’ in clinical analyses (News and Views, p. 884).