To the Editor
Applications of deep-sequencing technologies in life science research and clinical diagnostics are rapidly expanding. Although fast data-processing algorithms exist (ref. 1), intuitive, portable data-evaluation solutions are still needed. Web tools have a history in bioinformatics of providing platform-independent, intuitive, barrier-free software solutions. Whereas in most scientific web tools a server performs intense calculations, the new HTML5 standard and the competition between web browser platforms have recently opened access to computational resources for web apps. However, so far web apps have been used only to visualize existing genome annotations or alignment data (refs. 2,3). Here we describe BrowserGenome (http://www.BrowserGenome.org), a web-based deep-sequencing data-analysis platform offering barcode deconvolution, read mapping, real-time data visualization, transcript-count analysis and data normalization. BrowserGenome is specifically focused on the evaluation of mRNA-seq data, but it can easily be extended to other applications. BrowserGenome matches the speed and memory footprint of state-of-the-art software while being visually driven and intuitive to use.
Read-mapping, visualization and transcript-counting algorithms were implemented in JavaScript through adaptation of a non-overlapping q-gram indexing algorithm (ref. 4), sorted data structures and random sampling (ref. 3) (Supplementary Note 1 and Supplementary Figs. 1 and 2). The read-mapping strategy was specifically designed to allow quantification of gene expression in the limited web browser environment, without aims of splice-variant detection, calling of single-nucleotide polymorphisms or the evaluation of paired-end sequencing data, as offered by other software (ref. 5). BrowserGenome uses raw sequencing data in FASTQ format or imports mapping results from other software in SAM format. It outputs binary or SAM-format mapping results or transcript-count tables. The graphical user interface displays the genome as a dynamic circle, with the mapping density displayed eccentrically (Fig. 1). The user navigates through the data using a mouse, with gestures similar to those used in web applications such as Google Maps. Reference gene names and exons are displayed at high zoom levels. Up to six hit-density tracks can be loaded in parallel. Wizard menus guide users through the read-mapping and transcript-counting processes (Supplementary Note 2).
To validate the performance of BrowserGenome, we analyzed a publicly available mRNA-seq data set from the ENCODE database (ref. 6) (human HepG2 cells; data set ENCFF000DPK) on a standard laptop computer. We observed that 59.2% of 26.6 million raw reads were mapped to the human genome at a rate of 18 million reads per hour. The hit-density map could be navigated in real time, and normalized transcript counts were calculated in less than two seconds (Supplementary Table 1). Despite BrowserGenome's simple read-mapping algorithm, analyzing the same data with the established STAR software (ref. 5) produced highly correlated transcript-count data (Pearson R = 0.974; Supplementary Fig. 3) and near-equal correlation coefficients between gene expression results and sequencing-independent gene expression data (Supplementary Fig. 4).
BrowserGenome's usability and accessibility compare favorably with those of other graphics-based RNA-seq evaluation tools (Supplementary Fig. 5). The core functions can be easily extended or incorporated into other web apps through a library interface (Supplementary Note 3). The platform-independent web app does not transfer any scientific data via the Internet, does not depend on third-party code and is open-source software under the terms of the GNU General Public License version 2.
References
1. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
2. Skinner, M.E., Uzilov, A.V., Stein, L.D., Mungall, C.J. & Holmes, I.H. JBrowse: a next-generation genome browser. Genome Res. 19, 1630–1638 (2009).
3. Miller, C.A., Qiao, Y., DiSera, T., D'Astous, B. & Marth, G.T. bam.iobio: a web-based, real-time, sequence alignment file inspector. Nat. Methods 11, 1189 (2014).
4. Ukkonen, E. Approximate string-matching with q-grams and maximal matches. Theor. Comput. Sci. 92, 191–211 (1992).
5. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
6. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Acknowledgements
We thank H. Hauswedell (Institut für Informatik, Freie Universität Berlin, Berlin, Germany) for critical comments. This work was supported by grants from the German Research Foundation (EXC1023) and the European Research Council to V.H. and from the German National Academic Foundation to J.L.S.-B.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Fast read-mapping algorithm of BrowserGenome.
(a) Indexing strategy: The genome sequence of interest is divided into non-overlapping 12-mers. A Hook table is generated that contains the genome position of the first occurrence of every possible 12-mer sequence. This table therefore has 4^12 (about 16.8 million) entries and occupies 67 MB of RAM (4 bytes per entry). A second table, called the Jump table, lists for every genomic 12-mer position the position at which the same 12-mer next occurs in the genome. Its number of rows equals the genome length divided by 12, so that for the human genome it occupies 1.1 GB of RAM (4 bytes per entry). (b) Fast sequence search: From a 25-mer search sequence, 12 overlapping 12-mers are extracted (left panel). For each 12-mer, all genomic occurrences are retrieved by looking them up in the Hook table and then iterating through the Jump table (right panel). At every 12-mer occurrence in the genome, the whole 25-mer search sequence is locally matched to the genome (100% identity, no gaps allowed).
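The Hook and Jump tables described in this legend can be illustrated with a small Python sketch. It is not BrowserGenome's JavaScript code: the q-gram length is shrunk from 12 to 3 so that the toy genome stays readable, the function names are hypothetical, block indices are stored instead of byte-packed positions, and only the letters A, C, G and T are assumed to occur.

```python
Q = 3  # toy q-gram length (assumption); the legend describes q = 12

# Size of the real Hook table: 4**12 entries of 4 bytes each, about 67 MB.
HOOK_BYTES_FOR_12MERS = 4 ** 12 * 4

def encode(qgram):
    """Pack a q-gram over the alphabet ACGT into an integer in [0, 4**len(qgram))."""
    code = 0
    for base in qgram:
        code = code * 4 + "ACGT".index(base)
    return code

def build_index(genome, q=Q):
    """Build Hook (first block per q-gram) and Jump (next block with the same q-gram)."""
    n_blocks = len(genome) // q
    hook = [-1] * (4 ** q)   # -1 means this q-gram never starts a block
    jump = [-1] * n_blocks   # -1 means no later block holds the same q-gram
    last_seen = {}           # most recent block index per q-gram code
    for block in range(n_blocks):
        code = encode(genome[block * q:(block + 1) * q])
        if code in last_seen:
            jump[last_seen[code]] = block
        else:
            hook[code] = block
        last_seen[code] = block
    return hook, jump

def search(genome, read, hook, jump, q=Q):
    """Return all exact, ungapped match positions of read (length >= 2q - 1)."""
    hits = set()
    # q consecutive offsets guarantee that one read q-gram lines up with a genome block.
    for offset in range(q):
        if offset + q > len(read):
            break
        block = hook[encode(read[offset:offset + q])]
        while block != -1:
            start = block * q - offset
            # Verify the whole read at the candidate position.
            if start >= 0 and genome[start:start + len(read)] == read:
                hits.add(start)
            block = jump[block]
    return sorted(hits)

genome = "GATTACAGATTACACCGATTACA"
hook, jump = build_index(genome)
print(search(genome, "ATTACA", hook, jump))  # [1, 8, 17]
```

Because only non-overlapping genome blocks are indexed, the Jump table needs one entry per block rather than one per base, which is what keeps the memory footprint near the figures quoted above.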
Supplementary Figure 2 Fast visualization and gene-counting algorithms of BrowserGenome.
(a) For visualization of hits in an exemplary viewing range spanning from position 50 to position 80, an unsorted hit list has to be scanned from top to bottom in order to filter visible hits (left panel). BrowserGenome instead uses sorted hit lists (right panel), in which only the first and the last visible hit entries need to be located. Localization works in O(log n) time by the algorithm described in Supplementary Note 1. Single comparison operations are color-coded in blue (equal to or greater than target) and green (smaller than target). (b) For counting the hit numbers in annotated exonic regions at a genome-wide scale, a naive algorithm would require testing all hit positions for being included in all annotated exons, which would be computationally intense (left panel). BrowserGenome therefore makes use of both sorted exon and hit lists (right panel), which allows genome-wide exon hit counting in seconds by the algorithm detailed in Supplementary Note 1.
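Both ideas in this legend can be sketched in a few lines of Python. Supplementary Note 1 is not reproduced here, so the code below is only one plausible reading: a binary search via the standard `bisect` module for the viewing range, and a single linear sweep over hits and exons for counting. The function names, the inclusive coordinates and the assumption that exons do not overlap are all illustrative choices.

```python
from bisect import bisect_left, bisect_right

def visible_hits(sorted_hits, view_start, view_end):
    """Return hits inside [view_start, view_end] using two O(log n) binary searches."""
    first = bisect_left(sorted_hits, view_start)   # first hit >= view_start
    last = bisect_right(sorted_hits, view_end)     # one past the last hit <= view_end
    return sorted_hits[first:last]

def count_hits_per_exon(sorted_hits, sorted_exons):
    """Count hits per exon in one pass over both sorted lists.

    Assumes exons are (start, end) pairs, inclusive, sorted and non-overlapping.
    """
    counts = []
    i = 0
    n = len(sorted_hits)
    for start, end in sorted_exons:
        while i < n and sorted_hits[i] < start:    # skip hits before this exon
            i += 1
        j = i
        while j < n and sorted_hits[j] <= end:     # consume hits inside this exon
            j += 1
        counts.append(j - i)
        i = j                                      # never rewind: total work is O(n + m)
    return counts

hits = [12, 47, 50, 55, 63, 80, 81, 95]
exons = [(10, 20), (45, 55), (80, 100)]
print(visible_hits(hits, 50, 80))        # [50, 55, 63, 80]
print(count_hits_per_exon(hits, exons))  # [1, 3, 3]
```

The naive alternative tests every hit against every exon, which scales with the product of the two list lengths rather than their sum.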
Supplementary Figure 3 RNA-seq data mapping performance comparison between STAR and BrowserGenome.org.
(a) The RNA-seq test data set ENCFF000DPK containing 26,642,287 raw deep-sequencing reads retrieved from human HepG2 cells was downloaded from encodeproject.org and was analyzed on two different computers using STAR 2.4.2a or BrowserGenome.org. (b) Correlation of gene expression quantification results from STAR and BrowserGenome.org using the same data set. Plotted are the absolute numbers of reads mapped to individual genes on a semi-logarithmic scale with added jitter. The Pearson correlation coefficient R was calculated after jitter addition and logarithmization.
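The order of operations in panel (b), jitter first, then logarithm, then Pearson correlation, can be sketched as follows. The legend does not state the jitter distribution or its magnitude, so the uniform jitter between 0.1 and 1.0 used here is purely an assumption (chosen so that zero counts remain loggable), as are the function names and the toy counts.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_jitter_pearson(counts_a, counts_b, seed=0):
    """Add jitter to raw counts, take log10, then correlate.

    The jitter range (0.1 to 1.0) is a hypothetical choice, not taken from the source.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible

    def transform(counts):
        return [math.log10(c + rng.uniform(0.1, 1.0)) for c in counts]

    return pearson(transform(counts_a), transform(counts_b))

# Toy per-gene read counts from two hypothetical mappers.
mapper_a = [0, 5, 50, 500, 5000]
mapper_b = [1, 4, 60, 450, 5200]
r = log_jitter_pearson(mapper_a, mapper_b)
print(round(r, 3))
```

Computing R on log-transformed values keeps a handful of very highly expressed genes from dominating the coefficient.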
Supplementary Figure 4 Correlation of transcript-quantification results of STAR and BrowserGenome with nanoString quantification data.
(a,b) ENCODE raw RNA-seq data set ENCFF000DPK was analyzed using STAR (a) or BrowserGenome.org (b) with default parameters. nCounter data of 52 exemplary genes were retrieved from ref. 3. Plotted are the base-10 logarithms of the nCounter counts (x-axis) and RPKM values (y-axis), incremented by 10 and 0.1, respectively, in order to avoid infinite values for zero counts. Pearson correlation coefficients R were calculated after logarithmization.
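The axis transformation in this legend can be made concrete with a short Python sketch. RPKM is computed here with its standard definition (reads per kilobase of exon length per million mapped reads); whether BrowserGenome normalizes in exactly this way is not stated in this legend, and the function names and example numbers are hypothetical. The offsets of 10 and 0.1 are the ones given above.

```python
import math

NCOUNTER_OFFSET = 10.0  # added to nCounter counts before log10 (from the legend)
RPKM_OFFSET = 0.1       # added to RPKM values before log10 (from the legend)

def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads (standard definition)."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

def log_axes(ncounter_count, rpkm_value):
    """Return the (x, y) plot coordinates; the offsets keep zero values finite."""
    x = math.log10(ncounter_count + NCOUNTER_OFFSET)
    y = math.log10(rpkm_value + RPKM_OFFSET)
    return x, y

# Hypothetical gene: 500 reads on 2,000 bp of exons, 10 million mapped reads in total.
value = rpkm(500, 2000, 10_000_000)
print(value)               # 25.0
print(log_axes(0, 0.0))    # a gene with no signal maps to roughly (1.0, -1.0)
```

Without the offsets, any gene with a zero count on either platform would map to minus infinity and would have to be dropped before computing R.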
Supplementary Figure 5 Feature comparison of BrowserGenome with three established graphics-based RNA-seq data-evaluation software tools.
Current versions of Galaxy, CLC Genomics Workbench and Chipster were compared to BrowserGenome with regard to the features given in the first column.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–5, Supplementary Notes 1–3 (PDF 2039 kb)
Supplementary Table 1
Normalized transcript counts calculated from RNA-seq data of human HepG2 cells using BrowserGenome. (XLSX 1040 kb)
Cite this article
Schmid-Burgk, J., Hornung, V. BrowserGenome.org: web-based RNA-seq data analysis and visualization. Nat Methods 12, 1001 (2015). https://doi.org/10.1038/nmeth.3615
DOI: https://doi.org/10.1038/nmeth.3615