bam.iobio: a web-based, real-time, sequence alignment file inspector

Miller, Chase A; Qiao, Yi; DiSera, Tonya; D'Astous, Brian; Marth, Gabor T

doi:10.1038/nmeth.3174

Correspondence
Published: 25 November 2014

Nature Methods volume 11, page 1189 (2014)

To the Editor:

The analysis of big genomic data sets today engenders an all-or-nothing approach, i.e., complete, end-to-end analysis, which is time consuming and unintuitive; it also requires considerable computational expertise and costly computer infrastructure, effectively excluding many bench biologists from genome-scale analyses. We have developed and are continually expanding a web-based analysis system, iobio (http://iobio.io/), to empower all biological researchers to analyze—easily, interactively and in a visually driven manner—large biomedical data sets that are essential for their research, without onerous resource requirements. A primary example of genome-scale 'big data' is the BAM¹ format DNA sequence alignment file. BAM files underlie diverse types of genetic analyses, acting as the universal currency of high-throughput sequence analysis. Here we report the first complete iobio web app, bam.iobio (Fig. 1; http://bam.iobio.io/), an open-source dashboard web application providing an insightful overview of the contents of these large, non–human-readable BAM files and enabling users to further analyze their alignments, all in real time.

**Figure 1: The bam.iobio.io web application.**

The user selects a BAM file either hosted remotely or from his or her own computer's hard drive, and then our app calculates and displays, within a few seconds, crucial information about the sequence alignment: (i) the average read coverage and its distribution, (ii) the composition of the data set according to read length, (iii) the fragment-length average, distribution and outliers, (iv) the histogram of base quality values (to identify a bad sequencing run) and read duplication rate (to identify low library complexity), and (v) the histogram of mapping quality values and fraction of properly mapped read pairs (to identify poor mapping results).
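Several of the quantities listed above can be read directly from individual alignment records. The following is a minimal sketch, not the bam.iobio implementation: it assumes the alignments are available as SAM-format text lines, uses the flag bits defined in the SAM specification, and takes the proper-pair fraction to mean properly paired reads divided by all paired reads, which is an assumption about the definition. The function name `summarize_sam_lines` and the example records are made up for illustration.

```python
from collections import Counter

# Flag bits defined by the SAM specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


def summarize_sam_lines(lines):
    """Summarize SAM alignment lines; header lines starting with '@' are skipped."""
    total = paired = proper = duplicates = 0
    mapq_hist = Counter()
    read_length_hist = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
        flag, mapq, seq = int(fields[1]), int(fields[4]), fields[9]
        total += 1
        if flag & FLAG_DUPLICATE:
            duplicates += 1
        if flag & FLAG_PAIRED:
            paired += 1
            if flag & FLAG_PROPER_PAIR:
                proper += 1
        if not flag & FLAG_UNMAPPED:
            mapq_hist[mapq] += 1  # mapping quality is only meaningful for mapped reads
        if seq != "*":
            read_length_hist[len(seq)] += 1
    return {
        "records": total,
        "duplicate_rate": duplicates / total if total else 0.0,
        "proper_pair_fraction": proper / paired if paired else 0.0,
        "mapq_histogram": dict(mapq_hist),
        "read_length_histogram": dict(read_length_hist),
    }


# Made-up records for illustration only.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t8M\t=\t200\t108\tACGTACGT\tIIIIIIII",
    "r1\t147\tchr1\t200\t60\t8M\t=\t100\t-108\tACGTACGT\tIIIIIIII",
    "r2\t1089\tchr1\t300\t20\t8M\tchr2\t500\t0\tACGTACGT\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

summary = summarize_sam_lines(EXAMPLE_SAM)
print(summary)
```

Because every statistic here is a simple count or histogram over records, the same function works unchanged on a random subset of the records, which is what makes a sampling-based estimate possible.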

Collecting such vital alignment statistics using current tools requires placing the BAM file on a Unix machine and then installing and running Unix programs such as SAMTools¹ or BamTools² on the entire BAM file. This process may take hours to complete, e.g., the 18-gigabyte BAM file in our tests took 8 hours to process (Supplementary Table 1). In contrast, our approach is to collect a random sample of the read alignments (Supplementary Fig. 1) to accurately estimate the same alignment statistics in seconds (Supplementary Fig. 2). Notably, sampling takes place where the BAM file is stored (i.e., on cloud storage or a user's hard drive), and only the sampled data—a tiny fraction of the entire BAM file—are ever transmitted. The alignments are then streamed to data analysis web services that produce appropriate alignment statistics in seconds before transmitting these to bam.iobio for visualization (for implementation details, licensing and deployment considerations, see Supplementary Note; for system compatibility, see Supplementary Table 2). We can now analyze the same 18-gigabyte alignment file in <10 seconds. Real-time visualization allows the user to experience how the statistical distributions progressively converge and become stable as sampled alignment data are collected. The user can further explore the data interactively by selecting other chromosomes or chromosomal subregions, using the main read coverage panel for navigation.
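The idea of estimating a statistic from a streamed random sample, and watching the estimate settle as records arrive, can be illustrated with a small sketch. It is not the authors' sampling scheme, which the letter defers to the Supplementary Note: here each record is kept independently with a fixed probability (Bernoulli sampling), and the fragment lengths are synthetic values drawn from a made-up normal distribution. The names `sample_records` and `running_mean` are hypothetical.

```python
import random


def sample_records(records, fraction, rng):
    """Yield each record independently with probability `fraction`."""
    for record in records:
        if rng.random() < fraction:
            yield record


def running_mean(values, report_every):
    """Yield (count, mean so far) after every `report_every` values."""
    total = 0.0
    count = 0
    for value in values:
        count += 1
        total += value
        if count % report_every == 0:
            yield count, total / count


rng = random.Random(0)
# Synthetic stand-in for the fragment lengths stored in a large BAM file.
fragment_lengths = [rng.gauss(350.0, 50.0) for _ in range(200_000)]
true_mean = sum(fragment_lengths) / len(fragment_lengths)

# Only about 5% of the records are passed on to the statistics step.
sampled = sample_records(fragment_lengths, fraction=0.05, rng=rng)
estimates = list(running_mean(sampled, report_every=1_000))

for count, estimate in estimates:
    print(f"{count:>6} sampled records: mean fragment length ~ {estimate:.1f}")
print(f"full-data mean for comparison: {true_mean:.1f}")
```

Both functions are generators, so the running estimate can be reported while sampled records are still arriving, in the spirit of the progressively converging distributions described above.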

This web app puts forward an interactive and intuitive genomic data analysis paradigm that is not achievable with existing systems, enabling users to analyze both local and remotely stored data, without tool installation or transmitting large data sets, and immediately see informative results. We are developing other real-time analysis applications: for example, to analyze multiple alignment files simultaneously using our sampling approach, and for interactive, complete analysis of genomic data in smaller genomic windows, such as in the region of a gene (demos at http://iobio.io/). We are also creating software libraries for third-party developers to build similar interactive web apps. Although large, whole-genome computation will remain essential for many tasks, we expect that web-based, visually driven, real-time tools will offer a powerful new analysis modality for bioinformatics experts and bench scientists alike.

References

1. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).
2. Barnett, D.W., Garrison, E.K., Quinlan, A.R., Strömberg, M.P. & Marth, G.T. Bioinformatics 27, 1691–1692 (2011).

Author information

Brian D'Astous
Present address: National Public Radio, Washington, DC, USA.

Authors and Affiliations

Department of Biology, Boston College, Chestnut Hill, Massachusetts, USA
Chase A Miller, Yi Qiao, Brian D'Astous & Gabor T Marth
Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA
Chase A Miller, Yi Qiao, Tonya DiSera & Gabor T Marth
Utah Science Technology and Research Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA
Chase A Miller, Yi Qiao, Tonya DiSera & Gabor T Marth

Corresponding author

Correspondence to Gabor T Marth.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Analysis Modalities

The region of the dataset being analyzed is depicted by the dark color. Global Analysis. The entire dataset is analyzed. This analysis typically takes hours to days to complete. Sampling Analysis. Global quantities of the entire dataset are estimated by random sampling. It is possible to complete this analysis in seconds. Regional Analysis. Users analyze a well-defined, small, continuous unit of data from a potentially large dataset, such as genomic sequence alignments in the region of a gene. This analysis can also be completed in seconds.

Supplementary Figure 2 Accuracy of sampling-based estimation of alignment file metrics

Sampling analysis of six metrics performed by bam.iobio.io is compared to exact values obtained via end-to-end analysis. (A) The percent error of the sampling analysis is shown for each metric as a function of the number of alignment records analyzed, for a single alignment file (top) and as an average over 100 alignment files from the 1000 Genomes Project (bottom). [REF] (B) Estimated values obtained via random sampling of 100,000 reads are plotted against the exact values for 100 alignment files.
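A comparison of this kind can be sketched for a single metric as follows. The code assumes that percent error means the absolute difference between the sampled estimate and the full-data value, relative to the full-data value; the figure legend does not spell out the formula. The duplicate flags are synthetic, the 8% rate is invented, and `percent_error` and `error_curve` are hypothetical names.

```python
import random


def percent_error(estimate, truth):
    """Absolute error of `estimate` relative to `truth`, in percent."""
    return 100.0 * abs(estimate - truth) / abs(truth)


def error_curve(flags, sample_sizes, rng):
    """Percent error of the estimated rate for increasing sample sizes.

    The first n items of a shuffled copy form a simple random sample
    (without replacement) of size n.
    """
    shuffled = list(flags)
    rng.shuffle(shuffled)
    truth = sum(shuffled) / len(shuffled)
    curve = []
    for n in sample_sizes:
        estimate = sum(shuffled[:n]) / n
        curve.append((n, percent_error(estimate, truth)))
    return curve


rng = random.Random(42)
# Synthetic per-read duplicate flags (1 = duplicate); the 8% rate is made up.
duplicate_flags = [1 if rng.random() < 0.08 else 0 for _ in range(100_000)]

for n, err in error_curve(duplicate_flags, [100, 1_000, 10_000, 100_000], rng):
    print(f"{n:>7} records analyzed: percent error {err:.2f}%")
```

When the sample size equals the full data set the error is zero by construction; for smaller samples it varies from run to run, which is why an average over many files, as in panel A (bottom), gives a smoother curve than a single file.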

Supplementary information

Supplementary Figures and Tables

Supplementary Figures 1 and 2, Supplementary Tables 1 and 2 and Supplementary Note (PDF 1909 kb)

About this article

Cite this article

Miller, C., Qiao, Y., DiSera, T. et al. bam.iobio: a web-based, real-time, sequence alignment file inspector. Nat Methods 11, 1189 (2014). https://doi.org/10.1038/nmeth.3174

Issue Date: December 2014
DOI: https://doi.org/10.1038/nmeth.3174
