Flux balance analysis is a mathematical approach for analyzing the flow of metabolites through a metabolic network. This primer covers the theoretical basis of the approach, several practical examples and a software toolbox for performing the calculations.
Advances in technology across all areas of science have ushered in an era of big data, providing researchers with unprecedented opportunities to understand how biological systems function and interact. Scientists are now faced with the challenge of developing sophisticated computational tools capable of unravelling these data and uncovering important biological signals. Computational biology will continue to play a key role in facilitating multi-disciplinary collaborations, encouraging data sharing and establishing experimental and analytical standards in the life sciences.
Support vector machines (SVMs) are becoming popular in a wide variety of biological applications. But what exactly are SVMs and how do they work? And what are their most promising applications in the life sciences?
Principal component analysis is often incorporated into genome-wide expression studies, but what is it and how can it be used to explore high-dimensional data?
Clustering is often one of the first steps in gene expression analysis. How do clustering algorithms work, which ones should we use and what can we expect from them?
Mapping the vast quantities of short sequence fragments produced by next-generation sequencing platforms is a challenge. What programs are available and how do they work?
Statistical models called hidden Markov models are a recurring theme in computational biology. What are hidden Markov models, and why are they so useful for so many different problems?
When prioritizing hits from a high-throughput experiment, it is important to correct for random events that falsely appear significant. How is this done and what methods should be used?
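As one illustration of such a correction, the sketch below applies the Benjamini-Hochberg step-up procedure, which controls the false discovery rate. The choice of this particular method, the function name `benjamini_hochberg` and the example p-values are assumptions made for illustration; the primer itself may discuss other methods.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per input p-value: True if that hypothesis is rejected.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure;
    the function name and default alpha are assumptions, not from the primer.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values: three fall below 0.05, but only one survives correction.
example_p = [0.04, 0.001, 0.03, 0.5]
example_hits = benjamini_hochberg(example_p, alpha=0.05)
```

In this made-up example, an uncorrected threshold of 0.05 would call three hits, whereas the corrected procedure keeps only the strongest one.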
Many sequence alignment programs use the BLOSUM62 score matrix to score pairs of aligned residues. Where did BLOSUM62 come from?
Sequence motifs are becoming increasingly important in the analysis of gene regulation. How do we define sequence motifs, and why should we use sequence logos instead of consensus sequences to represent them? Do they have any relation with binding affinity? How do we search for new instances of a motif in this sea of DNA?
A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling a contiguous genome from billions of short sequencing reads into a tractable computational problem.
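A minimal sketch of the idea, assuming error-free reads and a hypothetical helper named `de_bruijn_graph`: each distinct k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. Real assemblers add error correction and graph simplification, which this toy version omits.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph as a dict: (k-1)-mer -> list of successor (k-1)-mers.

    Illustrative only: assumes error-free reads and keeps one edge per distinct k-mer.
    """
    # Collect every distinct k-mer that occurs in any read.
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    # Each k-mer contributes one edge: prefix node -> suffix node.
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads drawn from the sequence ACGTA.
toy_graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
```

Walking the edges AC, CG, GT, TA in order spells out ACGTA, which is how a path through the graph corresponds to an assembled sequence.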
The expectation maximization algorithm arises in many computational biology applications that involve probabilistic models. What is it good for, and how does it work?
Artificial neural networks have been applied to problems ranging from speech recognition to prediction of protein secondary structure, classification of cancers and gene prediction. How do they work and what might they be good for?
Programs such as MFOLD and ViennaRNA are widely used to predict RNA secondary structures. How do these algorithms work? Why can't they predict RNA pseudoknots? How accurate are they, and will they get better?
Instrumentation aside, algorithms for matching mass spectra to proteins are at the heart of shotgun proteomics. How do these algorithms work, what can we expect of them and why is it so difficult to find protein modifications?
Bayesian networks are increasingly important for integrating biological data and for inferring cellular networks and pathways. What are Bayesian networks and how are they used for inference?
Decision trees have been applied to problems such as assigning protein function and predicting splice sites. How do these classifiers work, what types of problems can they solve and what are their advantages over alternatives?
How can we computationally extract an unknown motif from a set of target sequences? What are the principles behind the major motif discovery algorithms? Which of these should we use, and how do we know we've found a 'real' motif?
Sequence alignment methods often use something called a 'dynamic programming' algorithm. What is dynamic programming and how does it work?
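To make the idea concrete, here is a small sketch of a global alignment score computed by dynamic programming, in the style of Needleman-Wunsch. The function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not values taken from the primer.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b under a simple linear gap penalty.

    Illustrative sketch; the scoring parameters are arbitrary assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell reuses the three already-solved neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


# Three matches and one gap under the assumed scoring scheme.
toy_score = global_alignment_score("ACGT", "AGT")
```

The key point is that each cell is computed once from smaller subproblems, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.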
There seem to be a lot of computational biology papers with 'Bayesian' in their titles these days. What's distinctive about 'Bayesian' methods?
Networks in biology can appear complex and difficult to decipher. Merico et al. illustrate how to interpret biological networks with the help of frequently used visualization and analysis patterns.
Only a subset of single-nucleotide polymorphisms (SNPs) can be genotyped in genome-wide association studies. Imputation methods can infer the alleles of 'hidden' variants and use those inferences to test the hidden variants for association.
Hierarchical models provide reliable statistical estimates for data sets from high-throughput experiments where measurements vastly outnumber experimental samples.
Computational prediction of gene structure is crucial for interpreting genomic sequences. But how do the algorithms involved work and how accurate are they?
How can genome browsers help researchers to infer biological knowledge from data that might be misleading?
Only a subset of genetic variants can be examined in genome-wide surveys for genetic risk factors. How can a fixed set of markers account for the entire genome by acting as proxies for neighboring associations?
The functional composition of microbial community samples from several environments is predicted based on 16S ribosomal RNA gene sequencing data.
New instruments can measure the presence of >30 molecular markers for massive numbers of single cells, but data analysis algorithms have lagged behind. Qiu et al. describe an approach called SPADE for recovering cellular hierarchies from mass or flow cytometry data.
viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia
A new tool to visualize high-dimensional single-cell data, when integrated with mass cytometry, reveals phenotypic heterogeneity of human leukemia.
Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data
When applied at large scale to electronic medical record data, the PheWAS approach replicates GWAS associations and reveals potentially new pleiotropic associations.
An open competition to predict the progression of amyotrophic lateral sclerosis (ALS, also known as Lou Gehrig's disease) from the largest database of ALS clinical trial data yields potential new biomarkers and algorithms that outperform human clinicians.
The binding specificities of RNA- and DNA-binding proteins are determined from experimental data using a ‘deep learning’ approach.
MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification
Cox and Mann describe MaxQuant, a suite of algorithms for the analysis of high-resolution mass spectrometry data. The approach achieves substantial improvements in the accuracy of mass measurements and the peptide identification rate.
Reconstructing full-length transcripts from high-throughput RNA sequencing data is difficult without a reference genome sequence. Grabherr et al. describe Trinity, an algorithm for assembling full-length transcripts from short reads without first mapping the reads to a genome sequence.
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
RNA-Seq enables rapid sequencing of total cellular RNA and should allow the reconstruction of spliced transcripts in a cell population. Trapnell et al. achieve this and transcript quantification using only paired-end RNA-Seq data and an unannotated genome sequence, and apply the approach to characterize isoform switching over a developmental time course.
Metabolic network modeling in multicellular organisms is confounded by the existence of multiple tissues with distinct metabolic functions. By integrating a genome-scale metabolic network with tissue-specific gene- and protein-expression data, Shlomi et al. adapt constraint-based approaches used for microorganisms to predicting metabolism in ten human tissues. Their computational approach should facilitate interpretation of expression data in the context of metabolic disorders.
Single-molecule sequencing technologies can produce multikilobase-long reads, which are more useful than short reads for assembling genomes and transcriptomes, but their error rates are too high. Koren et al. correct long reads from a PacBio instrument using high-fidelity, short reads from complementary technologies, facilitating assembly of previously intractable sequences.
From the archives
The Bowtie 2 software achieves fast, sensitive, accurate and memory-efficient gapped alignment of sequencing reads using the full-text minute index and hardware-accelerated dynamic programming algorithms.
This Analysis reports a comparison of current software packages for single-molecule localization in localization-based super-resolution imaging. Performance of the participating software on synthetic, biologically inspired ground-truth data was assessed by multiple criteria.
The computational workflow of DIA-Umpire allows untargeted peptide identification directly from DIA (data-independent acquisition) proteomics data without dependence on a spectral library for data extraction.
DeepSEA, a deep-learning algorithm trained on large-scale chromatin-profiling data, predicts chromatin effects from sequence alone, has single-nucleotide sensitivity and can predict effects of noncoding variants.
Despite the need for new psychoactive drugs, there are few robust approaches for discovering novel neuroactive molecules. Development of a behavior-based high-throughput screen in zebrafish led to the discovery of molecules with neurological effects. Translating the complex behavioral phenotypes elicited by compounds into a simple barcode enabled identification of their mechanism of action.
Eukaryotic genomes do not exist in vivo as naked DNA, but in complexes known as chromatin. Chromatin contains nucleosomes, short stretches of DNA tightly wrapped around a histone protein core, which exclude most DNA binding proteins and so act as repressors. A combined computational and experimental approach has been used to determine DNA sequence preferences of nucleosomes and to predict genome-wide nucleosome organization. The yeast genome encodes an intrinsic nucleosome organization that explains about half of the in vivo nucleosome positions. Highly conserved across eukaryotes, the code directs transcription factors to their binding sites and facilitates many other specific chromosome functions. An accompanying News and Views piece discusses the role of DNA sequence and other regulators in nucleosome positioning. The cover graphic represents a stretch of chromatin including several nucleosomes.
A natural polypeptide chain can fold into a native protein in microseconds, but predicting such stable three-dimensional structure from any given amino-acid sequence and first physical principles remains a formidable computational challenge. Aiming to recruit human visual and strategic powers to the task, Seth Cooper, David Baker and colleagues turned their 'Rosetta' structure-prediction algorithm into an online multiplayer game called Foldit, in which thousands of non-scientists competed and collaborated to produce a rich set of new algorithms and search strategies for protein structure refinement. The work shows that even computationally complex scientific problems can be effectively crowd-sourced using interactive multiplayer games.
The analysis of protein-interaction networks is essential to an understanding of the regulatory processes in a living cell. Many methods have been developed with a view to predicting protein–protein interactions (PPIs) at a genome-wide level, although the differences obtained using these approaches suggest that there are still factors unaccounted for. Barry Honig and colleagues have developed a new way of predicting PPIs that is based on the proteins' three-dimensional structures and functional data. Tests of several predictions of the new algorithm, known as PREPPI, confirm the accuracy of the results.
Our ability to multitask and our capacity for cognitive control decline linearly as we age. A new study shows that cognitive training can help repair this decline. In older adults aged between 60 and 85 who trained at home by playing NeuroRacer, a custom-designed 3D video game, both multitasking and cognitive control improved, with effects persisting for six months. The benefits of this training extended to untrained cognitive functions such as sustained attention and working memory. These findings suggest that the ageing brain may be more robustly plastic than previously thought, allowing for cognitive enhancement using appropriately designed strategies.
Mark DePristo and colleagues report an analytical framework to discover and genotype variation using whole exome and genome resequencing data from next-generation sequencing technologies. They apply these methods to low-pass population sequencing data from the 1000 Genomes Project.
Eleazar Eskin and colleagues report a variance component model for correcting for sample structure in association studies. The EMMAX program is publicly available and may be used for analysis of genome-wide association study datasets.
Owen Rackham, Jose Polo, Julian Gough and colleagues present a method, Mogrify, for predicting sets of transcription factors that can induce transdifferentiation between cell types. They show that Mogrify is able to predict known factors for published cell conversions and experimentally validate factors for two new conversions.
Identifying molecular predictors of effective vaccination is an important clinical and technical goal. Pulendran and colleagues use a systems biology approach to study human responses to vaccination against influenza and determine the correlates of immunogenicity.
The authors examined neuronal responses in V1 and V2 to synthetic texture stimuli that replicate higher-order statistical dependencies found in natural images. V2, but not V1, responded differentially to these textures, in both macaque (single neurons) and human (fMRI). Human detection of naturalistic structure in the same images was predicted by V2 responses, suggesting a role for V2 in representing natural image structure.
The authors develop a new method to mine genomic cancer data to uncover complex indels. These simultaneous deletions and insertions have been overlooked by previous sequencing data analysis methods, and the Pindel-C algorithm uncovers new information about their potential contribution to tumorigenesis.
Availability of computing power can limit computational analysis of large genetic and genomic datasets. Here, Canela-Xandri et al. describe a software tool called DISSECT that is capable of analyzing large-scale genetic data by distributing the work across thousands of networked computers.