Figure 1: De Bruijn graph of DNA sequence assembly.
From Computational solutions for omics data
 Bonnie Berger^{1, 2, 4},
 Jian Peng^{2, 4},
 Mona Singh^{3, 4}
 Journal name: Nature Reviews Genetics
 Volume: 14
 Pages: 333–346
 Year published:
 DOI: 10.1038/nrg3433
Additional data
Primary authors
All authors contributed equally to this work.
 Bonnie Berger,
 Jian Peng &
 Mona Singh
Affiliations

Department of Mathematics and Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.
 Bonnie Berger

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.
 Bonnie Berger &
 Jian Peng

Department of Computer Science and the Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08542, USA.
 Mona Singh
Competing interests statement
The authors declare no competing interests.
Author details
Bonnie Berger
Bonnie Berger is a professor of mathematics and computer science at the Massachusetts Institute of Technology (MIT), as well as at the Computer Science and Artificial Intelligence Laboratory and the Broad Institute of MIT and Harvard, all in Cambridge, Massachusetts, USA. She received her Ph.D. in computer science and did postdoctoral work at MIT. Her recent work focuses on designing algorithms for genomics and biological networks to gain insights from advances in automated data collection and the subsequent large data sets drawn from them. She is a fellow of the International Society for Computational Biology and the Association for Computing Machinery, as well as a member of the American Academy of Arts and Sciences.
Jian Peng
Jian Peng is a postdoctoral researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, USA. He received his Ph.D. in computer science from Toyota Technological Institute, Chicago, Illinois, USA. His research interests include development of computational methods for protein structure modelling, predicting and designing specificity in protein–protein and protein–RNA interactions, and genomics in translational medicine.
Mona Singh
Mona Singh is on the faculty at Princeton University, New Jersey, USA, where she is a professor in the Department of Computer Science and in the Lewis–Sigler Institute for Integrative Genomics. She received her Ph.D. in computer science from the Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, USA and did postdoctoral work at the Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, USA. Her group develops computational methods for analysing cellular networks, predicting specificity in protein interactions, and functionally characterizing proteins using sequence, structure and network approaches.
Glossary
 Cloud computing
The use of computing resources distributed over the Internet to store, manage and analyse data, rather than doing so on a local server or personal computer.
 Parallel computing
A form of computation that allows numerous calculations to be carried out simultaneously, thereby accelerating computation. On the basis of this principle, many large-scale computational tasks can be divided into smaller ones and solved on multiple machines concurrently.
 Machine learning techniques
Empirical data are taken as input, the relationship among the data is mathematically or statistically modelled, and patterns or predictions are generated. Supervised learning algorithms infer a function from labelled data features and predict labels on future input; unsupervised learning algorithms model the patterns or the distribution of a given unlabelled data set.
 Parallel dynamic programming
A technique that splits a large dynamic programming problem (one usually solved by filling a table so that redundant calculation is avoided) into a number of subproblems and computes these subproblems in parallel using multiple central processing units (CPUs). The computing speedup scales almost linearly with the number of CPUs.
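As a minimal sketch of this idea, the example below fills a longest-common-subsequence table one anti-diagonal at a time. Cells on the same anti-diagonal do not depend on each other, so they can be computed concurrently. The function name `lcs_length_wavefront` and the `workers` parameter are hypothetical, and the thread pool only illustrates the dependency structure: in CPython, threads are unlikely to give the near-linear speedup described above, which would need separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def lcs_length_wavefront(a, b, workers=4):
    """Length of the longest common subsequence of a and b.

    The table is filled anti-diagonal by anti-diagonal; all cells on one
    anti-diagonal are independent and are handed to the pool together.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]

    def fill(cell):
        i, j = cell
        if a[i - 1] == b[j - 1]:
            table[i][j] = table[i - 1][j - 1] + 1
        else:
            table[i][j] = max(table[i - 1][j], table[i][j - 1])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for d in range(2, n + m + 1):  # d = i + j indexes the anti-diagonal
            cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
            # Wait for the whole anti-diagonal before moving to the next one.
            list(pool.map(fill, cells))
    return table[n][m]


print(lcs_length_wavefront("AGGTAB", "GXTXAYB"))
```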
 Multicore computer processing units
(Multicore CPUs). Single computing processors with two or more independent computing units (called cores). Running multiple instructions on multiple cores at the same time can increase the overall speed of programs.
 Cache-oblivious algorithm
Takes advantage of the cache system of the central processing unit (that is, the local memory of frequently accessed data) to avoid expensive memory access operations and thus to improve efficiency; the intrinsic design of these algorithms does not require computer programs to be tuned for machines with different cache systems.
 Linear mixed model
A statistical model that models the observed effects from multiple different hidden factors; the effects are additively mixed according to the proportions of their corresponding factors.
 Matrix factorization
A method for decomposing a matrix into the product of two matrices. It can be applied to identify individual factors involved in a mixed observation.
 Differential geometry
A mathematical discipline for studying geometric objects, such as curves and surfaces, using the techniques of differential and integral calculus.
 Linear programming
A mathematical program for the optimization of a linear objective function, subject to linear constraints. Such functions capture the linear relationship between variables for the problem being optimized.
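One common way to write such a program is shown below. The symbols are assumptions introduced here for illustration and do not come from the article: $x$ is the vector of variables, $c$ holds the objective coefficients, and $A$ and $b$ encode the linear constraints.

```latex
\begin{aligned}
\text{maximize}\quad   & c^{\top} x \\
\text{subject to}\quad & A x \le b, \\
                       & x \ge 0 .
\end{aligned}
```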
 Principal component analysis
A tool for transforming a set of observations with correlated variables into a set of linearly uncorrelated variables called principal components, such that the first principal component accounts for the largest variability of the data.
 Copy number variant
(CNV). Corresponds to an abnormal number of copies of one or more segments in the genome. CNVs can be caused by structural rearrangements of the genome such as deletions, duplications, inversions and translocations.
 Bayesian network
A statistical model that describes the distribution of a set of random variables by a directed acyclic graph that represents the relationship among the random variables. For example, in a Bayesian network for a regulatory relationship for a set of genes, each variable represents a gene and each directed edge denotes either activating or repressing regulation between two genes.
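In the standard formulation, the directed acyclic graph determines how the joint distribution factorizes. The notation below is an assumption introduced for illustration: $X_1, \dots, X_n$ are the random variables (for example, genes) and $\mathrm{Pa}(X_i)$ denotes the parents of $X_i$ in the graph.

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```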
 Steiner tree problem
Formulated on a network to find a minimum-length subnetwork that interconnects a set of seed nodes. Any two seed nodes may be connected by an edge or a path through other nodes.
 Random walk
A mathematical formulation of a number of successive random steps on a graph. It has been widely used to explain stochastic observations, such as diffusion in biological networks.
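A minimal sketch of how such a walk spreads over a graph is given below. Rather than sampling individual steps, it tracks the probability of the walker being at each node after a fixed number of steps. The function name `walk_distribution` and the toy path graph are hypothetical, and the sketch assumes an unweighted graph in which every node has at least one neighbour.

```python
def walk_distribution(adjacency, start, steps):
    """Probability of a random walker being at each node after `steps` steps.

    `adjacency` maps each node to a list of its neighbours; at every step the
    walker moves to one of the current node's neighbours uniformly at random.
    """
    prob = {node: 0.0 for node in adjacency}
    prob[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in adjacency}
        for node, p in prob.items():
            share = p / len(adjacency[node])
            for neighbour in adjacency[node]:
                nxt[neighbour] += share
        prob = nxt
    return prob


# Toy path graph a - b - c (an illustrative assumption, not data from the article).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(walk_distribution(graph, "a", 2))
```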
 Eigenvalue problem
The aim is to find, for a given square matrix, a nonzero vector (that is, an eigenvector) whose product with the matrix differs from the vector itself only by a scalar factor (the eigenvalue).
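One simple numerical approach is power iteration, sketched below: the matrix is applied to a vector repeatedly and the result is renormalized until it settles along the dominant eigenvector. This is an illustrative choice of method, not one named in the glossary. The function name `power_iteration` and the starting vector are hypothetical, and the sketch assumes the matrix has a single eigenvalue of largest magnitude and that the starting vector is not orthogonal to its eigenvector.

```python
import math


def power_iteration(matrix, iterations=200):
    """Approximate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(matrix)

    def multiply(vec):
        return [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]

    vec = [float(i + 1) for i in range(n)]  # arbitrary nonzero starting vector
    for _ in range(iterations):
        nxt = multiply(vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]

    # Rayleigh quotient: for a unit vector v, the estimate is v . (M v).
    product = multiply(vec)
    eigenvalue = sum(product[i] * vec[i] for i in range(n))
    return eigenvalue, vec


value, vector = power_iteration([[2.0, 1.0], [1.0, 2.0]])
print(value, vector)
```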
 Set cover
Given a set of elements and a collection of its subsets, the goal is to find the minimum number of subsets whose union covers all the elements.
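Finding the true minimum is computationally hard in general, so a common heuristic is the greedy rule sketched below: repeatedly pick the subset that covers the most still-uncovered elements. This returns a valid cover but not necessarily a minimum one. The function name `greedy_set_cover` and the toy input are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Greedy approximation to set cover.

    `subsets` maps a label to a set of elements. Returns the labels chosen,
    in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset that covers the most elements not yet covered.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the subsets do not cover every element")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Toy input (an illustrative assumption, not data from the article).
example = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}
print(greedy_set_cover({1, 2, 3, 4, 5}, example))
```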