All authors contributed equally to this work.
- Bonnie Berger,
- Jian Peng &
- Mona Singh
Department of Mathematics and Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.
- Bonnie Berger
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.
- Bonnie Berger &
- Jian Peng
Department of Computer Science and the Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08542, USA.
- Mona Singh
Competing interests statement
The authors declare no competing interests.
Bonnie Berger is a professor of mathematics and computer science at the Massachusetts Institute of Technology (MIT), as well as at the Computer Science and Artificial Intelligence Laboratory and the Broad Institute of MIT and Harvard, all in Cambridge, Massachusetts, USA. She received her Ph.D. in computer science and did postdoctoral work at MIT. Her recent work focuses on designing algorithms for genomics and biological networks to gain insights from advances in automated data collection and the large data sets that result from them. She is a fellow of the International Society for Computational Biology and of the Association for Computing Machinery, as well as a member of the American Academy of Arts and Sciences.
Jian Peng is a postdoctoral researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, USA. He received his Ph.D. in computer science from Toyota Technological Institute, Chicago, Illinois, USA. His research interests include development of computational methods for protein structure modelling, predicting and designing specificity in protein–protein and protein–RNA interactions, and genomics in translational medicine.
Mona Singh is on the faculty at Princeton University, New Jersey, USA, where she is a professor in the Department of Computer Science and in the Lewis–Sigler Institute for Integrative Genomics. She received her Ph.D. in computer science from the Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, USA and did postdoctoral work at the Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, USA. Her group develops computational methods for analysing cellular networks, predicting specificity in protein interactions, and functionally characterizing proteins using sequence, structure and network approaches.
- Cloud computing
The use of computing resources distributed over the Internet to store, manage and analyse data, rather than doing so on a local server or personal computer.
- Parallel computing
A form of computation that allows numerous calculations to be carried out simultaneously, thereby accelerating computation. On the basis of this principle, many large-scale computational tasks can then be divided into smaller ones and solved on multiple machines concurrently.
- Machine learning techniques
Methods that take empirical data as input, model the relationships among the data mathematically or statistically, and generate patterns or predictions. Supervised learning algorithms infer a function from labelled data features and predict labels on future input; unsupervised learning algorithms model the patterns or the distribution of a given unlabelled data set.
- Parallel dynamic programming
A technique that splits a large dynamic programming problem (usually solved by filling a table so that redundant calculation is avoided) into a number of subproblems and computes the subproblems in parallel using multiple central processing units (CPUs). The computing speed-up scales almost linearly with the number of CPUs.
- Multicore central processing units
(Multicore CPUs). Single computing processors with two or more independent computing units (called cores). Running multiple instructions on multiple cores at the same time can increase the overall speed of programs.
- Cache-oblivious algorithm
An algorithm that takes advantage of the cache system of the central processing unit (that is, the local memory for frequently accessed data) to avoid expensive memory access operations and thus to improve efficiency; the intrinsic design of these algorithms does not require computer programs to be tuned for machines with different cache systems.
- Linear mixed model
A statistical model that models the observed effects from multiple different hidden factors; the effects are additively mixed according to the proportions of their corresponding factors.
- Matrix factorization
A method for decomposing a matrix into the product of two matrices. It can be applied to identify individual factors involved in a mixed observation.
- Differential geometry
A mathematical discipline for studying geometric objects, such as curves and surfaces, using the techniques of differential and integral calculus.
- Linear programming
A mathematical program for the optimization of a linear objective function, subject to linear constraints. Such functions capture the linear relationship between variables for the problem being optimized.
- Principal component analysis
A tool for transforming a set of observations with correlated variables into a set of linearly uncorrelated variables called principal components, such that the first principal component accounts for the largest variability of the data.
- Copy number variant
(CNV). An abnormal number of copies of one or more segments of the genome. CNVs can be caused by structural rearrangements of the genome such as deletions, duplications, inversions and translocations.
- Bayesian network
A statistical model that describes the distribution of a set of random variables by a directed acyclic graph that represents the relationships among the random variables. For example, in a Bayesian network that models regulatory relationships among a set of genes, each variable represents a gene and each directed edge denotes either activating or repressing regulation between two genes.
- Steiner tree problem
A problem formulated on a network in which the goal is to find a minimum-length subnetwork that interconnects a set of seed nodes. Any two seed nodes may be connected by an edge or by a path through other nodes.
- Random walk
A mathematical formulation of a number of successive random steps on a graph. It has been widely used to explain stochastic observations, such as diffusion in biological networks.
- Eigenvalue problem
The aim is to find, for a given square matrix, a non-zero vector (that is, an eigenvector) such that multiplying the matrix by the vector gives the same vector scaled by a scalar factor (that is, the eigenvalue).
- Set cover
Given a set of elements and a collection of subsets of those elements, the goal is to find the minimum number of subsets whose union covers all the elements.
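
As a rough illustration of the set cover entry, the Python sketch below applies the standard greedy heuristic: repeatedly pick the subset that covers the most elements not yet covered. This is an approximation and is not guaranteed to return the minimum number of subsets. The function name `greedy_set_cover` and the toy gene and pathway data are hypothetical and are not taken from the article.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(
    universe: Set[Hashable], subsets: Dict[str, Set[Hashable]]
) -> List[str]:
    """Return the names of subsets chosen greedily until the universe is covered.

    Greedy heuristic only: the result is a valid cover, but it may use more
    subsets than the true minimum.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Pick the subset that covers the most still-uncovered elements.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(subsets), key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            raise ValueError("the subsets do not cover every element")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical toy data: genes are the elements, pathways are the subsets.
genes = {"g1", "g2", "g3", "g4", "g5"}
pathways = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g4"},
    "C": {"g3", "g4", "g5"},
    "D": {"g5"},
}
cover = greedy_set_cover(genes, pathways)
print(cover)
```

On this toy input the greedy choice happens to be optimal, because no single pathway contains all five genes. In general, finding the exact minimum is computationally hard, which is why heuristics of this kind are commonly used.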