The sequence of the human genome has provided a complete 'parts list' for constructing cellular functions. Yet working out how all the pieces fit together is a formidable task. Of all the tools scientists have at their fingertips, DNA microarrays are perhaps the most popular. These tiny chips can monitor the expression of thousands of genes at once and help group them into functional categories, for example, genes that are expressed to a greater or lesser degree in response to a drug or at different times during development. They also allow rapid genotyping of DNA sequence variations among individuals (see Box 1, 'SNP CHIP Software is on Its Way').

BioPathway Explorer displays experimental and simulation data. (Courtesy of BIOSoftware Systems, Inc.)

During the past decade, large-scale expression profiling experiments have generated a deluge of data. At the same time, commercial software packages have come onto the market to help extract meaning from these data. These products can pick out relevant subsets of genes with various analytical methods, scout the literature and databases to find commonalities among gene products, and draw interactive graphs and diagrams that can be queried with a click of the mouse.

The promise of pathways

Identifying hundreds of genes whose expression is markedly higher in one sample than in another provides a large amount of potentially valuable data; the trick is to home in on the genes that are relevant to the questions being asked. One way to do that is to find a subset of genes that are functionally connected through common pathways.

Justin Lamb's group at the Broad Institute in Cambridge, Massachusetts, uses an approach that he refers to as “functional annotation”. His method, based on the Kolmogorov-Smirnov statistical test, first identifies a gene expression signature that lights up when a pathway is active, say when an oncogene is ectopically expressed inside a cell. Next, it mines sample microarray data sets to see if the gene signature matches a pattern of genes differentially expressed in, for example, a particular tumor. “If we find a match, it will tell us that the oncogene is likely to be involved in the tumor,” says Lamb, whose work has uncovered functional relationships between different cancer genes (ref. 1).
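To make the idea of a Kolmogorov-Smirnov-style signature match concrete, here is a minimal sketch in Python. It is not the Broad group's implementation; the function name, the unweighted running sum and the toy gene labels are all assumptions made for illustration. The score approaches +1 when signature genes cluster at the top of a ranked list and -1 when they cluster at the bottom.

```python
def ks_enrichment_score(ranked_genes, signature):
    """Signed maximum deviation of a running sum over a ranked gene list.

    ranked_genes: genes ordered from most up-regulated to most down-regulated.
    signature: genes expected to 'light up' when the pathway is active.
    Hypothetical helper: an unweighted Kolmogorov-Smirnov-like statistic.
    """
    signature = set(signature)
    hits = [gene in signature for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need both signature and non-signature genes in the list")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a signature gene, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy example with made-up gene labels.
ranked = ["A", "B", "C", "D", "E", "F"]
print(ks_enrichment_score(ranked, {"A", "B"}))  # signature at the top of the list
print(ks_enrichment_score(ranked, {"E", "F"}))  # signature at the bottom of the list
```

In practice the significance of such a score would be judged against permuted gene lists, which this sketch leaves out.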

Lamb's group is now conducting proof-of-concept experiments to determine whether it would be feasible to derive a gene signature for every human gene. “If we need to profile every gene in 50 cell lines because their effects are exquisitely context-dependent, it is an impossible task. But if we only have to study two or three cell lines, it is doable,” he explains.

From signals to gene lists

Lamb's work is but one example of the ways in which microarray data can be mined to extract biological meaning. Whatever the approach, the first step in microarray analysis is to obtain a list of genes that are differentially expressed. Several suppliers of microarray hardware, such as Affymetrix, Agilent Technologies and Applied Biosystems, provide software that allows the user to go easily from fluorescent signals to a list of genes.
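As a rough illustration of the step from signals to a gene list, the sketch below filters genes by log2 fold change between two conditions. It is a deliberate simplification, not what any vendor package does: real software also normalizes the arrays and applies statistical tests across replicates. The function name, the threshold and the intensity values are assumptions.

```python
import math


def differential_genes(control, treated, min_abs_log2_fc=1.0):
    """Return (gene, log2 fold change) pairs passing a fold-change cutoff.

    control, treated: dicts mapping gene ID to a positive signal intensity.
    Hypothetical helper; assumes the signals are already normalized.
    """
    selected = []
    for gene in control.keys() & treated.keys():
        c, t = control[gene], treated[gene]
        if c <= 0 or t <= 0:
            continue  # a log ratio is undefined for non-positive signals
        log2_fc = math.log2(t / c)
        if abs(log2_fc) >= min_abs_log2_fc:
            selected.append((gene, log2_fc))
    # Largest changes first; ties broken by gene ID for a stable order.
    return sorted(selected, key=lambda item: (-abs(item[1]), item[0]))


# Made-up intensities for three genes.
control = {"g1": 100.0, "g2": 100.0, "g3": 100.0}
treated = {"g1": 400.0, "g2": 110.0, "g3": 25.0}
print(differential_genes(control, treated))
```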

Applied Biosystems' 1700 Chemiluminescent Microarray Analyzer, launched in April, was designed to “show exactly what you are measuring on a microarray,” says Clark Mason, senior product line manager for gene expression arrays. The system carries integrated software for image analysis, quantification and normalization, as well as an Oracle database of annotations, including gene names, cross-referenced IDs, gene ontology and Celera's Panther Protein Classification System. The end product is a “meaningful list of genes that is rightfully annotated,” says Mason. The software allows seamless and customizable integration with third-party software, such as Spotfire's DecisionSite and Silicon Genetics' GeneSpring, for more complex data analysis.

In a similar vein, the NetAffx Analysis Center, an online resource created by Affymetrix, allows researchers to correlate results from their GeneChip probe array experiments with biological information from both Affymetrix's own and public databases.

Vector PathBlazer finds connections between two groups of genes that are differentially expressed in a microarray experiment. (Courtesy of Invitrogen.)

Data analysis and visualization

A large number of companies offer software for analyzing microarray experiments. “In the last 2 to 3 years there have been great improvements in the accuracy of DNA chips, and they have become cheaper. Most researchers can now obtain good, reliable data,” says Eric Olson, director of science at VizXLabs. “It has become possible to develop products with built-in access to standard statistical tools for analyzing these data.”

Many software packages provide users with a variety of analytical techniques (including time series and clustering analyses), gene and probe annotations by linking to internal and external databases, and tools for visualizing data and preparing figures. Some products have an even wider range of capabilities, such as integrating data from a wide variety of sources, adding a user's own statistical algorithms, and providing data and project management tools. In most cases, the software will guide a researcher from a list of genes to a first pass at a cellular pathway.

VizXLabs caters to “biologists working at the bench,” says Olson. “We spent a lot of time trying to build in processes for the kind of things that biologists would want to do.” Its GeneSifter product is entirely Web-based, avoiding the need for high-power hardware in-house. It uses pull-down menus from which the user can choose what kind of statistical tests to use and can set the P values and other parameters without having to do any of the number crunching. The result of a typical analysis is a list of annotated genes that are differentially expressed, “but you can also start to ask some questions about the list,” says Olson. “[GeneSifter] can tell you whether the genes are involved in cell cycle or apoptosis, or, in a time series experiment, it can find a subset of genes that are expressed at a later time.”

A typical gene expression analysis workflow in DecisionSite. (Courtesy of Spotfire.)

Although software products like GeneSifter are built with the nonspecialist in mind, they often have the flexibility to satisfy more sophisticated customers. For example, Silicon Genetics' flagship product, GeneSpring, provides “an interface to connect to R Bioconductor, a popular software product for conducting your own custom analysis,” says Kevin Wandryk, vice-president for marketing and business development. Similarly, although these software products are designed to analyze microarray data, many of them also handle other kinds of results. “As long as you have IDs and expression information, you can analyze any kind of data,” says Wandryk. “Some of our more savvy customers use GeneSpring to analyze data off their proteomics experiments.” (See Box 2, 'Proteomics—The Next Challenge.')

For researchers conducting large-scale analyses that incorporate different kinds of data and need more customization, Spotfire provides DecisionSite, a product that can analyze data from microarray experiments, high-throughput screens, chemical reactions and proteomics experiments. “Configurable workflows guide users through the analysis process,” says Matt Anstett, market manager for life science. With a click of a button, any data in a spreadsheet can be brought into DecisionSite and visualized as graphs or diagrams that can be quickly manipulated to ask various questions. “One of the features our customers like is how quickly you can answer questions,” says Ian Reid, vice-president of applications marketing. Visualizations can be annotated and shared among members of a team, and even emailed to distant collaborators, facilitating interactive discussions that can be archived in the 'library'.

Likewise, Inforsense's Knowledge Discovery Environment integrates different kinds of data for a range of analytical applications. “Extensive normalization and analysis tools are complemented by powerful integrated visualization capabilities,” says Stephen Misener. “You can, for example, use brushing and linking between different clustering visualizations to compare one clustering method with another.” Comparing patterns obtained with different statistical and analytical methods may help a scientist home in on a subset of genes that is most relevant to the process being studied. “If you are looking for distinguishing features of our software it would be the openness and flexibility of our platform,” says Misener. “You can easily select the components and data sources you need, and even add other algorithms or external programs.”

Agilent Technologies Inc. provides what it calls “bridging informatics,” in other words, products that integrate different kinds of data and analyses. “We want our customer to use one set of products to produce results and formulate hypotheses that can then be analyzed using another set of tools,” says Francois Mandeville, business manager of informatics solutions. In addition to marketing the Rosetta Resolver and Rosetta Luminator gene expression data analysis systems (both developed by Rosetta Biosoftware), Agilent provides Synapsia, “a project assessment tool that enables different team members to exchange and analyze information,” says Mandeville. Synapsia imports and stores different data types, correlates gene to protein expression, connects to third-party analysis software such as DecisionSite and GeneSpring, and links to internal and external databases. “Our customers told us that there are a lot of point solutions for data analysis, but there was a need for correlating information,” says Mandeville.

Databases for functional annotation

It is vital for most bioinformatics tools to link to external databases. Many software products are available that sift through private or public databases to find, for example, the pathway in which a particular protein participates or a pathway that connects two genes. Databases like GenBank and Gene Ontology (http://www.geneontology.org/) give descriptions of the genes themselves, whereas others, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/) or the Alliance for Cell Signaling (http://signaling-gateway.org), provide information about biological interactions. On the commercial front, examples of 'pathway databases' include Biobase's TRANSPATH-NetPro, a database for molecular pathways and cellular network modeling, and GeneGo's Metacore, which also contains disease information that can be overlaid on the pathways.

Ingenuity pioneered the hand-curated approach. Its database was created from millions of individually modeled relationships. “I have not seen another application with the same breadth and depth of information as Ingenuity Pathways Analysis,” says Daniel Siu, director of product management. Deployed through the Internet, the software can accept data from microarray, mass spectrometry or two-dimensional gel experiments. “Any results that can be translated into protein or gene IDs can be uploaded into the application to perform pathway analysis,” says Siu.

“Up until recently, pathway databases would have been considered an emerging technology. A few years ago most of our customers looked at these databases but did not find sufficient information about their genes. That has clearly changed,” says Affymetrix's Steve Lincoln.

From genes to pathways and back again

New software products take advantage of the knowledge contained in these databases to construct biological pathways and provide additional features. At one end of the spectrum are text-mining products, such as the one offered by Acumenta. It can search for several genes at once, using all their aliases, and simultaneously query more than a dozen sites, including PubMed, the U.S. Patent and Trademark Office, NCBI, sequence databases and international patent databases. Researchers can also ask the software to search for co-occurrences among the genes of interest. “If you start with a list of 50 genes, you can do a gene search and gather abstracts for each gene, and then rank them according to the articles that have all or most of the genes in them,” says Paul Martinez, vice-president for sales and marketing. In addition, the latest version of the software provides the “ability to annotate documents that are retrieved by Acumenta and share the annotations with the whole research team, compiling comments from each individual scientist,” says Martinez.
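The co-occurrence ranking Martinez describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Acumenta's algorithm: the function name and the abstracts are invented, and the sketch matches gene symbols literally without handling aliases.

```python
import re


def rank_by_cooccurrence(abstracts, genes):
    """Rank abstracts by how many of the query genes each one mentions.

    abstracts: dict mapping a document ID to its text.
    genes: gene symbols to look for (aliases are NOT handled in this sketch).
    Returns (document ID, sorted list of genes found), most genes first.
    """
    patterns = {
        gene: re.compile(r"\b" + re.escape(gene) + r"\b", re.IGNORECASE)
        for gene in genes
    }
    scored = []
    for doc_id, text in abstracts.items():
        found = sorted(g for g, pattern in patterns.items() if pattern.search(text))
        if found:
            scored.append((doc_id, found))
    # Most co-occurring genes first; ties broken by document ID.
    scored.sort(key=lambda item: (-len(item[1]), item[0]))
    return scored


# Invented abstracts, used only to exercise the function.
abstracts = {
    "a1": "TP53 and MYC cooperate with KRAS in this model.",
    "a2": "MYC amplification was observed.",
    "a3": "No gene symbols are mentioned here.",
}
print(rank_by_cooccurrence(abstracts, ["TP53", "MYC", "KRAS"]))
```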

Ariadne Genomics' PathwayAssist (which is also distributed by Stratagene) combines text mining with tools to identify biological relationships among genes of interest and to visually display these relationships as interactive clickable maps. PathwayAssist incorporates a text-mining technology called MedScan that can sift through scientific abstracts and articles to extract biological relationships. “MedScan can process any type of file including PubMed abstracts and search through the full text of articles,” says Anton Yuryev, director of application science. PathwayAssist can also link to and download information from pathway databases, such as KEGG. In addition, the latest version of the product allows integration with GeneSpring and Iobion Informatics' ArrayAssist gene expression analysis software. Microarray gene expression data can be overlaid on a pathway to show how genes and proteins are affected under different conditions.

BIOSoftware Systems Inc.'s BioPathway Explorer is a tool for drawing pathways, which can then be edited and annotated. “One researcher described it as a whiteboard for pathways,” says company representative Ned Haubein. The software allows for all relevant information about a pathway to be incorporated in one place. “You can attach images and overlay time series data on top of a pathway so that you can easily pick out a pattern, rather than having to look at a table of numbers,” says Haubein. Another use for this product is to numerically simulate pathway models. Once reaction kinetics have been defined, equations describing an entire pathway are automatically generated by the software. A researcher can then use the software to find out, for example, which step in a pathway is sensitive to various manipulations and then experimentally test these results.
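Generating equations from reaction kinetics and simulating them numerically can be illustrated with a small sketch. It assumes first-order (mass-action) kinetics and a simple forward Euler integrator; the equations BioPathway Explorer actually generates are not described here, and the species names and rate constants are made up.

```python
def simulate_pathway(reactions, initial, t_end, dt=0.001):
    """Integrate a chain of first-order reactions with forward Euler steps.

    reactions: list of (substrate, product, rate_constant) tuples.
    initial: dict mapping species to starting concentration.
    Hypothetical helper; real tools use more robust ODE solvers.
    """
    conc = dict(initial)
    for substrate, product, _ in reactions:
        # Species that appear only in reactions start at zero.
        conc.setdefault(substrate, 0.0)
        conc.setdefault(product, 0.0)

    steps = int(round(t_end / dt))
    for _ in range(steps):
        delta = {species: 0.0 for species in conc}
        for substrate, product, k in reactions:
            flux = k * conc[substrate]  # rate = k * [substrate]
            delta[substrate] -= flux
            delta[product] += flux
        for species in conc:
            conc[species] += dt * delta[species]
    return conc


# Toy pathway A -> B -> C with invented rate constants.
reactions = [("A", "B", 1.0), ("B", "C", 0.5)]
print(simulate_pathway(reactions, {"A": 1.0}, t_end=1.0))
```

Rerunning the simulation with one rate constant changed at a time gives a crude indication of which step the end product is most sensitive to.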

Integration between software for gene expression and pathway analysis can provide a powerful tool for designing experiments and interpreting results. Invitrogen's Vector PathBlazer can find all the biological reactions that involve a set of genes identified by microarray analysis with the companion Vector Xpression software. The product uses the TRANSPATH database licensed from BIOBASE and other public databases to identify and visualize reactions and find links between them, while highlighting major features.

Gene expression data analysis in DecisionSite. (Courtesy of Spotfire.)

By alternating between the Vector Xpression and Vector PathBlazer software, scientists can more quickly home in on a particular gene or set of genes. “For example, with Xpression you can find through one Affy chip experiment that 192 genes are 99% certain of being differentially expressed in thyroid cancer samples. With PathBlazer you find proteins that are part of the apoptosis and cell cycle pathways and further analysis shows which proteins are common between the two pathways. You can then go back to your microarray data and ask whether these particular proteins are differentially expressed in thyroid cancer,” says David Pot, product manager for Vector Xpression.
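The back-and-forth Pot describes amounts to set operations on gene lists. The sketch below is an illustration under assumptions, not the Vector software's logic: the function name is hypothetical and the gene labels are placeholders rather than real pathway memberships.

```python
def shared_and_changed(pathway_a, pathway_b, differential):
    """Find proteins common to two pathways, then those also differentially expressed.

    All three arguments are collections of gene or protein IDs.
    Returns two sorted lists: (shared members, shared members that changed).
    """
    shared = set(pathway_a) & set(pathway_b)
    changed = shared & set(differential)
    return sorted(shared), sorted(changed)


# Placeholder IDs, not real pathway annotations.
apoptosis = {"GENE_A", "GENE_B", "GENE_C"}
cell_cycle = {"GENE_B", "GENE_C", "GENE_D"}
differential = {"GENE_C", "GENE_E"}
print(shared_and_changed(apoptosis, cell_cycle, differential))
```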

PathBlazer can also help a scientist design a microarray experiment. “If you know that an oncogene is involved in a cancer and that another gene is overexpressed, PathBlazer can find the shorter pathway between the two genes and then query a microarray experiment to see if the pathway is involved,” says Pot.
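Finding a short route between two genes in a network of known interactions is a classic graph search. The sketch below uses a breadth-first search over an undirected interaction list; it is an assumption about how such a query could work, not a description of PathBlazer's algorithm, and the node names are invented.

```python
from collections import deque


def shortest_pathway(interactions, source, target):
    """Return one shortest chain of nodes linking source to target, or None.

    interactions: iterable of (node_a, node_b) pairs, treated as undirected.
    Hypothetical helper using breadth-first search.
    """
    graph = {}
    for a, b in interactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if source not in graph or target not in graph:
        return None

    previous = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):  # sorted for a reproducible result
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Invented network: two routes from ONC to TGT, one shorter than the other.
interactions = [("ONC", "K1"), ("K1", "K2"), ("K2", "TGT"), ("ONC", "X"), ("X", "TGT")]
print(shortest_pathway(interactions, "ONC", "TGT"))
```

The genes on the returned path could then be checked against a list of differentially expressed genes from a microarray experiment.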

The currently available bioinformatics products have allowed scientists to carry out functional genomics studies that may not have been otherwise feasible. As genomic information continues to pour out of every laboratory, computational tools to analyze, annotate and visualize these data will also continue evolving. With the right software, going from genes to pathways will soon be only a few clicks away (Table 1).

Table 1 Suppliers Guide: Companies offering bioinformatics software for gene expression and pathway analysis