Networks can be constructed from various types of genome-wide data to predict the properties of complex genetic systems. A new study shows that the power of networks to predict biological properties of the genes involved can be increased by integrating expression data with information about QTLs, transcription-factor binding sites and protein–protein interactions.

Schadt and colleagues used previously published yeast data from several sources to construct networks, which they then validated against independent data and their own experimental observations. First, they used expression data from 112 genetically distinct yeast offspring, generated from a cross between two parental strains, to construct a co-expression network. This network showed which genes tended to be expressed together and allowed the authors to identify 15 co-expressed modules. Twelve of these modules were significantly enriched for at least one functional category from the Gene Ontology (GO) database. The authors then compared the expression data with genotypic data from the same 112 strains to map expression QTLs (eQTLs), which show the contribution of variation at particular loci to expression differences. They found that 14 of the co-expression modules were associated with particular eQTL hot spots, suggesting that the genes within each module are coordinately controlled by common factors associated with those eQTLs. Consistent with this, the genes within each module were enriched for binding sites for transcription factors that are causally associated with the module.
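The two steps described above, grouping genes by co-expression and then relating expression to genotype, can be illustrated with a minimal sketch. It is not the authors' pipeline: the gene names, expression values, genotype vector, correlation threshold and the helper functions `pearson`, `coexpression_modules` and `eqtl_effect` are all hypothetical, and a simple connected-components grouping and a difference in group means stand in for whatever module-detection and QTL-mapping methods the study actually used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpression_modules(expr, threshold=0.8):
    """Link genes whose |r| >= threshold; return connected components as modules."""
    genes = list(expr)
    neighbours = {g: set() for g in genes}
    for a, b in combinations(genes, 2):
        if abs(pearson(expr[a], expr[b])) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    modules, seen = [], set()
    for g in genes:
        if g in seen:
            continue
        stack, component = [g], set()
        while stack:  # depth-first walk over the co-expression graph
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        modules.append(sorted(component))
    return modules


def eqtl_effect(expression, genotype):
    """Mean expression of strains with allele 1 minus strains with allele 0."""
    with_1 = [e for e, g in zip(expression, genotype) if g == 1]
    with_0 = [e for e, g in zip(expression, genotype) if g == 0]
    return sum(with_1) / len(with_1) - sum(with_0) / len(with_0)


# Hypothetical toy data: six offspring strains, one marker, three genes.
genotype = [0, 0, 0, 1, 1, 1]  # parental allele inherited at the marker
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],
    "geneB": [0.8, 1.1, 1.0, 2.7, 3.2, 2.9],
    "geneC": [2.0, 1.0, 3.0, 2.0, 1.0, 3.0],
}

modules = coexpression_modules(expr)
print(modules)  # [['geneA', 'geneB'], ['geneC']]

# Both genes of the first module shift with the marker genotype,
# which is the pattern an eQTL hot spot would produce for a module.
effects = {gene: eqtl_effect(values, genotype) for gene, values in expr.items()}
print(effects)
```

In this toy example the two co-expressed genes both respond to the same marker, whereas the unrelated gene does not. A real analysis would test many markers and assess statistical significance rather than compare raw mean differences.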

However, these co-expression networks do not provide information on the connections between individual genes within modules. The authors therefore constructed Bayesian networks, into which they incorporated transcription-factor binding site and protein–protein interaction data from various sources, as well as the expression and eQTL data. They constructed three networks: one based on expression data only, one based on expression and eQTL data, and one based on all the data types. All three networks had a similar ability to predict the GO categories of the genes they contained. However, the network based on expression and eQTL data and the network based on all the data types were better than the expression-only network at predicting the presence of specific transcription-factor target sites and the consequences of gene knockouts.
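One way to see why additional data types can improve a Bayesian network is that expression data alone often cannot determine the direction of an edge. The sketch below is an illustration under stated assumptions, not the authors' algorithm. It scores the two possible two-gene networks with a linear Gaussian likelihood, which fits both directions equally well, and then adds a hypothetical structure prior of the kind that eQTL or transcription-factor binding evidence might supply. The function names, the toy expression values and the prior values 0.9 and 0.1 are invented for illustration.

```python
from math import log, pi


def _mean(x):
    return sum(x) / len(x)


def _gaussian_loglik(residuals):
    """Maximised Gaussian log-likelihood given residuals from a fitted model."""
    n = len(residuals)
    variance = sum(r * r for r in residuals) / n
    return -0.5 * n * (log(2 * pi * variance) + 1)


def marginal_loglik(x):
    """Log-likelihood of a gene with no parents (mean-only model)."""
    m = _mean(x)
    return _gaussian_loglik([v - m for v in x])


def conditional_loglik(child, parent):
    """Log-likelihood of a child gene regressed linearly on one parent gene."""
    mp, mc = _mean(parent), _mean(child)
    slope = (sum((p - mp) * (c - mc) for p, c in zip(parent, child))
             / sum((p - mp) ** 2 for p in parent))
    intercept = mc - slope * mp
    return _gaussian_loglik(
        [c - (intercept + slope * p) for p, c in zip(parent, child)]
    )


def edge_score(expr, parent, child, edge_prior=None):
    """Score the network parent -> child: log-likelihood plus optional log prior."""
    score = marginal_loglik(expr[parent]) + conditional_loglik(expr[child], expr[parent])
    if edge_prior is not None:
        score += log(edge_prior[(parent, child)])
    return score


# Hypothetical expression profiles for two co-expressed genes.
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],
    "geneB": [0.8, 1.1, 1.0, 2.7, 3.2, 2.9],
}

# Expression only: both directions give the same score (up to rounding),
# so the data cannot say which gene regulates which.
a_to_b = edge_score(expr, "geneA", "geneB")
b_to_a = edge_score(expr, "geneB", "geneA")
print(a_to_b, b_to_a)

# Assumed prior: suppose other evidence (for example, a local eQTL at geneA)
# makes geneA the more plausible regulator. The values are illustrative only.
edge_prior = {("geneA", "geneB"): 0.9, ("geneB", "geneA"): 0.1}
a_to_b_prior = edge_score(expr, "geneA", "geneB", edge_prior)
b_to_a_prior = edge_score(expr, "geneB", "geneA", edge_prior)
print(a_to_b_prior > b_to_a_prior)  # True
```

Both candidate networks have the same number of parameters, so a complexity penalty such as BIC would not change the comparison; only the prior breaks the tie. A full structure search over many genes would combine such scores across all candidate parent sets.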

Using the Bayesian network that was based on all the data types, the authors identified potential causal regulators for the eQTL hot spots found in the earlier analysis. These potential regulators included biosynthetic enzymes as well as transcription factors. By deleting the genes that encode them, the authors experimentally validated five of the regulators for the predicted effect on module gene expression.

These results demonstrate the power of Bayesian networks and of using data from multiple sources. The predictive power will be increased further if data from different genetic and environmental backgrounds can be combined, and if time-series data and feedback effects can be incorporated into the algorithms.