Systems-level analyses identify extensive coupling among gene expression machines
Karolina Maciag1, Steven J Altschuler2, Michael D Slack2, Nevan J Krogan3, Andrew Emili3, Jack F Greenblatt3, Tom Maniatis4 & Lani F Wu2
1. Bauer Center for Genomics Research, Harvard University, Cambridge, MA, USA
2. Department of Pharmacology and Green Comprehensive Center for Molecular, Computational and Systems Biology, University of Texas Southwestern Medical Center, Dallas, TX, USA
3. Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada
4. Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
Correspondence to: Tom Maniatis, Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA. Tel.: +1 617 495 1811; Fax: +1 617 495 3537; E-mail: maniatis@mcb.harvard.edu
Correspondence to: Lani F Wu, Department of Pharmacology and Green Comprehensive Center for Molecular, Computational and Systems Biology, University of Texas Southwestern Medical Center, 6001 Forest Park Blvd, Mail code 9041, Dallas, TX 75390, USA. Tel.: +1 214 645 6182; Fax: +1 214 645 5982; E-mail: lani.wu@utsouthwestern.edu
Received 14 June 2005; Accepted 6 December 2005; Published online 17 January 2006
Article highlights
- Coupling among subprocesses of the gene expression pathway in yeast can be inferred from multiple, large-scale protein-interaction data sets.
- We integrated diverse protein interaction data sets and identified significant coupling among protein clusters involved in distinct subprocesses.
- Two new independent, high-throughput protein interaction data sets corroborate the identification of clusters and of significant coupling links between them.
- Our approach may be generalized to identify molecular interactions among coupled subprocesses in other cellular contexts.
Synopsis
Experimental biology has provided significant insight into the individual subprocesses involved in eukaryotic gene expression: transcription, pre-mRNA capping, splicing and polyadenylation, mRNA export, and translation. However, only recently has the extensive coupling among these subprocesses been recognized (Maniatis and Reed, 2002; Orphanides and Reinberg, 2002), and much work remains to elucidate the interactions between molecular machineries involved.
To study coupling between gene expression subprocesses in yeast, we chose to take advantage of the large amount of protein interaction data available. High-throughput protein interaction assays are a potentially rich source of information to identify mechanisms that facilitate cooperation and communication between individual members of multiprotein complexes. While the sheer size and scope of these data sets render them a valuable resource, care must be taken because of the high error rates and biases inherent in high-throughput data.
We developed a three-part computational strategy to infer coupling among gene expression machineries from available interaction data that minimizes the impact of data set errors and biases, as outlined in Figure 1. First, we integrated source data sets to create a comprehensive, weighted protein interaction network (Figure 1A). As densely connected proteins are likely to correspond to molecular machineries, we clustered the resulting protein interaction network to suggest such groupings (Figure 1B). Finally, to evaluate and present hypotheses regarding coupling among gene expression subprocesses, we searched for intercluster network motifs—patterns in the arrangement of links and clusters—that are signatures of biological coupling between protein complexes (Figure 1C).
Figure 1
Overview of method (see main text and Supplementary information for details). (A) Construction of an integrated protein interaction network. Nodes represent proteins and links represent protein interactions. Line thickness corresponds to link weight (w). Input (1), relative quality calculation (2), and integration (3) of networks defined by interaction data sets S1–S13 generate a comprehensive, weighted protein interaction network. Pairwise CC scores (CC) are computed (4) using local network weight and topology information. (B) Unsupervised identification of biologically significant clusters in the network using an iterative clustering algorithm based on CC scores. Each randomized selection of k initial centers (1, in our analysis, k=70) followed by iterations of cluster definition and center repositioning (2) yields a clustering (3); the best clustering from multiple trials, as defined in the text, is chosen (4). Clusters generated are functionally characterized (5). (C) Motifs in the interaction network identify direct (1), cluster-mediated (2), and adaptor-mediated (3) coupling among clusters.
We collected 13 yeast data sets for our analysis of the gene expression process, but our method can be generalized to any number of interaction data sets, biological processes, and organisms. Among the challenges of integrating a variety of diverse data sets are the varying coverage and quality of each one and the lack of a comprehensive gold standard. To address these difficulties, we developed a relative data set quality (RDQ) score based on measures of pairwise mutual data set overlap.
This concept of evaluating quality based solely on mutual comparison has been used by search engines for ranking web pages (Page et al, 1998). We weighted the contribution of each data set to the total network by its RDQ, so that the weight of each network edge reflects the calculated reliability of its source data. We compared different RDQ calculation methods, and selected one that both appropriately penalized data sets corrupted by false positives and corresponded to independent reliability criteria.
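The idea of scoring data sets by mutual overlap and then weighting network edges by those scores can be illustrated with a minimal Python sketch. This is not the RDQ formula used in the study, which is not given in this synopsis. It assumes a simple PageRank-like iteration in which a data set scores well when a large fraction of its interactions recur in other well-scoring data sets. The function names (`edge`, `overlap_matrix`, `relative_quality`, `integrate`) and the toy data sets are hypothetical.

```python
def edge(a, b):
    """Undirected interaction between two proteins."""
    return frozenset((a, b))


def overlap_matrix(datasets):
    """For each ordered pair (a, b): fraction of a's interactions also present in b."""
    names = list(datasets)
    matrix = {}
    for a in names:
        matrix[a] = {}
        for b in names:
            if a == b or not datasets[a]:
                matrix[a][b] = 0.0
            else:
                matrix[a][b] = len(datasets[a] & datasets[b]) / len(datasets[a])
    return matrix


def relative_quality(datasets, iterations=50):
    """Assumed scoring rule: iterate score[a] = sum_b overlap[a][b] * score[b], normalized."""
    names = list(datasets)
    overlap = overlap_matrix(datasets)
    score = {n: 1.0 / len(names) for n in names}
    for _ in range(iterations):
        new = {a: sum(overlap[a][b] * score[b] for b in names) for a in names}
        total = sum(new.values())
        if total == 0:
            break  # no overlap at all, keep the uniform scores
        score = {n: v / total for n, v in new.items()}
    return score


def integrate(datasets, score):
    """Weight each edge by the summed scores of the data sets that report it."""
    weights = {}
    for name, edges in datasets.items():
        for e in edges:
            weights[e] = weights.get(e, 0.0) + score[name]
    return weights


# Toy input: S3 mostly reports interactions that no other data set confirms.
toy = {
    "S1": {edge("A", "B"), edge("A", "C"), edge("B", "C")},
    "S2": {edge("A", "B"), edge("A", "C"), edge("B", "C"), edge("C", "D")},
    "S3": {edge("A", "B"), edge("X", "Y"), edge("X", "Z"),
           edge("Y", "Z"), edge("W", "X"), edge("W", "Y")},
}
scores = relative_quality(toy)
network = integrate(toy, scores)
```

In this toy example the poorly corroborated data set S3 receives the lowest score, so edges reported only by S3 carry less weight in the integrated network than edges confirmed by several data sets.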
Because false positives and negatives nevertheless persist in the integrated network, we also derived for each pair of proteins a novel pairwise clustering coefficient (CC). Our CC definition provides a measure of the local, weighted network neighborhood around a pair of proteins, including pairs lacking a direct link. Heuristically, for each pair of proteins, links to common neighbors increase the CC, whereas links to uncommon neighbors indicate promiscuous binding and decrease the CC. The CC thus provides a powerful metric for identifying network regions that correspond to physical protein complexes. We then developed a method, based on the k-means algorithm, to cluster proteins within the integrated network using their CC-weighted links. The parameters used both for CC calculation and for our clustering algorithm were independently selected based on the ability to discern biologically interpretable clusters.
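The heuristic behind the pairwise CC and the center-based clustering can be sketched in Python as follows. The exact CC definition and clustering parameters are not given in this synopsis, so this sketch substitutes a weighted Jaccard-style score (shared neighbors raise it, unshared neighbors lower it) and a simple k-medoids-style loop with caller-supplied initial centers. The names `build_graph`, `pairwise_cc`, and `cluster` are hypothetical, and the random restarts and best-clustering selection described in the Figure 1 legend are omitted.

```python
def build_graph(weighted_edges):
    """Turn {(a, b): weight} into a symmetric adjacency dict."""
    graph = {}
    for (a, b), w in weighted_edges.items():
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def pairwise_cc(graph, u, v):
    """Assumed CC: weighted Jaccard overlap of the two neighborhoods.

    Works for pairs without a direct link; the u-v link itself is ignored.
    """
    nu = {n: w for n, w in graph.get(u, {}).items() if n != v}
    nv = {n: w for n, w in graph.get(v, {}).items() if n != u}
    shared = sum(min(nu[n], nv[n]) for n in nu.keys() & nv.keys())
    total = sum(max(nu.get(n, 0.0), nv.get(n, 0.0)) for n in nu.keys() | nv.keys())
    return shared / total if total else 0.0


def cluster(graph, centers, rounds=10):
    """k-medoids-style loop: assign proteins to the closest center, then re-center."""
    proteins = sorted(graph)
    centers = list(centers)
    clusters = {}
    for _ in range(rounds):
        clusters = {c: [c] for c in centers}
        for p in proteins:
            if p in clusters:
                continue  # p is itself a center
            best = max(centers, key=lambda c: pairwise_cc(graph, p, c))
            clusters[best].append(p)
        # New center: the member with the highest total CC to its cluster mates.
        new_centers = [
            max(members,
                key=lambda m: sum(pairwise_cc(graph, m, o) for o in members if o != m))
            for members in clusters.values()
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters


# Toy network: two triangles joined by one weak link (C-X).
toy_edges = {
    ("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.0,
    ("X", "Y"): 1.0, ("X", "Z"): 1.0, ("Y", "Z"): 1.0,
    ("C", "X"): 0.1,
}
toy_graph = build_graph(toy_edges)
toy_clusters = cluster(toy_graph, centers=["A", "X"])
```

Using a neighborhood-based score rather than raw link weight is what lets a pair with no direct link still score highly when its members share the same interaction partners.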
Within this network of interconnected protein clusters, supervised identification of motifs corresponding to known patterns of process coupling yielded hypotheses about coupling within the gene expression pathway. Our three motif types were as follows: direct coupling, identifying strong links between separate clusters; cluster-mediated coupling, identifying small clusters that link two larger clusters; and adaptor-mediated coupling, identifying proteins that may belong to either of two clusters, such as scaffolding linker proteins or proteins that shuttle between complexes and transiently associate with each. Ranking all occurrences of these motifs in the network yielded a prioritized list of experimentally verifiable biological hypotheses.
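Two of these motif types can be illustrated with a short Python sketch. The ranking criteria used in the study are not given in this synopsis, so the rules below are assumptions: direct coupling is scored as the total weight of links crossing between two clusters, and an adaptor candidate is a protein whose link weight to one foreign cluster reaches a chosen fraction (`ratio`, a hypothetical parameter) of its link weight to its own cluster. Cluster-mediated coupling is omitted, and the names `direct_coupling` and `adaptor_candidates` are hypothetical.

```python
from collections import defaultdict


def direct_coupling(weighted_edges, membership):
    """Rank cluster pairs by the total weight of links that cross between them."""
    strength = defaultdict(float)
    for (a, b), w in weighted_edges.items():
        ca, cb = membership[a], membership[b]
        if ca != cb:
            strength[tuple(sorted((ca, cb)))] += w
    return sorted(strength.items(), key=lambda item: item[1], reverse=True)


def adaptor_candidates(weighted_edges, membership, ratio=0.5):
    """List (protein, home cluster, other cluster) where the protein links to the
    other cluster with at least `ratio` times its link weight to its home cluster."""
    per_cluster = defaultdict(lambda: defaultdict(float))
    for (a, b), w in weighted_edges.items():
        per_cluster[a][membership[b]] += w
        per_cluster[b][membership[a]] += w
    hits = []
    for protein, totals in per_cluster.items():
        home = membership[protein]
        own = totals.get(home, 0.0)
        for other, w in totals.items():
            if other != home and own > 0 and w >= ratio * own:
                hits.append((protein, home, other))
    return sorted(hits)


# Toy input: protein C sits in cluster C1 but binds strongly into C2.
toy_membership = {"A": "C1", "B": "C1", "C": "C1",
                  "X": "C2", "Y": "C2", "Z": "C2",
                  "P": "C3", "Q": "C3"}
toy_links = {
    ("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 1.0,
    ("X", "Y"): 1.0, ("Y", "Z"): 1.0, ("X", "Z"): 1.0,
    ("P", "Q"): 1.0,
    ("C", "X"): 0.9, ("C", "Y"): 0.8, ("A", "P"): 0.1,
}
ranked_pairs = direct_coupling(toy_links, toy_membership)
adaptors = adaptor_candidates(toy_links, toy_membership)
```

Sorting motif occurrences by a score in this way is what turns the set of detected patterns into a prioritized list of hypotheses.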
Our analysis was corroborated with independent, experimental protein interaction data from new, more comprehensive genome-wide complex precipitation data sets. Interactions defining higher ranking direct coupling motifs were drastically more likely to correspond to links in higher quality validation sets, as were links within identified clusters.
Results recapitulated previous knowledge about gene expression mechanisms and suggested new possibilities. Several of our clusters were found to correspond to well-characterized biological complexes, such as the spliceosome, the mRNA cleavage/polyadenylation factors, and the chromatin remodeling machineries SAGA, Swi/Snf, ISWI, and RSC (Figure 3A and B). Other clusters corresponded to functional modules that may represent sets of conditionally associated or interchangeable parts with shared interaction partners (Figure 3B and C), suggesting the architecture and coordination for dynamic, macromolecular structures such as the basal transcription machinery.
Figure 3
Protein clusters and top-ranking motifs suggest mechanisms of coupling between gene expression processes. Motifs described in the text are illustrated. For each cluster, n indicates the total number of proteins illustrated and m the total number of proteins in the cluster. For each specially noted subgroup of proteins within a cluster, P indicates the number of proteins in the subgroup. (A) Clusters may reconstruct well-known structural complexes. (B) Clusters reconstruct GTF machinery despite data missing from original data sets, and suggest conditional association of members in C8. (C) Seven DExD/H helicases in a single cluster identify a functional module. (D) Coupling of capping, elongation, and splicing suggested by the co-clustering and binding patterns of a cap-binding protein. (E) Top-ranked cluster-mediated coupling motif suggests coupling between elongation and mRNA quality control degradation. (F) Direct and adaptor-mediated coupling motifs suggest possible nuclear mRNA circularization, along with coupling among mRNA transcription and processing with export. (G) Cluster- and adaptor-mediated coupling motifs suggest coordination of transcription, export, and translation. (H) The top-ranking direct coupling motifs indicate possible coupling of mRNA export to chromatin silencing. (I) Direct coupling motif and co-clustering suggest coupling of transcription and mRNA export with translation and NMD, possibly at the nuclear pore.
Cluster analysis also suggests specific hypotheses, such as the possibility of coupling among elongation, capping, and splicing machineries through cap-binding proteins associated with elongation complexes and poised to interact with splicing components (Figure 3D). Involvement in motifs may suggest functional roles for unknown proteins: for example, the unknown proteins YJL015Cp and YDR154Cp are implicated in mRNA quality control through cluster-mediated coupling of mRNA degradation to elongation, in a mechanism that possibly involves recruitment at transcription initiation (Figure 3E). A number of motifs support coupling of export with transcription termination, elongation, and chromatin modification: the export factors Pab1p and Crm1p, for example, are implicated in adaptor-mediated coupling of export to transcription termination (Figure 3F). Consistent with previous results, adaptor-mediated and direct coupling motifs identify the export factors Yra1p and Sub2p, respectively, in linking mRNA export to transcriptional elongation (Figure 3F and G). Interestingly, motifs involving the chromatin silencing proteins Sir2p, Sir4p, and Rap1p suggest a possible coupling between silencing and mRNA export (Figure 3H). Finally, a number of motifs suggest the coupling of translation and nonsense-mediated decay (NMD) factors with nuclear events, including co-transcriptional recruitment of NMD factors Upf1p and Nmd5p at the nuclear pore (Figure 3I), possible nuclear mRNA circularization facilitated in part by direct coupling between Cdc33p and Tif4632p (Figure 3F), and cluster-mediated coupling of translation and NMD to transcription and mRNA export (Figure 3G).
Besides providing hypotheses regarding coupling of gene expression subprocesses, this work introduces several techniques for manipulation and analysis of interaction data. As more data become available, they may be integrated into the analysis, allowing ongoing evolution of the accuracy of the model. Finally, our method is readily extendable to other biological processes and organisms and, thus, provides a novel, useful strategy to elucidate the molecular basis of coordination of cellular events.
Acknowledgements
We thank Natalie Thompson and Thanuja Premwardena for extraction of validation data, Naoko Tanese for comments, George Church, Aviv Regev, and June Oshiro for helpful discussion, and Guocheng Yuan, Nicole Francis, and Barbara Wold for critical reading of the manuscript. We also thank the anonymous reviewers at Nature MSB for valuable and constructive critiques. We thank the Bauer Center for Genomics Research for its generous support of LFW, SJA, and KM.
Contributions
KM, SJA, LFW, and TM designed the study and analyzed the results. KM and MS implemented the computational framework. NJK, AE, and JFG supplied the validation data. KM, SJA, LFW, and TM wrote the paper.
References
- Maniatis T, Reed R (2002) An extensive network of coupling among gene expression machines. Nature 416: 499–506
- Orphanides G, Reinberg D (2002) A unified theory of gene expression. Cell 108: 439–451
- Page L, Brin S, Motwani R, Winograd T (1998) The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper


