Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Review Article
  • Published:

Data integration: challenges for drug discovery

Key Points

  • The integration of data from many sources, especially new 'omic' platforms, is increasingly challenging not just because of increasing volume of data but because these data are highly diverse, expecially in a drug discovery enterprise that inherently draws on many disparate types of data.

  • The rise of the 'omics' disciplines means that data-driven research will need to be integrated with hypothesis-driven methods, as well as experimental with observational studies, and in general data integration will have cultural consequences, because it combines results from different scientific disciplines.

  • Even seemingly homogeneous data create integration challenges, beginning with cross-platform normalization, meta-analysis methods, multiple testing issues and new logical and statistical complexities that only increase with greater data heterogeneity.

  • The drug discovery enterprise, extending from molecules to human populations, depends on the combination of much more heterogenous data as well, and this carries its own challenges in terms of bridging widely varying scales and contexts.

  • Integration increases the dimensionality of data, both in terms of its arity or number of attributes, and in terms of its degree of connectivity; both are known to place stresses on data management, visualization and algorithmic analysis.

  • Scientific data integration entails making choices about the representation of data (for example, tables, matrices or graphs), each of which affords different integration and analysis techniques, and also requires careful attention to the syntax (format) and semantics (meaning) of the data; the latter can require sophisticated knowledge representation tools called ontologies.

  • Business data integration places more emphasis on decision support and on processes surrounding portfolio progression, which can be captured in a form of semantics called business rules.

  • Data mining is a form of integrative query that is highly exploratory and stresses pattern recognition and clue generation from heterogenous data sources, often presented on the Web.

Abstract

The effective integration of data and knowledge from many disparate sources will be crucial to future drug discovery. Data integration is a key element of conducting scientific investigations with modern platform technologies, managing increasingly complex discovery portfolios and processes, and fully realizing economies of scale in large enterprises. However, viewing data integration as simply an 'IT problem' underestimates the novel and serious scientific and management challenges it embodies — challenges that could require significant methodological and even cultural changes in our approach to data.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Figure 1: Data domains.
Figure 2: Data integration schemata.
Figure 3: Data normalization.
Figure 4: Data dimensionality.
Figure 5: Integration views.

Similar content being viewed by others

References

  1. Venkatesh, T. V. & Harlow, H. B. Integromics: challenges in data integration. Genome Biol. 3, reports4027.1–4027.3 (2002).

    Article  Google Scholar 

  2. Hodgson, J. Reconstructing pharmaceutical instinct. Nature Biotechnol. 20, 1199–1203 (2002).

    Article  CAS  Google Scholar 

  3. Reichhardt, T. It's sink or swim as a tidal wave of data approaches. Nature 399, 517–520 (1999).

    Article  CAS  PubMed  Google Scholar 

  4. Ball, P. The speed of computers. Nature 402, C61 (1999).

    Article  Google Scholar 

  5. Wise, J. An information bank? Curr. Drug Disc. Feb, 9–10 (2003).

    Google Scholar 

  6. Wong, L. Technologies for integrating biological data. Brief. Bioinform. 3, 389–404 (2002).

    Article  PubMed  Google Scholar 

  7. Stein, L. D. Integrating biological databases. Nature Rev. Genet. 4, 337–345 (2003).

    Article  CAS  PubMed  Google Scholar 

  8. Eckman, B. A. in Bioinformatics: Managing Scientific Data (eds Lacroix, Z. & Critchlow, T.) 35–74 (Morgan Kaufmann, San Francisco, 2003).

    Book  Google Scholar 

  9. Golden, J. Towards a tractable genome: knowledge management in drug discovery. Curr. Drug Disc. Feb, 17–20 (2003).

    Google Scholar 

  10. Ficenec, D. et al. Computational knowledge integration in biopharmaceutical research. Brief. Bioinform. 4, 260–278 (2003).

    Article  CAS  PubMed  Google Scholar 

  11. Choi, J. K. et al. Integrative analysis of multiple gene expression profiles applied to liver cancer study. FEBS Lett. 565, 93–100 (2004).

    Article  CAS  PubMed  Google Scholar 

  12. Smalheiser, N. R. Informatics and hypothesis-driven research. EMBO Rep. 3, 702 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Weinstein, J. N. 'Omic' and hypothesis-driven research in the molecular pharmacology of cancer. Curr. Opin. Pharmacol. 2, 361–365 (2002). Explores the interaction between traditional hypothesis-driven and 'omics'-based data-driven research.

    Article  CAS  PubMed  Google Scholar 

  14. Hui, G., Walhout, A. J. M. & Vidal, M. Integrating 'omic' information: a bridge between genomics and systems biology. Trends Genet. 19, 551–560 (2003).

    Article  CAS  Google Scholar 

  15. Boguski, M. S. & McIntosh, M. W. Biomedical informatics for proteomics. Nature 422, 233–237 (2003).

    Article  CAS  PubMed  Google Scholar 

  16. Hallgren, E., Palmer, M. W. & Milberg, P. Data diving with cross validation: an investigation of broad–scale gradients in Swedish weed communities. J. Ecol. 87, 1037–1051 (1999).

    Article  Google Scholar 

  17. Smith, G. D. & Ebrahim, S. Data dredging, bias, or confounding. BMJ 325, 1437–1438 (2002).

    Article  PubMed  PubMed Central  Google Scholar 

  18. Tilstone, C. DNA microarrays: vital statistics. Nature 424, 610–612 (2003).

    Article  CAS  PubMed  Google Scholar 

  19. Reiner, A., Yekutieli, D. & Benjamini, Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19, 368–375 (2003).

    Article  CAS  PubMed  Google Scholar 

  20. McShane, L. M., Radmacher, M. D., Freidlin, B., Yu, R., Li, M. C. & Simon, R. Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 18, 1462–1469 (2002).

    Article  CAS  PubMed  Google Scholar 

  21. Simon, R., Radmacher, M. D., Dobbin, K. & McShane, L. M. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 95, 14–18 (2003).

    Article  CAS  PubMed  Google Scholar 

  22. Potter, J. D. At the interfaces of epidemiology, genetics and genomics. Nature Rev. Genet. 2, 142–147 (2001). Discusses the 'culture clash' at the intersection of different scientific disciplines brought together by integrative studies.

    Article  CAS  PubMed  Google Scholar 

  23. Glymour, C., Madigan, D., Pregibon, D. & Smyth, P. Statistical themes and lessons for data mining. Data Mining Knowledge Disc. 1, 11–28 (1997).

    Article  Google Scholar 

  24. Quackenbush, J. Microarray data normalization and transformation. Nature Genet. 32, 496–501 (2002).

    Article  CAS  PubMed  Google Scholar 

  25. Rajagopalan, D. A comparison of statistical methods for analysis of high density oligonucleotide array data. Bioinformatics 19, 1469–1476 (2003).

    Article  CAS  PubMed  Google Scholar 

  26. Kuo, W. P., Jenssen, T. K., Butte, A. J., Ohno–Machado, L. & Kohane, I. S. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18, 405–412 (2002).

    Article  CAS  PubMed  Google Scholar 

  27. Churchill, G. A. Fundamentals of experimental design for cDNA microarrays. Nature Genet. 32 suppl, 490–495 (2002).

  28. Yang, Y. H. et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30, e15 (2002).

    Article  PubMed  PubMed Central  Google Scholar 

  29. Ideker, T., Thorsson, V., Siegel, A. F. & Hood, L. E. Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J. Comput. Biol. 7, 805–817 (2000).

    Article  CAS  PubMed  Google Scholar 

  30. Pan, W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18, 546–554 (2002).

    Article  CAS  PubMed  Google Scholar 

  31. Wolfinger, R. D. et al. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8, 625–637 (2001).

    Article  CAS  PubMed  Google Scholar 

  32. Egger, M., Smith, G. D. & Phillips, A. N. Meta-analysis: principles and procedures. Brit. Med. J. 315, 1533–1537 (1997).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Cooper, H. & Hedges, L. V. (eds.) The Handbook of Research Systhesis (Russell Sage Foundation, New York NY, 1994).

    Google Scholar 

  34. Thomas, D. C. The problem of multiple inference in studies designed to generate hypotheses. Am. J. Epidemiol. 122, 1080–1095 (1985).

    Article  CAS  PubMed  Google Scholar 

  35. Perneger, T. V. What's wrong with Bonferroni adjustments. Brit. Med. J. 316, 1236–1238 (1998).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Bender, R. & Lange, S. Multiple test procedures other than Bonferroni's deserve wider use. Brit. Med. J. 318, 600 (1999).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Rhodes, D. R., Barrette, T. R., Rubin, M. A., Ghosh, D. & Chinnaiyan, A. M. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res. 62, 4427–4433 (2002).

    CAS  PubMed  Google Scholar 

  38. Stanton, J. L. & Green, D. P. Meta-analysis of gene expression in mouse preimplantation embryo development. Mol. Hum. Reprod. 7, 545–552 (2001).

    Article  CAS  PubMed  Google Scholar 

  39. Callow, M. J., Dudoit, S., Gong, E. L., Speed, T. P. & Rubin, E. M. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res. 10, 2022–2029 (2000).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Tusher, V. G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98, 5116–5121 (2001).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Aitchison, J. D. & Galitski, T. Inventories to insights. J. Cell Biol. 161, 465–469 (2003). Argues for the importance of data integration to systems biology.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Choi, J. K. et al. Integrative analysis of multiple gene expression profiles applied to liver cancer study. FEBS Lett. 565, 93–100 (2004).

    Article  CAS  PubMed  Google Scholar 

  43. Jiang, H. et al. Joint analysis of two microarray gene–expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics 5, 81 (2004).

    Article  PubMed  PubMed Central  Google Scholar 

  44. Stone, J. V. Independent component analysis: an introduction. Trends Cog. Sci. 6, 59–64 (2002).

    Article  Google Scholar 

  45. Saidi, S. A. et al. Independent component analysis of microarray data in the study of endometrial cancer. Oncogene 23, 6677–6683 (2004).

    Article  CAS  PubMed  Google Scholar 

  46. Schum, D. A. The Evidential Foundations of Probabilistic Reasoning (Wiley–Interscience, New York, 1994).

    Google Scholar 

  47. Pearl, J. Causality: Models, Reasoning, and Inference (Cambridge Univ. Press, Cambridge, UK, 2000). A seminal work on mathematical foundations for the analysis of causality in data.

    Google Scholar 

  48. Hinneburg, A. & Keim, D. A. In Proceedings of the 25th International Conference on Very Large Data Bases (eds Atkinson, M. P., Orlowska, M. E., Valduriez, P., Zdonik, S. B. & Brodie, M. L.) 506–517 (Morgan Kaufmann, San Francisco, 1999).

    Google Scholar 

  49. Halpin, T. Information Modeling and Relational Databases (Academic Press, San Diego, 2001).

    Google Scholar 

  50. Spence, R. Information Visualization (ACM Press, 2001).

    Google Scholar 

  51. Tufte, E. R. The Visual Display of Quantitative Information. Second Edition. (Graphics Press, Cheshire CT, 2001). Already a classic text, from the maven of scientific visualization.

    Google Scholar 

  52. Hierarchical and Geometrical Methods in Scientific Visualization. Farin, G. E., Hamann, B. & Hagen, H., eds. (Springer Verlag, 2003).

  53. Weber, R., Schek, H. & Blott, S. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB) pp. 194–205 (Morgan Kaufmann, San Francisco CA, 1998).

    Google Scholar 

  54. Bellman, R. Adaptive Control Processes: A Guided Tour (Princeton University Press, Princeton NJ, 1961). Coined the term 'curse of dimensionality' in establishing certain mathematical difficulties in dealing with data consisting of many independent features.

    Book  Google Scholar 

  55. Indyk, P. & Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. of the 30th ACM Symp. on Theory of Computing, pp. 604–612, (Addison Wesley, Boston MA, 1998).

    Google Scholar 

  56. Jungnickel, D. Graphs, Networks, and Algorithms (Springer Verlag, Berlin, 1999).

    Book  Google Scholar 

  57. Li, S. et al. A map of the interactome network of the metazoan C. elegans. Science 303, 540–543 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Toyoda, T., Mochizuki, Y. & Konagaya, A. GSCope: a clipped fisheye viewer effective for highly complicated biomolecular network graphs. Bioinformatics 19, 437–438 (2003).

    Article  CAS  PubMed  Google Scholar 

  59. Searls, D. B. Data integration — connecting the dots. Nature Biotechnol. 21, 844–845 (2003).

    Article  CAS  Google Scholar 

  60. Searls, D. B. Pharmacophylogenomics: genes, evolution and drug targets. Nature Rev. Drug Discov. 2, 613–623 (2003).

    Article  CAS  Google Scholar 

  61. Wooley, J. C. Trends in computational biology. J. Comp. Biol. 6, 459–474 (1999).

    Article  CAS  Google Scholar 

  62. Grunling, C. et al. Dyslexia: the possible benefit of multimodal integration of fMRI- and EEG-data. J. Neural Transm. 111, 951–969 (2004).

    Article  CAS  PubMed  Google Scholar 

  63. Rector, A. L., Rogers, J., Roberts, A. & Wroe, C. Scale and context: issues in ontologies to link health- and bio-informatics. Proc. AMIA Symp. 642–646 (2002).

  64. Gund, P., Dippolito, M. & Shimshock, Y. Much ado about data. Curr. Drug Disc. Feb 29–32 (2003).

  65. Bredel, M. & Jacoby, E. Chemogenomics: an emerging strategy for rapid target and drug discovery. Nature Rev. Genet. 5, 262–275 (2004). A comprehensive review of the trend in pharmacology to integrate chemical with genomic data.

    Article  CAS  PubMed  Google Scholar 

  66. Shah, S. P. et al. Pegasys: software for executing and integrating analyses of biological sequences. BMC Bioinformatics 5, 40 (2004).

    Article  PubMed  PubMed Central  Google Scholar 

  67. Huminiecki, L., Lloyd, A. T. & Wolfe, K. H. Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics 4, 31 (2003).

    Article  PubMed  PubMed Central  Google Scholar 

  68. Bader, G. D. & Hogue, C. W. V. Analyzing yeast protein–protein interaction data obtained from different sources. Nature Biotech. 20, 991–997 (2002).

    Article  CAS  Google Scholar 

  69. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868 (1998).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  70. Blower, P. E. et al. Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data. Pharmacogenomics J. 2, 259–271 (2002).

    Article  CAS  PubMed  Google Scholar 

  71. Yeger-Lotem, E. & Margalit, H. Detection of regulatory circuits by integrating the cellular networks of protein-protein interactions and transcription regulation. Nucleic Acids Res. 31, 6053–6061 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  72. Lee, S. G., Hur, J. U. & Kim, Y. S. A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics 20, 381–388 (2004).

    Article  CAS  PubMed  Google Scholar 

  73. Jansen, R. et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302, 449–453 (2003).

    Article  CAS  PubMed  Google Scholar 

  74. Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B. & Botstein, D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci. USA 100, 8348–8353 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  75. Imoto, S. et al. Use of gene networks for identifying and validating drug targets. J. Bioinform. Comput. Biol. 1, 459–474 (2003).

    Article  CAS  PubMed  Google Scholar 

  76. Alm, E. & Arkin, A. P. Biological networks. Curr. Opin. Struct. Biol. 13, 193–202 (2003).

    Article  CAS  PubMed  Google Scholar 

  77. Barabasi, A. L. & Oltvai, Z. N. Network biology: understanding the cell's functional organization. Nature Rev. Genet. 5, 101–113 (2004).

    Article  CAS  PubMed  Google Scholar 

  78. Yeger-Lotem, E. & Margalit, H. Detection of regulatory circuits by integrating the cellular networks of protein–protein interactions and transcription regulation. Nucleic Acids Res. 31, 6053–6061 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  79. Stelling, J., Klamt, S., Bettenbrock, K., Schuster, S. & Gilles, E. D. Metabolic network structure determines key aspects of functionality and regulation. Nature 420, 190–193 (2002).

    Article  CAS  PubMed  Google Scholar 

  80. Kohler, J. & Schulze-Kremer, S. The semantic metadatabase (SEMEDA): ontology based integration of federated molecular biological data sources. In Silico Biol. 2, 219–231 (2002).

    PubMed  Google Scholar 

  81. Kerr, R. A. English-metric miscue doomed Mars mission. ScienceNOW (American Association for the Advancement of Science) 30 Sept (1999).

    Google Scholar 

  82. Hegde, P. S., White, I. R. & Debouck, C. Interplay of transcriptomics and proteomics. Curr. Opin. Biotechnol. 14, 647–651 (2003).

    Article  CAS  PubMed  Google Scholar 

  83. Spellman, P. T. et al. Design and implementation of microarray gene expression markup language (MAGE–ML). Genome Biol. 3, RESEARCH0046 (2002).

    Article  PubMed  PubMed Central  Google Scholar 

  84. Brazma, A. et al. Minimum Information About a Microarray Experiment (MIAME)–toward standards for microarray data. Nature Genet. 29, 365–371 (2001).

    Article  CAS  PubMed  Google Scholar 

  85. Hermjakob, H. et al. The HUPO PSI's molecular interaction format–a community standard for the representation of protein interaction data. Nature Biotechnol. 22, 177–183 (2004).

    Article  CAS  Google Scholar 

  86. Wain, H. M. et al. Genew: the human gene nomenclature database. Nucl. Acids Res. 30, 169–171 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  87. Pruitt, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered resources. Nucl. Acids Res. 29, 137–140 (2001).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  88. Clark, T., Martin, S. & Liefeld, T. Globally distributed object identification for biological knowledgebases. Brief. Bioinform. 5, 59–70 (2004).

    Article  CAS  PubMed  Google Scholar 

  89. Ashburner, M. & Lewis, S. On ontologies for biologists: the Gene Ontology – untangling the web. Novartis Found. Symp. 247, 66–80 (2002).

    Article  CAS  PubMed  Google Scholar 

  90. Stevens, R., Goble, C. A. & Bechhofer, S. Ontology-based knowledge representation for bioinformatics. Brief. Bioinform. 1, 398–414 (2000).

    Article  CAS  PubMed  Google Scholar 

  91. Bard, J. B. L. & Rhee, S. Y. Ontologies in biology: design, applications and future challenges. Nature Rev. Genet. 5, 213–222 (2004). An extensive review of the forms and various uses of ontologies in biology.

    Article  CAS  PubMed  Google Scholar 

  92. Demir, E. et al. An ontology for collaborative construction and analysis of cellular pathways. Bioinformatics 20, 349–356 (2004).

    Article  CAS  PubMed  Google Scholar 

  93. Roux-Rouquie, M., Caritey, N., Gaubert, L. & Rosenthal-Sabroux, C. Using the Unified Modelling Language (UML) to guide the systemic description of biological processes and systems. Biosystems 75, 3–14 (2004).

    Article  PubMed  Google Scholar 

  94. Campagne, F. et al. Quantitative information management for the biochemical computation of cellular networks. Sci. STKE 2004, 1–12 (2004).

    Article  Google Scholar 

  95. Gault, L. V., Shultz, M. & Davies, K. J. Variations in Medical Subject Headings (MeSH) mapping: from the natural language of patron terms to the controlled vocabulary of mapped lists. J. Med. Libr. Assoc. 90, 173–180 (2002).

    PubMed  PubMed Central  Google Scholar 

  96. Blake, J. Bio-ontologies — fast and furious. Nature Biotechnol. 22, 773–774 (2004).

    Article  CAS  Google Scholar 

  97. Rothwell, D. J. SNOMED-based knowledge representation. Methods Inf. Med. 34, 209–213 (1995).

    Article  CAS  PubMed  Google Scholar 

  98. Burgun, A., Botti, G., Fieschi, M. & Le Beux, P. Issues in the design of medical ontologies used for knowledge sharing. J. Med. Syst. 25, 95–108 (2001).

    Article  CAS  PubMed  Google Scholar 

  99. Martin-Sanchez, F., Maojo, V. & Lopez-Campos, G. Integrating genomics into health information systems. Methods Inf. Med. 41, 25–30 (2002).

    Article  CAS  PubMed  Google Scholar 

  100. Stevens, R. et al. TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics 16, 184–185 (2000).

    Article  CAS  PubMed  Google Scholar 

  101. Kohler, J., Philippi, S. & Lange, M. SEMEDA: ontology based semantic integration of biological databases. Bioinformatics 19, 2420–2427 (2003).

    Article  PubMed  CAS  Google Scholar 

  102. Woods, W. A. Important issues in knowledge representation. Proc. IEEE 74, 1322–1334 (1986).

    Article  Google Scholar 

  103. McGuinness, D. L. & Patel-Schneider, P. F. in Proc. 15th Nature Conf. on Artificial Intelligence, pp. 608–614 (AAAI Press, Menlo Park, 1998).

    Google Scholar 

  104. Kohane, I. S. Bioinformatics and clinical informatics: the imperative to collaborate. J. Am. Med. Inform. Assoc. 7, 512–516 (2000).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  105. Tsuji, N. Selection of an internal control gene for quantitation of mRNA in colonic tissues. Anticancer Res. 22, 4173–4178 (2002).

    CAS  PubMed  Google Scholar 

  106. Tricarico, C., et al. Quantitative real-time reverse transcription polymerase chain reaction: normalization to rRNA or single housekeeping genes is inappropriate for human tissue biopsies. Anal. Biochem. 309, 293–300 (2002).

    Article  CAS  PubMed  Google Scholar 

  107. Turban, E. & Aronson, J. E. Decision Support Systems and Intelligent Systems 6th Edition (Prentice–Hall, Upper Saddle River, 2000).

    Google Scholar 

  108. Norvig, P. PowerPoint: shot with its own bullets. Lancet 362, 343–344 (2003). An entertaining indictment of bulleted and animated presentations as leading to a 'dumbing down' of complex information.

    Article  PubMed  Google Scholar 

  109. Tufte, E. R. The Cognitive Style of PowerPoint (Graphics Press, Cheshire, 2003).

    Google Scholar 

  110. Von Halle, B. Business Rules Applied (John Wiley & Sons, New York, 2001).

    Google Scholar 

  111. Berry, M. J. A. & Linoff, G. Data Mining Techniques (John Wiley & Sons, New York, 1997).

    Google Scholar 

  112. Tanabe, L. & Wilbur, W. J. Tagging gene and protein names in biomedical text. Bioinformatics 18, 1124–1132 (2002).

    Article  CAS  PubMed  Google Scholar 

  113. Boffetta, P. Molecular epidemiology. J. Intern. Med. 248, 447–454 (2000).

    Article  CAS  PubMed  Google Scholar 

  114. Konyndyk, K. Introductory Modal Logic (Univ. of Notre Dame Press, Notre Dame, 1986).

    Google Scholar 

  115. Blom, J. A. Temporal logics and real time expert systems. Comput. Methods Programs Biomed. 51, 35–49 (1996).

    Article  CAS  PubMed  Google Scholar 

  116. Parascandola, M. & Weed, D. L. Causation in epidemiology. J. Epidemiol. Community Health 55, 905–912 (2001).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  117. Weed, D. L. Environmental epidemiology: basics and proof of cause-effect. Toxicology 181–182, 399–403 (2002).

    Article  PubMed  Google Scholar 

  118. Minker, J. in Proc. 6th Conf. on Automated Deduction (Lecture Notes in Computer Science 138) 292–308 (Springer, New York NY, 1982).

    Book  Google Scholar 

  119. Zaniolo, C. Database relations with null values. J. Comput. and Systems Sci. 29, 142–166 (1984).

    Article  Google Scholar 

  120. Searls, D. B. Mining the bibliome. Pharmacogenomics J. 1, 88–89 (2001).

    Article  CAS  PubMed  Google Scholar 

  121. Brandt, C. A. et al. Metadata-driven creation of data marts from an EAV-modeled clinical research database. Int. J. Med. Inform. 65, 225–241 (1997).

    Article  Google Scholar 

  122. Nadkarni, P. M. QAV: querying entity-attribute-value metadata in a biomedical database. Comput. Methods Programs Biomed. 53, 93–103 (1997).

    Article  CAS  PubMed  Google Scholar 

  123. Nadkarni, P. M., Marenco, L., Chen, R., Skoufos, E., Shepherd, G. & Miller, P. Organization of heterogeneous scientific data using the EAV/CR representation. J. Am. Med. Inform. Assoc. 6, 478–493 (1999).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  124. Tavazoie, S. et al. Systematic determination of genetic network architecture. Nature Genet. 22, 281–285 (1999). An early demonstration of the integration of genomic sequence data with microarray data to discover novel regulatory elements.

    Article  CAS  PubMed  Google Scholar 

  125. Fink, J. L. et al. 2HAPI: a microarray data analysis system. Bioinformatics 19, 1443–1445 (2003).

    Article  CAS  PubMed  Google Scholar 

  126. Roven, C. & Bussemaker, H. J. REDUCE: An online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. Nucleic Acids Res. 31, 3487–3490 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  127. Coessens, B. et al. INCLUSive: a web portal and service registry for microarray and regulatory sequence analysis. Nucleic Acids Res. 31, 3468–3470 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  128. Tong, A. H. et al. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295, 321–324 (2002).

    Article  CAS  PubMed  Google Scholar 

  129. Ettwiller, L. M., Rung, J. & Birney, E. Discovering novel cis-regulatory motifs using functional networks. Genome Res. 13, 883–895 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  130. Reiss, D. J. & Schwikowski, B. Predicting protein-peptide interactions via a network-based motif sampler. Bioinformatics 20, I274–I282 (2004).

    Article  CAS  PubMed  Google Scholar 

  131. Obenauer, J. C. & Yaffe, M. B. Computational prediction of protein-protein interactions. Methods Mol. Biol. 261, 445–468 (2004).

    CAS  PubMed  Google Scholar 

  132. Steffen, M. et al. Automated modelling of signal transduction networks. BMC Bioinformatics 3, 34 (2002). A tour de force experiment that reconstructed signaling cascades from 'first principles,' by integration of expression and interaction data.

    Article  PubMed  PubMed Central  Google Scholar 

  133. Jansen, R., Lan, N., Qian, J. & Gerstein, M. Integration of genomic datasets to predict protein complexes in yeast. J. Struct. Funct. Genomics 2, 71–81 (2002).

    Article  CAS  PubMed  Google Scholar 

  134. Zhang, L. V., Wong, S. L., King, O. D. & Roth, F. P. Predicting co–complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5, 38 (2004).

    Article  PubMed  PubMed Central  Google Scholar 

  135. Simpson, J. C., Wellenreuther, R., Poustka, A., Pepperkok, R. & Wiemann, S. Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep. 1, 287–292 (2000).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  136. Del Val, C. et al. High-throughput protein analysis integrating bioinformatics and experimental assays. Nucleic Acids Res. 32, 742–748 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  137. Morley, M. et al. Genetic analysis of genome-wide variation in human gene expression. Nature 430, 743–747 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  138. Wang, J., Williams, R. W. & Manly, K. F. WebQTL: web-based complex trait analysis. Neuroinformatics 1, 299–308 (2003).

    Article  PubMed  Google Scholar 

  139. Pazos, F., Helmer-Citterich, M., Ausiello, G. & Valencia, A. Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. 271, 511–523 (1997).

    Article  CAS  PubMed  Google Scholar 

  140. Kim, W. K., Bolser, D. M. & Park, J. H. Large-scale co-evolution analysis of protein structural interlogues using the global protein structural interactome map (PSIMAP). Bioinformatics 20, 1138–1150 (2004).

    Article  CAS  PubMed  Google Scholar 

  141. Afonnikov, D. A. & Kolchanov, N. A. CRASP: a program for analysis of coordinated substitutions in multiple alignments of protein sequences. Nucleic Acids Res. 32, W64–W68 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  142. Tan, S. H., Zhang, Z. & Ng, S. K. ADVICE: Automated Detection and Validation of Interaction by Co-Evolution. Nucleic Acids Res. 32, W69–W72 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  143. Zhang, Z. & Ng, S. K. InterWeaver: interaction reports for discovering potential protein interaction partners with online evidence. Nucleic Acids Res. 32, W73–W75 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  144. Vieth, M., Higgs, R. E., Robertson, D. H., Shapiro, M., Gragg, E. A. & Hemmerle, H. Kinomics-structural biology and chemogenomics of kinase inhibitors and targets. Biochim. Biophys. Acta 1697, 243–257 (2004).

    Article  CAS  PubMed  Google Scholar 

  145. Giaever, G. et al. Chemogenomic profiling: identifying the functional interactions of small molecules in yeast. Proc. Nature Acad. Sci. USA 101, 793–798 (2004).

    Article  CAS  Google Scholar 

  146. Mestres, J. Computational chemogenomics approaches to systematic knowledge-based drug discovery. Curr. Opin. Drug Discov. Devel. 7, 304–313 (2004).

    CAS  PubMed  Google Scholar 

  147. Gunther, E. C., Stone, D. J., Gerwien, R. W., Bento, P. & Heyes, M. P. Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc. Natl. Acad. Sci. USA 100, 9608–9613 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  148. Blower, P. E. et al. Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data. Pharmacogenomics J. 2, 259–271 (2002).

    Article  CAS  PubMed  Google Scholar 

  149. Pinhasov, A. et al. Gene expression analysis for high throughput screening applications. Comb. Chem. High Throughput Screen. 7, 133–140 (2004).

    Article  CAS  PubMed  Google Scholar 

  150. Veselovsky, A. V. et al. Protein–protein interactions: mechanisms and modification by drugs. J. Mol. Recognit. 15, 405–422 (2002).

    Article  CAS  PubMed  Google Scholar 

  151. Zeng, J. Mini-review: computational structure-based design of inhibitors that target protein surfaces. Comb. Chem. High Throughput Screen. 3, 355–362 (2000).

    Article  CAS  PubMed  Google Scholar 

  152. Wall, M. E., Rechtsteiner, A. & Rocha, L. M. in A Practical Approach to Microarray Data Analysis (ed. Berrar, D. P., Dubitzky, W. & Granzow, M.) 91–109 (Kluwer, Norwell MA, 2003).

    Book  Google Scholar 

  153. Kasprzyk, A. et al. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 14, 160–169.

  154. Lenhard, B., Hayes, W. S. & Wasserman, W. W. GeneLynx: a gene-centric portal to the human genome. Genome Res. 11, 2151–2157.

  155. Safran, M. et al. GeneCards 2002: towards a complete, object-oriented, human gene compendium. Bioinformatics 18, 1542–1543.

  156. Tsai, J. et al. RESOURCERER: annotating and linking microarray resources within and across species. Genome Biol. 2, software0002. 1–0002. 4 (2001).

    Article  Google Scholar 

  157. O'Neil, M. J. et al. The Merck Index (13th edition). (Merck & Co., Rahway NJ, 2001).

    Google Scholar 

  158. Mootha, V. K. et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genet. 34, 267–273 (2003).

    Article  CAS  PubMed  Google Scholar 

  159. Al-Shahrour, F., Díaz-Uriarte, R. & Dopazo, J. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20, 578–580 (2004).

    Article  CAS  PubMed  Google Scholar 

  160. Zeeberg, B. R. et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 4, R28.

  161. Berriz, G. F., King, O. D., Bryant, B., Sander, C. & Roth, F. P. Characterizing gene sets with FuncAssociate. Bioinformatics 19, 2502–2504 (2003).

    Article  CAS  PubMed  Google Scholar 

  162. Barriot, R. et al. New strategy for the representation and the integration of biomolecular knowledge at a cellular scale. Nucleic Acids Res. 32, 3581–3589 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  163. Yu, H., Hatzivassiloglou, V., Rzhetsky, A. & Wilbur, W. J. Automatically identifying gene/protein terms in MEDLINE abstracts. J. Biomed. Inform. 35, 322–330 (2002).

    Article  CAS  PubMed  Google Scholar 

  164. Becker, K. G. et al. PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics 4, 61.

  165. Giuliano, K. A., Haskins, J. R. & Taylor, D. L. Advances in high content screening for drug discovery. Assay Drug Dev. Technol. 1, 565–577 (2003).

    Article  CAS  PubMed  Google Scholar 

  166. Jenssen, T. K., Laegreid, A., Komorowski, J. & Hovig, E. A literature network of human genes for high-throughput analysis of gene expression. Nature Genet. 28, 21–28 (2001).

    CAS  PubMed  Google Scholar 

  167. Wilbur, W. J. et al. Analysis of biomedical text for chemical names: a comparison of three methods. Proc. AMIA Symp. 176–180 (1999).

  168. Singh, S. B., Hull, R. D. & Fluder, E. M. Text Influenced Molecular Indexing (TIMI): a literature database mining approach that handles text and chemistry. J. Chem. Inf. Comput. Sci. 43, 743–752 (2003).

    Article  CAS  PubMed  Google Scholar 

  169. Dennis, G. Jr. et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4, P3.

  170. Dahlquist, K. D., Salomonis, N., Vranizan, K., Lawlor, S. C. & Conklin, B. R. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genet. 31, 19–20.

  171. Bugrim, A, Nikolskaya, T, Nikolsky, Y. Early prediction of drug metabolism and toxicity: systems biology approach and modeling. Drug Discov. Today 9, 127–135 (2004).

    Article  CAS  PubMed  Google Scholar 

  172. Tatusov, R. L. et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41 (2003).

    Article  PubMed  PubMed Central  Google Scholar 

  173. Li, L., Stoeckert, C. J. Jr. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  174. Lee, Y. et al. Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. 12, 493–502 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  175. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  176. Birney, E. et al. Ensembl 2004. Nucleic Acids Res. 32, D468–D470 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  177. Roignant, J. Y. et al. Absence of transitive and systemic pathways allows cell-specific and isoform-specific RNAi in Drosophila. RNA 9, 299–308 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  178. Wuchty, S. Interaction and domain networks of yeast. Proteomics 2, 1715–1723 (2002).

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

The author thanks M.Hurle, D. Crowther, S. Dear, J. Aaronson, P. Agarwal, M. Lutz and N. Odendahl for valuable contributions to this review.

Author information

Authors and Affiliations

Authors

Ethics declarations

Competing interests

The author declares no competing financial interests.

Related links

Related links

FURTHER INFORMATION

BioWisdom, Ltd.

Business Process Execution Language (BPEL)

CellSpace, Cellomics, Inc.

COG (Clusters of Orthologous Genes)

DAVID

Ensembl

EnsMart

FatiGO

GeneCards

GeneGo, Inc.

GeneLynx

GenMAPP

GoMiner

Integromics, SL

IPA, Ingenuity, Inc.

Life Sciences Identifier (LSID)

Open Biological Ontologies

OrthoMCL

PathwayAssist, StrataGene, Inc.

PubGene, Inc.

PubMatrix

RefSeq

RESOURCERER

SpotFire, Inc.

TOGA (TIGR Orthologous Gene Alignments)

UCSC Genome Browser

Glossary

MOORE'S LAW

A prediction made by engineer Gordon Moore in 1965 that the number of transistors per integrated circuit would increase exponentially with time. This has proven accurate, and moreover has been achieved at roughly constant cost per chip.

PRIMARY DATA

Data close to the experimental source; 'raw', unprocessed data.

DERIVED DATA

The result of processing, refining or interpreting data in any way.

META-ANALYSIS

The combination of results from several studies or experiments that bear on the same or similar questions or hypotheses. It entails the application of a variety of principles, methods and statistical techniques that are designed to ensure the validity of conclusions that might go beyond those supported by the individual studies.

DATA PROVENANCE

Information about the source and history of particular data items or sets, which is generally necessary to ensure their integrity, currency and reliability.

DECISION-SUPPORT SYSTEMS

A broadly defined term for computational systems and methods that aid humans in any decision-making process, generally for unstructured or semi-structured tasks.

NORMALIZATION

In data analysis, the processing of data that arises from distinct sources or experiments so as to allow their standardized evaluation and direct comparison, and which involves various transformations, corrections and weightings to accommodate systematic differences. In database design, the logical grouping of data into separate, well-factored tables according to rules that avoid known pitfalls of data redundancy and development of inconsistencies.

RESEARCH SYNTHESIS

The application of techniques ranging from systematic literature search to statistical meta-analysis compiling and summarizing evidence relating to a specific scientific question.

MULTIPLE TESTING

The testing of many independent null hypotheses in the same experiment, such that more stringent significance thresholds must be used to account for the multiple 'opportunities' for a chance false rejection. The simplest such adjustment is the classic Bonferroni correction.

ALEATORY UNCERTAINTY

Uncertainty arising from or associated with the inherent, irreducible, natural randomness of a system or process. The focus is therefore on the actual physical state of the system or process.

FREQUENTIST

Describing a conception of probability as being applicable solely to the (measurable) relative frequencies of random events. Also describes something/one that adheres to this approach.

EPISTEMIC UNCERTAINTY

Uncertainty associated with a model of a system or process and its parameters that arises from limitations on the data available or on causal understanding. The focus is on our state of knowledge of the system or process, or beliefs about it.

BAYESIAN

Describing an alternative to the frequentist approach that allows probabilities to apply to a broader range of propositions and beliefs. Its mathematical basis is Bayes' Rule, which supports initial assessments of the relative plausibility of hypotheses and their incremental refinement through additional observations. Also describes something/one that adheres to this approach.

DEMPSTER–SHAFER THEORY

A system for combining lines of evidence that deals in belief functions, which assign probabilities to sets of possibilities rather than single events, and provides formal rules for their combination.

ARITY

In mathematics, the number of arguments to a function or relation. In databases, therefore, the number of columns in a data table, or the number of attributes associated with an entity.

CURSE OF DIMENSIONALITY

The fanciful name given to statistical and algorithmic challenges posed by very high-dimensional spaces necessary to map data with many features, in which the resulting exponential growth in hypervolumes means that the data inevitably will be distributed ever more sparsely.

PRINCIPAL COMPONENTS ANALYSIS

(PCA). A mathematical technique to reduce a high-dimensional space to just a few orthogonal axes called principal components, which are defined in terms of weighted combinations of variables that maximize the variance of the data along those axes. PCA is typically used for clustering data and making its visualization easier.

CHEMOGENOMICS

The integration of genomics with chemistry — for instance, by using genomic readouts of compound action on a biological system, or adopting gene-family-centred approaches to drug discovery.

BAYESIAN NETWORK

A data structure that consists of a graph or network in which the nodes represent random variables and the connections describe dependencies between them. Such networks support various forms of probabilistic and causal inference based on several sources of data.

SYNTAX

The rules that govern the valid arrangements of symbols in some formal representation system. Examples include the grammar of a language, allowable structures of logical expressions or specifications of data formats.

SEMANTICS

The significance or meaning of symbols or arrangements of symbols in some formal representation system. Examples include the denotation of a sentence, the action specified by a statement in a computer program or the real-world referent of a data entity or attribute.

CONTROLLED VOCABULARY

A set of terms (such as gene names) that are unambiguous and non-redundant, and which are accepted as standards for the purpose of consistent database representation. They are often associated with synonym tables that can be used as an adjunct to database query or integration.

ONTOLOGY

In philosophy, the study of the nature of existence. In computer science, a systematically ordered representation of knowledge about a domain, in terms of its objects, concepts and other entities, as well as the variety of relationships among them.

BUSINESS RULES

The implementation in software and databases of the policies, practices, procedures and decision processes of an enterprise, including constraints on data and its legitimate operational uses. Also known as business logic.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Searls, D. Data integration: challenges for drug discovery. Nat Rev Drug Discov 4, 45–58 (2005). https://doi.org/10.1038/nrd1608

Download citation

  • Issue Date:

  • DOI: https://doi.org/10.1038/nrd1608

This article is cited by

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing