Main

Clinical informatics and bioinformatics

The decade of the 1940s brought the first electronic digital computers, as well as the first antibiotic, penicillin. Motivated by these revolutionary innovations, a few biomedical researchers had begun by the late 1950s to explore the possible utility of digital computers. By the 1960s, computers were in extensive use in the medical sciences, which are fundamentally information-intensive. The English term medical informatics (a translation of the Russian informatika) first appeared in 1974, both because this new area of biomedical knowledge needed a name and because no single English term encompassed both information (what is processed) and computers (how it is processed). The name also needed to encompass the fields of science, engineering, and technology.1

Bioinformatics, a newly named and rapidly emerging field of biomedical research, has been recognized for about a decade. The emergence of modern bioinformatics drew enormous insight from carefully constructed clinical genetics databases, such as disease-specific mutation databases and genotype-phenotype analyses. A flood of large-scale genomic and postgenomic data, powered by high-throughput technologies and large-scale databases, means that many of the challenges in biomedical research are now challenges in computational science. Not only are many of the fundamental problems in genomics/proteomics, such as string sequence homology, pattern recognition, structure prediction, and network analysis, problems of computational science, but the structural, behavioral, and developmental features of living organisms are themselves fundamentally informational phenomena.

Biomedical informatics, the convergence of bioinformatics and clinical informatics, is radically transforming our biomedical understanding in much the same way that biochemistry did a generation ago. Some academic institutions have already integrated bioinformatics and clinical informatics programs that share areas of research,2,3 core methodologies, challenges, goals, and impact.4–6 As bioinformatics moves from collecting raw biomolecular data to elucidating their biological functions and clinical importance, quality clinical information will become critical to further progress. A patient's biomolecular information, such as personal and familial genetic code, will soon be included in his/her electronic medical record as the most predictive clinical information for diagnostics, therapeutics, and prognostics; this could threaten the right to privacy and confidentiality. Comprehensive integration of bioinformatics and clinical informatics systems, then, will be one of the primary challenges of the next decades.

Accomplishments of bioinformatics and the clinical relevance of biochip informatics

The critical dependence of the success of the Human Genome Project on bioinformatics is just one example of the remarkable accomplishments of bioinformatics. Other areas where bioinformatics has been crucial include sequence alignment of DNA and protein, natural genetic variation, prediction of the structure and function of biological macromolecules, analysis of biomolecular interaction networks, integration of heterogeneous biological databases, biomolecular knowledge representation, simulation of biological processes, analysis of the data created by large-scale biological experiments, and rational drug design.

Most researchers agree that the challenge now is to understand all the data. The speed of data generation now exceeds that of interpretation (e.g., GenBank contains more sequences than related publications). This has become even more serious with the introduction of biochips that measure the functional activities of genes and proteins. DNA microarrays are microscope slides spotted with a large number of cDNA (or oligonucleotide) probes; fluorescently labeled samples derived from transcripts (mRNAs) hybridize to these probes, allowing transcript abundance to be monitored quantitatively. An image scanner translates the fluorescent intensities into a numerical matrix of expression profiles.
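
As a concrete illustration, the scanner output can be thought of as a genes-by-conditions matrix. The following minimal Python sketch uses entirely hypothetical probe names and intensity values:

```python
import numpy as np

# Hypothetical scanner output: rows = gene probes, columns = experimental
# conditions; each entry is a measured fluorescent intensity.
genes = ["probe_A", "probe_B", "probe_C"]
conditions = ["control", "treated_2h", "treated_8h"]
intensities = np.array([
    [1200.0,  450.0,  300.0],
    [ 800.0, 2100.0, 2600.0],
    [5000.0, 5100.0, 4900.0],
])

# Expression is conventionally analyzed on a log scale.
expression = np.log2(intensities)
print(expression.round(2))
```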

Now that comprehensive maps of the human genome and transcriptome are available, and because biochip technology can be applied to cells or tissue samples without isolating individual genes or proteins from them, we have an astounding technique for addressing the comprehensive spatial and temporal genomic complexity of living organisms under different experimental conditions. Biochip informatics with comprehensive expression profiling is clearly one of the most direct bridges from biomolecular informatics to clinical medicine and the improvement of diagnostics, therapeutics, and prognostics.

Integrative biochip informatics in functional genomics and proteomics

Biochip informatics: Basic data analysis

Because there are many sources of noise and systematic variability in microarray experiments,7,8 data normalization and preprocessing are crucial steps in analysis. Normalization comprises the transformations that control systematic variability within a chip or across multiple chips. The simplest approaches divide all expression values by a representative value for the system, or subtract it from them, or apply a linear transformation to a fixed mean (e.g., 0.0) and unit variance (i.e., 1.0) (sometimes called “median polishing”). However, a linear response between the true expression level and the measured fluorescent intensity is not guaranteed,9,10 especially when dye biases depend on array spot intensity or when multiple print tips are used in the microarray spotter.11
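
A minimal Python sketch of such a linear normalization, assuming a log-scale expression matrix with chips in columns; it deliberately ignores the intensity-dependent dye and print-tip effects just mentioned:

```python
import numpy as np

def normalize_chips(expression: np.ndarray) -> np.ndarray:
    """Rescale each chip (column) to zero mean and unit variance.

    A sketch of the simple linear normalization described above; real
    pipelines often need intensity- or print-tip-dependent corrections.
    """
    mean = expression.mean(axis=0)        # per-chip representative value
    std = expression.std(axis=0, ddof=1)  # per-chip spread
    return (expression - mean) / std
```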

Data preprocessing includes the transformations that prepare the data for subsequent analysis. Scaling and filtering are the major steps of data preprocessing. A low-variation filter that excludes genes that do not change significantly across experiments has been applied successfully in many studies.12 Statistical significance testing, such as analysis of variance and multiple comparisons, can also be used to filter out genes that show no significant change across conditions when a sufficient number of repeated observations is available.
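
The following Python sketch illustrates both filtering steps; the thresholds and array shapes are illustrative assumptions, not prescriptions from the cited studies:

```python
import numpy as np
from scipy.stats import f_oneway

def low_variation_filter(expr: np.ndarray, min_std: float = 0.5) -> np.ndarray:
    """Keep only genes (rows) whose expression varies across experiments."""
    return expr[expr.std(axis=1) >= min_std]

def anova_filter(groups: list[np.ndarray], alpha: float = 0.05) -> np.ndarray:
    """Flag genes with a significant one-way ANOVA across conditions.

    `groups` holds one array per condition, each of shape
    (n_genes, n_replicates). No multiple-testing correction is applied
    here; a real analysis would add one.
    """
    keep = []
    for g in range(groups[0].shape[0]):
        _, p = f_oneway(*[grp[g] for grp in groups])
        keep.append(p < alpha)
    return np.array(keep)
```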

The importance of data visualization cannot be overemphasized; it is highly recommended to scatter-plot the data whenever possible. The most straightforward approach to microarray data analysis is to find genes that are differentially expressed across experimental conditions.13,14 Standardized expression profiling, consistent database design, and streamlined experimental process management are all crucial,15,16 as are the supervised and unsupervised machine-learning algorithms that make sense of the mountains of genomic data. The sections that follow briefly describe these machine-learning approaches to deciphering genomic data.
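
As an illustration of the scatter-plot approach, the Python sketch below plots two hypothetical conditions against each other and flags genes beyond an assumed 2-fold-change cutoff:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical log2 expression values for two conditions.
rng = np.random.default_rng(0)
cond_a = rng.normal(8.0, 2.0, 1000)
cond_b = cond_a + rng.normal(0.0, 0.5, 1000)

# Genes beyond a 2-fold change (|log2 ratio| > 1) are flagged as
# differentially expressed; the cutoff is illustrative only.
diff = np.abs(cond_b - cond_a) > 1.0

plt.scatter(cond_a[~diff], cond_b[~diff], s=5, color="gray")
plt.scatter(cond_a[diff], cond_b[diff], s=5, color="red")
plt.xlabel("log2 expression, condition A")
plt.ylabel("log2 expression, condition B")
plt.show()
```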

Biochip informatics: Functional clustering and machine-learning approaches

A general question in many research areas is how to organize observed data into meaningful structures. One common difficulty in biochip data analysis is the very high dimensionality of the data. Data projection methods reduce this dimensionality by projecting the complex data structure onto a lower-dimensional space. Cluster analysis, by reducing dimensionality, creates hypothesized clusters and helps researchers infer the unknown functions of genes, on the assumption that genes with similar expression profiles may be functionally associated.

Principal component analysis, a statistical approach that reduces dimensionality without losing significant information by attending only to those dimensions that account for large variance in the data, has been applied to microarray data analysis.17,18 Multidimensional scaling, a data projection method originally developed in mathematical psychology,19 has also been shown to be a powerful tool in functional genomics research.20
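
A minimal Python sketch of both projections on a hypothetical expression matrix, using the scikit-learn implementations (which the cited studies do not necessarily use):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

# Hypothetical expression matrix: 100 genes x 12 conditions.
rng = np.random.default_rng(42)
expr = rng.normal(size=(100, 12))

# PCA keeps the directions of largest variance in the data.
pca_coords = PCA(n_components=2).fit_transform(expr)

# MDS seeks a 2-D embedding that preserves pairwise distances
# between gene expression profiles.
mds_coords = MDS(n_components=2, random_state=42).fit_transform(expr)
print(pca_coords.shape, mds_coords.shape)  # (100, 2) (100, 2)
```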

Cluster analysis is currently the most frequently used multivariate technique for analyzing microarray data. Clusters can be constructed using a variety of similarity or distance metrics: Euclidean distance, correlation coefficients, or mutual information. Hierarchical tree clustering joins similar objects into successively larger clusters in a bottom-up manner (i.e., from the leaves to the root of the tree) by successively relaxing the threshold for joining objects or sets (Fig. 1).21,22 The relevance-networks approach takes the opposite strategy.23 It starts with a completely connected graph, with vertices representing objects and edges representing a measure of association, and then deletes links one by one to reveal “naturally emerging” clusters at a certain threshold.
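
A minimal Python sketch of bottom-up hierarchical clustering with a correlation-based distance on hypothetical data; the linkage method and cutting threshold are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
expr = rng.normal(size=(50, 8))  # 50 genes x 8 conditions (hypothetical)

# Correlation distance (1 - Pearson r) between gene expression profiles.
dist = pdist(expr, metric="correlation")

# Bottom-up (agglomerative) clustering: the closest genes/clusters are
# joined first, building the tree from leaves to root.
tree = linkage(dist, method="average")

# Cutting the tree at a distance threshold yields flat clusters.
labels = fcluster(tree, t=0.8, criterion="distance")
```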

Fig. 1. Cluster analysis and graphical display of genome-wide expression patterns (Jurkat T cells under gamma irradiation). (a) Hierarchical clustering creates functional clusters with color-coded expression patterns. (b) Partitional clusters with geometric grid structure are created by self-organizing maps.

Partitional clustering algorithms, such as K-means analysis and self-organizing maps,24 which minimize within-cluster scatter or maximize between-cluster scatter, have been shown to be capable of finding meaningful clusters in functional genomic data (Fig. 1).25,26 Creating a hierarchical-tree structure in a top-down fashion (i.e., from the root to the leaves of the tree) by successive “optimal” binary partitioning, based on graph theory27 or the geometric space-partitioning principle,28 has also been introduced.
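
As an illustration, the following Python sketch applies K-means, which minimizes within-cluster scatter, to hypothetical expression profiles; the number of clusters is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
expr = rng.normal(size=(200, 10))  # hypothetical gene expression profiles

# K-means partitions genes into k clusters by minimizing the sum of
# squared distances from each gene to its cluster centroid.
km = KMeans(n_clusters=6, n_init=10, random_state=2).fit(expr)
print(np.bincount(km.labels_))  # cluster sizes
```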

The “optimal” partitioning problem (i.e., finding the best clustering) is fundamentally NP-hard and can be viewed as an optimization problem. Meta-heuristic algorithms, such as simulated annealing, genetic algorithms,29 and model-based search,30 can be applied to attain a better understanding of the complex data structure of genomic-scale expression profiles. The reliability and quality measures of clusters, as well as multilevel visualization for evaluating clustering solutions, should also be addressed.31,32
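
As one example of such a quality measure, the silhouette coefficient (which the cited references do not necessarily use), the Python sketch below scores K-means solutions for a range of cluster counts on hypothetical data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
expr = rng.normal(size=(200, 10))  # hypothetical expression profiles

# The silhouette coefficient compares within-cluster cohesion to
# between-cluster separation; higher values suggest better clusterings.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(expr)
    print(k, round(silhouette_score(expr, labels), 3))
```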

Integrative biochip informatics

Exploratory data analysis, such as clustering, is appropriate when there is no a priori knowledge about the area of research. Such techniques are known as unsupervised machine learning in the artificial intelligence community. With increasing knowledge of complex biological systems, supervised machine-learning techniques (or classification algorithms) are increasingly being introduced into functional genomics, with significant success.33,34
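
A minimal Python sketch of supervised classification on hypothetical labeled expression profiles; the k-nearest-neighbors classifier and the cross-validation setup are illustrative choices, not the methods of the cited studies:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: 40 tissue samples x 500 genes, two known classes
# (e.g., tumor vs. normal). Real studies use measured profiles and
# validated class labels.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 500))
y = np.repeat([0, 1], 20)

# A supervised classifier learns from labeled samples; cross-validation
# estimates how well the learned rule generalizes to unseen samples.
clf = KNeighborsClassifier(n_neighbors=3)
print(cross_val_score(clf, X, y, cv=5).mean())
```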

In addition to clustering and classifying expression profiles (i.e., unsupervised and supervised machine learning), systematic integration and streamlining of appropriate informatics technologies can greatly enhance the productivity of functional genomics research. For example, PubGene35 links gene expression profiles to the biomedical literature by combining gene ontology and text-mining techniques applied to the PubMed database (http://www.pubgene.org). A variety of meta-databases36 and natural language processing techniques37 are being applied to extract biomolecular interaction networks from the biomedical literature and factual databases. Linking this information to genetic regulatory networks and metabolic pathway resources such as KEGG is an area of vigorous research. Structural sequence information can also be used to greatly enhance functional understanding.38,39

At the Harvard Medical School–affiliated Children's Hospital in Boston, we have also developed automatic annotation machines for each microarray probe by integrating many publicly available bioinformatics databases. An automated inference engine that predicts the functional annotation of genes works together with the streamlined biochip informatics technologies described above, including basic data analysis, functional clustering, and supervised classification algorithms. The management of the integrated databases, as well as of the intelligent modules, is becoming increasingly important and challenging. We are currently integrating these biochip informatics technologies into the advanced clinical information systems at Children's Hospital.

Biomedical informatics: The emergence of new medicine

Large areas of medical research and biotechnological development will be permanently transformed by the evolution of high-throughput techniques and informatics. Biochip technology is one of the bioinformatics innovations most readily applicable to biomedical research and clinical medicine. It has been demonstrated that certain types of cancer can be classified by large-scale gene expression profiling.40 The capability of discovering new disease classes, as well as predicting prognosis, has also been demonstrated.41 Drug discovery is being transformed by developments in molecular cell biology and bioinformatics.42

The spectacular capability of biochip technology to aid clinical medicine is no surprise, considering that the technology essentially performs tens of thousands of molecular marker studies simultaneously, with comprehensive sets of the most biologically informative molecules, genes and proteins, in a very systematic and quantitative fashion. In doing so, biochip technology uncovers the molecular basis of histopathological processes, the fundamentals of modern diagnostics.

Bioinformatics will not replace experiments, but miniaturization and automation of laboratory processes can streamline and accelerate the discovery process to an extraordinary degree. Integrating quality clinical information is crucial for achieving real improvements in clinical diagnostics, therapeutics, and prognostics. Thus bioinformatics is not merely a tool to assist the discovery process; it becomes an integral part of discovery, and in this way it will permanently transform the structure of our biomedical knowledge bases.

The weaving of the horizontally integrated “omic” revolution of all biological building blocks (genome, transcriptome, proteome, metabolome, and biome) with the vertical integration of biomedical informatics [molecular bioinformatics, computational cell biology,43 computational physiology (neuroinformatics),44,45 digital anatomy (structural informatics),46 chemoinformatics,47,48 clinical informatics,49 and public health informatics50] has come of age. The new medicine will be both molecularly informed and informatically empowered.