Introduction

Conventional diagnosis of cancer relies on macro- and microscopic histology and tumor morphology. This methodology is somewhat subjective and depends on highly trained pathologists. Furthermore, cancer morphology spans a wide spectrum, and many tumors are atypical or lack morphologic features that are useful for differential diagnosis.1 Recent years have witnessed an increasing interest in shifting the basis of tumor classification from morphology to molecular genetics. The rapid development of microarray technologies, which can simultaneously assess the expression levels of thousands of genes, offers the promise of precise, objective and systematic human cancer classification through molecular diagnosis. Many techniques have been used to analyze gene expression data and have demonstrated the potential power of expression profiling for tumor classification (for a review, see Simon et al2).

Independent component analysis (ICA) is a dimension reduction technique that exploits the existence of independent factors in multivariate data and decomposes an input data set into statistically independent components. ICA can reduce the effects of noise or artifacts in a signal and is well suited to separating mixed signals.3 ICA has been applied successfully to electroencephalographic (EEG), magnetoencephalographic (MEG) and functional magnetic resonance imaging (fMRI) data.4, 5, 6 Recently, Liebermeister7 used ICA for microarray analysis to extract expression modes of genes. Lee and Batzoglou8 conducted a systematic analysis of the applicability of ICA to microarray data. Moreover, a recent report9 indicated that ICA could improve the biological validity of the genes identified as differentially expressed in endometrial carcinoma, compared to other techniques such as Cyber-T (a Bayesian framework for the analysis of microarray expression data using t-test, http://visitor.ics.uci.edu/genex/cybert/index.shtml) and significance analysis of microarrays (SAM, http://www-stat.stanford.edu/~tibs/SAM/). In this study, we developed an ICA-based algorithm for classifying tissues on the basis of gene expression data. Unlike previous methods, our method identifies not only a set of biomarkers but also a set of specific diagnostic patterns of normal and tumor samples corresponding to these biomarkers. Using this method, we analyzed colon and prostate cancer data and demonstrated that it outperformed previously reported methods.

Data and methods

Microarray data sets

The gene expression data sets from colon and prostate cancers were investigated in this study. For colon cancer, the data set is the expression profiles of 2000 genes using Affymetrix Hum6000 arrays in 22 normal and 40 colon cancer tissue samples10 (the normalized data set can be downloaded at http://microarray.princeton.edu/oncology/affydata/index.html). For prostate cancer, three data sets were used, which are the expression profiles of 12 600 genes using Affymetrix U95Av2 arrays. The first data set consists of 50 normal and 52 prostate cancer tissue samples; the second data set includes 10 nonrecurrent and eight recurrent prostate cancer tissue samples (Department of Adult Oncology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA).11 The third data set is an independent data set of nine normal and 25 tumor prostate samples from another laboratory (Genomics Institute of the Novartis Research Foundation, San Diego, CA 92121, USA).12 The normalized data sets can be downloaded at http://www-genome.wi.mit.edu/MPR/Prostate.

Mathematical framework of ICA

Given a microarray data set $X=(x_{ij})_{m \times n}=(x_1, \ldots, x_m)^T$ ($T$ denotes transpose) with $m$ rows of genes and $n$ columns of samples (ie $n$ different experimental conditions), each element $x_{ij}$ of the matrix $X$ corresponds to the expression level of the $i$th gene in the $j$th sample. Suppose that the expression of the $m$ genes is governed by $k$ independent biological processes, such as ribosome biogenesis, cell cycle, etc, denoted $S=(s_1, \ldots, s_k)^T$ ($k \leq m$). We assume that the expression of each gene $x_i$ ($i=1, \ldots, m$) is a linear combination of the $k$ independent biological processes $s_j$ ($j=1, \ldots, k$) with some unknown mixing coefficients $a_{ij}$: $x_i=\sum_j a_{ij}s_j$, written in matrix form as

$$X = AS$$

or

$$\begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mk} \end{pmatrix} \begin{pmatrix} s_1 \\ \vdots \\ s_k \end{pmatrix}$$

$A$ is called the mixing matrix and $S$ the source signals. The goal of ICA is to find a matrix $W$ that satisfies the transformation equation

$$Y = WX$$

$W$ is called the separating matrix, and $Y=(y_1, \ldots, y_k)^T$ collects the independent components, which are statistically independent of one another. Generally speaking, $Y$ is a close approximation of the source signals $S$; if $W=A^{-1}$, perfect reconstruction $Y=S$ is achieved. To find such a matrix $W$, an important assumption is that at most one source signal has a Gaussian distribution. This is not a problem for analyzing biological data, because the most typical Gaussian source is random noise, whereas biological processes are expected to be highly nonrandom, that is, non-Gaussian; for example, in the regulation of gene expression, a set of relevant genes is sharply affected while most other genes are relatively unaffected.8
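The relation between the mixing model and the separating matrix can be illustrated with a small numerical sketch. The source signals and the 2 × 2 mixing matrix below are invented for illustration only; in a real analysis A is unknown and W has to be estimated by an ICA algorithm rather than computed by inverting A.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(A):
    """Invert a 2 x 2 matrix (enough for this toy example)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical source signals: k = 2 processes (rows) over n = 4 samples (columns).
S = [[1.0, -1.0, 2.0, 0.0],
     [0.5, 0.5, -1.0, 3.0]]

# Hypothetical mixing matrix (m = 2 genes, k = 2 processes).
A = [[2.0, 1.0],
     [1.0, 3.0]]

X = matmul(A, S)      # observed expression matrix, X = AS
W = inverse_2x2(A)    # ideal separating matrix, W = A^{-1}
Y = matmul(W, X)      # recovered components, Y = WX

max_error = max(abs(y - s) for ry, rs in zip(Y, S) for y, s in zip(ry, rs))
print(max_error)      # close to zero: Y reproduces S up to rounding
```

In this idealized case the recovered Y matches S up to floating-point rounding, which is the perfect-reconstruction situation described above.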

ICA-based diagnosis algorithm

Our statistical algorithm is a combination of the following sequential steps.

Data preprocessing

Prior to further analysis, log2 transformation was performed on the colon data. Because the prostate data include too many genes (12 600), we first removed those genes whose expression level was less than 2, retaining the most significant 1662 genes, and then performed log2 transformation.
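A minimal sketch of this preprocessing step is given below. The text does not say whether the threshold of 2 applies to the minimum, the mean or some other summary of a gene's expression, so the sketch assumes that a gene is dropped when any of its values falls below the threshold; the function name `preprocess` and the dictionary layout are likewise illustrative, not the authors' implementation.

```python
import math

def preprocess(expr, min_level=None):
    """Filter genes and apply a log2 transformation.

    expr maps a gene identifier to its list of raw (positive) expression values.
    ASSUMPTION: a gene is removed if any value is below min_level; the source
    does not state the exact filtering rule.
    """
    kept = {}
    for gene, values in expr.items():
        if min_level is not None and min(values) < min_level:
            continue  # drop low-expression gene
        kept[gene] = [math.log2(v) for v in values]
    return kept

# Toy data: gene "g2" has a value below 2 and is filtered out.
toy = {"g1": [4.0, 8.0], "g2": [1.0, 16.0]}
prostate_like = preprocess(toy, min_level=2)   # filter, then log2
colon_like = preprocess(toy)                   # log2 only, no filter
```

With `min_level=None` only the log2 transformation is applied, which corresponds to the treatment described for the colon data.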

Sampling

For diagnostic purposes, 50% of the samples are randomly selected from normal and tumor samples as the training data set and the remaining data constitute the test data set.
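The sampling step can be sketched as follows, assuming that the 50% split is drawn separately within the normal and the tumor group and that odd group sizes are rounded down for the training set; neither detail is stated in the text. The function names and the seed are illustrative.

```python
import random

def split_half(sample_ids, rng):
    """Shuffle the identifiers and split them into two halves."""
    ids = list(sample_ids)
    rng.shuffle(ids)
    half = len(ids) // 2          # ASSUMPTION: round down for odd sizes
    return ids[:half], ids[half:]

def stratified_split(normal_ids, tumor_ids, seed=0):
    """Split each class separately into training and test halves."""
    rng = random.Random(seed)
    normal_train, normal_test = split_half(normal_ids, rng)
    tumor_train, tumor_test = split_half(tumor_ids, rng)
    return {
        "train": {"normal": normal_train, "tumor": tumor_train},
        "test": {"normal": normal_test, "tumor": tumor_test},
    }

# Group sizes of the colon data set: 22 normal and 40 tumor samples.
split = stratified_split(range(22), range(40))
```

For the colon data this gives 11 normal and 20 tumor samples in each of the training and test sets.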

Extraction of independent components

The FastICA program (http://www.cis.hut.fi/projects/ica/fastica/) was used here. If Xnormal represents the normal training data set, we performed ICA on the transpose of data matrix Xnormal and extracted one independent component ICnormal with dimension m × 1 (corresponding to the largest one and accounting for 70% of the variance). Similarly, if Xtumor represents the tumor training data set, we performed ICA on the transpose of data matrix Xtumor and extracted one independent component ICtumor with dimension m × 1 (corresponding to the largest one and accounting for 70% of variance). Here, after the experimental noise was reduced by ICA, we expected ICnormal and ICtumor to represent the characteristic expression profile of all genes in normal samples and tumor samples, respectively, for extraction of two diagnostic patterns. The MATLAB code using the FastICA program was as follows:

Biomarkers selection/test data set validation

We first calculate the ratios using independent component values (loads):

Then, we performed biomarker selection. Specifically, the selection of a subset of biomarkers starts from the pair of genes with the smallest and the largest R values, for example, genes i and j; the corresponding loads in the two independent components ICnormal and ICtumor constitute two discriminant vectors:

$$V_{\mathrm{normal}} = \big(IC_{\mathrm{normal}}(i),\ IC_{\mathrm{normal}}(j)\big), \qquad V_{\mathrm{tumor}} = \big(IC_{\mathrm{tumor}}(i),\ IC_{\mathrm{tumor}}(j)\big)$$

Subsequently, the two vectors are used for discriminant analysis: taking a sample t from the test data set (ie the remaining 50% of the data), the expression intensities of the above two genes constitute a test vector:

$$V_{\mathrm{test}} = (x_{it},\ x_{jt})$$

We then calculate its Euclidean distance to each of the two discriminant vectors obtained above. If the distance between Vtest and Vnormal is smaller than the distance between Vtest and Vtumor, the sample is classified as a normal sample; otherwise it is classified as a tumor sample. After all samples from the test data set had been evaluated in this way, we checked whether the classification rate based on the two genes reached the user-predefined classification accuracy (eg 90–100%). If so, both genes were selected as biomarkers; if not, we added one more pair of genes, those having the second smallest and the second largest R values, and checked in the same way whether the four genes reached the user-defined classification accuracy. If so, the four genes were selected as biomarkers; if not, the procedure was repeated until we obtained a set of genes achieving the required classification performance (90–100%), which were then selected as biomarkers.
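The selection loop described above can be sketched as follows. This is only an illustration under stated assumptions: the R values are taken as precomputed input, the loads and samples are stored as dictionaries keyed by gene index, and the function names and the default target of 0.9 are hypothetical choices rather than the authors' code.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(sample, genes, ic_normal, ic_tumor):
    """Assign a sample to the nearer of the two discriminant vectors."""
    v_test = [sample[g] for g in genes]
    v_normal = [ic_normal[g] for g in genes]
    v_tumor = [ic_tumor[g] for g in genes]
    if euclidean(v_test, v_normal) < euclidean(v_test, v_tumor):
        return "normal"
    return "tumor"

def select_biomarkers(ratios, ic_normal, ic_tumor, test_set, target=0.9):
    """Add gene pairs from both ends of the R ranking until the target is met.

    ratios: gene -> precomputed R value (the ratio formula is not assumed here).
    test_set: list of (sample, label) pairs, each sample a dict gene -> value.
    """
    ranked = sorted(ratios, key=ratios.get)
    genes = []
    for k in range(len(ranked) // 2):
        genes += [ranked[k], ranked[-1 - k]]   # k-th smallest and k-th largest R
        correct = sum(classify(s, genes, ic_normal, ic_tumor) == y
                      for s, y in test_set)
        accuracy = correct / len(test_set)
        if accuracy >= target:
            return genes, accuracy
    return None, 0.0                           # no gene set reached the target
```

Each pass adds two genes, so candidate sets grow as 2, 4, 6, ... genes, mirroring the pairwise procedure in the text.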

Leave-one-out crossvalidation

In order to obtain an unbiased estimate of the error rate associated with the method, a commonly used statistical approach,2 leave-one-out (or jackknife) crossvalidation, was employed. This method involves withholding one of the samples analyzed (drawn from the training and test data sets combined), building a predictor based only on the remaining samples, and then predicting the class of the sample left out. The process is repeated for each sample, and the cumulative error rate is calculated. If the final cumulative error rate was <10%, the leave-one-out crossvalidation was considered complete; otherwise, we repeated the above biomarker selection process to obtain another subset of biomarkers and performed crossvalidation again, until the error rate was less than 10%.
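The leave-one-out procedure can be sketched generically as below. The predictor used here is a simple one-dimensional nearest-mean rule that stands in for the ICA-based predictor purely for illustration; the function names and toy data are hypothetical.

```python
def leave_one_out_error(samples, labels, build_predictor):
    """Withhold each sample once, train on the rest, and count the errors."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predict = build_predictor(train_x, train_y)
        if predict(samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)

def nearest_mean_predictor(train_x, train_y):
    """Stand-in predictor (NOT the ICA method): nearest class mean in 1-D."""
    means = {}
    for label in set(train_y):
        values = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

# Toy one-dimensional data with two well-separated classes.
samples = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
labels = ["normal"] * 3 + ["tumor"] * 3
error_rate = leave_one_out_error(samples, labels, nearest_mean_predictor)
```

The cumulative error rate returned by such a loop is the quantity compared against the 10% threshold described above.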

Diagnostic pattern

After the above steps were completed, we obtained a final set of biomarkers that meet our two requirements: (a) the classification accuracy is >90% for the test data set; (b) the error rate is <10% in leave-one-out crossvalidation. The loads of these biomarkers in the original independent component obtained from normal (respectively, tumor) samples ICnormal (ICtumor) constituted the diagnostic pattern for normal (respectively, tumor) tissues.

Diagnosis

Given a sample from the test data set or another independent data set, we simply calculate its Euclidean distance to each of the two diagnostic patterns obtained above and compare the two distances. If the sample under investigation is closer to the normal pattern than to the tumor pattern, it is diagnosed as a normal sample; otherwise it is diagnosed as a tumor sample.

A schematic illustration of the ICA-based diagnosis procedure is shown in Figure 1.

Figure 1. Scheme for ICA-based diagnostic method (see text for details).

Results

We first applied our algorithm to the colon microarray data and obtained three diagnostic models (Table 1). The correct prediction rate ranged from 90 to 100%, and 100% prediction accuracy could be achieved using 10 genes. Among these, five are overexpressed in normal tissues: 1843 (Gelsolin), 1423 (Myosin regulatory light chain 2), 897 (Complement factor D), 1387 (Phosphoenolpyruvate carboxykinase) and 1635 (Vasoactive intestinal peptide (VIP)); the remaining five genes are upregulated in tumor tissue samples: 1810 (Pancreatic stone protein (PSP)), 1473 (Translational initiation factor 2), 1771 (Aspartyl-tRNA synthetase alpha-2), 1935 (Wee1 hu gene) and 1671 (Human monocyte-derived neutrophil-activating protein (MONAP)).

Table 1 ICA-based diagnosis for colon cancer

Secondly, we applied our algorithm to the prostate cancer data and similarly built three diagnostic models (Table 2). Interestingly, model 3 achieves higher prediction accuracy than model 2 despite using fewer genes. As few as four genes were sufficient to diagnose the test samples with 100% accuracy. These four genes included two downregulated genes in tumor samples (9850 (Human adipsin/complement factor D) and 6715 (an unknown gene from human melanocyte)) and two upregulated genes in tumor samples (10875 (Human prostate carcinoma tumor antigen (PCTA-1)) and 6185 (Serine protease hepsin)).

Table 2 ICA-based diagnosis for prostate cancer

To test the usability of our algorithm, we next applied the three prostate cancer diagnostic models to an independent data set of nine normal and 25 tumor prostate samples from the Genomics Institute of the Novartis Research Foundation, San Diego, CA 92121, USA.12 There was a nearly 10-fold difference in overall microarray intensity between this data set and the previous data set. The number of genes also differed slightly, although both studies used the same Affymetrix Hum95Av2 array:11 12 600 genes in the previous data set and 12 626 genes in the present data set; the 26 extra genes were therefore removed (see Supplementary Information). As an illustration, the diagnostic process using model 2 for prostate cancer (Table 2) was performed as follows:

(a) Given a sample from the independent data set of nine normal and 25 tumor prostate samples, for example, No. 4 normal sample, we included the expression values of only six genes in model 2 (9850, 11052, 4525, 5398, 4483, 6185) (Table 2) as a test vector:

(b) Calculating its Euclidean distances to two diagnostic expression patterns on the six genes of interest (Table 2)

we obtained the distance between Vtest and Vnormal d1=336 and the distance between Vtest and Vtumor d2=377.

(c) Obviously, d2>d1, so we can diagnose the No. 4 sample as a normal sample, as expected.

The diagnostic results using three models on the independent data set are summarized in Table 2 (the diagnostic data are available in Supplementary Information). The results reveal that our three models also perform well with high accuracy (88.2% for model 1, 97.1% for model 2 and 91.2% for model 3).

Finally, we applied the present three diagnostic models to diagnose 10 nonrecurrent and eight recurrent prostate cancer samples. The result demonstrates that our three models also have very good performance: 88.9% for model 1, 83.3% for model 2 and 88.9% for model 3 (the diagnostic data are available in Supplementary Information). Thus, our diagnostic model is capable of diagnosing a tumor from unknown prostate samples including nonrecurrent or recurrent prostate cancer with excellent accuracy.

However, it remains possible that these apparently good models were obtained from the data set by chance. We therefore asked what the probability was of obtaining the diagnostic models presented above by chance alone. To explore this issue, we ran the program 1000 times on 1000 permutations of the normal and tumor sample labels. For each permutation, we tried to find a diagnostic model using the same number of genes (4–6 genes for prostate cancer and 4–10 genes for colon cancer). In this control study, 21 and 16 of the 1000 permutations generated models with >90% crossvalidation accuracy for prostate and colon cancers, respectively. The probability of obtaining the above diagnostic models by chance alone is therefore of the order of 1.6–2.1%.
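The permutation control can be sketched as below. The scoring callback `score_with_labels` is a hypothetical placeholder for the whole model-building and crossvalidation pipeline, which is not reproduced here; only the label shuffling and the counting are shown, together with the arithmetic behind the reported percentages.

```python
import random

def permutation_probability(labels, score_with_labels,
                            threshold=0.9, n_perm=1000, seed=0):
    """Fraction of label permutations whose score exceeds the threshold.

    score_with_labels: hypothetical callback that rebuilds a model on the
    permuted labels and returns its crossvalidation accuracy.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)              # break the sample/label association
        if score_with_labels(shuffled) > threshold:
            hits += 1
    return hits / n_perm

# Counts reported in the text: 21 (prostate) and 16 (colon) of 1000 permutations.
p_prostate = 21 / 1000   # 2.1%
p_colon = 16 / 1000      # 1.6%
```

The returned fraction plays the role of an empirical probability that a model of comparable accuracy arises from randomly labeled data.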

Discussion

The very large amount of gene expression information and noisy data provided by microarray technology leads to difficulties in both basic research and clinical applications. ICA may be an ideal technique for reducing the dimension of data and for separating the experimental noise from expression data.8 The biomarkers selected by our ICA-based method should be examined for their roles in disease etiology. A total of 12 and nine different genes were identified for colon cancer (Table 1) and prostate cancer (Table 2), respectively. Among these genes, the majority of colon cancer genes were identified and discussed in previous studies,13, 14, 15 except two genes, Integrin Alpha 6 and Wee1 Hu genes, which are newly identified by the present ICA study. In contrast, for prostate cancer, only one gene, Hepsin, was identified and discussed by Welsh et al;12 the remaining eight genes are newly identified by the present ICA study: Adipsin/complement factor D, KET protein, Insulin-Like growth factor 2, Elongation factor 1 alpha-2, PCTA-1, Human beta-migrating plasminogen activator inhibitor I, Alpha-methylacyl-CoA racemase (AMACR) and an unknown gene from human melanocyte. These genes have been implicated in human cancers (see Supplementary Information).

It should be emphasized that the genes identified by ICA are not necessarily genes with significantly differential expression between normal and tumor samples. For example, PCTA-1 has already been investigated as a possible cause of predisposition to prostate cancer, but without analysis of its regulation,16 and its average expression level in tumor samples is only about 1.5-fold its average expression level in normal samples. Such a gene would be eliminated by the conventional two-fold selection criterion. ICA nevertheless identified this gene as a biomarker, and model 3, which includes it, achieves 100% classification accuracy for prostate cancer (Table 2). In fact, PCTA-1 encodes a member of the galectin family. The galectins have been implicated in many essential functions including development, differentiation, cell–cell adhesion, cell–matrix interaction, growth regulation, apoptosis and RNA splicing. PCTA-1 has been considered a surface marker associated with prostate cancer.17

Here, we obtained three models each for colon and prostate cancer. However, other models reached similar classification accuracy; as Xiong et al18 indicated, the optimal or near-optimal sets of genes for classifying tumor and normal tissues are not unique. This simply reflects the complexity of microarray data. We expected, as seen in Table 1, that the more genes were included in the model, the better the classification accuracy would be. Table 2, however, shows a different situation, in which a model with fewer genes achieved better accuracy (model 3 vs model 2). This prompted us to make further tests: interestingly, we found that classification accuracy does not increase systematically as more genes are included in the model. Similar results were found in previous reports.14, 19 These observations indicate that there is no simple correlation between the number of genes and the classification accuracy. This can be accounted for by the fact that not all genes are equally relevant to the discrimination between normal and tumor samples, while the inclusion of more genes introduces more high-dimensional noise and hence decreases the classification accuracy. Indeed, it is a well-known statistical property that too many feature variables (genes) can harm a discriminator's performance.20 We should therefore favor methods that extract as few genes as possible as biomarkers. The present work is an attempt in this direction.

Up to 30% of prostate cancer patients undergoing radical prostatectomy will relapse. The challenge is therefore to identify patients at risk of relapse so that relevant targeted therapeutics can be focused on that particular subpopulation. In the present study, we used ICA to identify a few new biomarkers of the disease, and using their expression patterns we could achieve good diagnosis for the nonrecurrent/recurrent data set. This should be helpful for finding robust prognostic markers that are capable of identifying patients at risk of relapse following local therapy.

Finally, we compared our method with other techniques applied to the same colon cancer data set. Table 3 shows that the present ICA-based algorithm offers a significant advantage over the other methods, which classify samples into two subgroups (tumor vs normal) but do not provide a specific expression pattern of a few genes for diagnosis. In contrast, our method not only achieves good classification accuracy but, because it uses ICA, also explicitly extracts a specific expression pattern, thus allowing diagnosis of any blind sample through a small microarray of a few genes. As seen above, for prostate cancer, we used three different diagnostic expression patterns, obtained from the first laboratory data set, to directly diagnose an independent data set from another laboratory. The performance of the diagnosis was remarkably consistent. However, more data from different laboratories need to be explored to validate these diagnostic models. In conclusion, diagnostic patterns should be derived from as many clinical samples as possible using our ICA-based method, in order to provide quantitative diagnostic information for an individual cancer patient in the clinic.

Table 3 Comparison of ICA-based algorithm with other techniques used on the same colon cancer data set