Gene Profiling and Microarrays

Purity for clarity: the need for purification of tumor cells in DNA microarray studies


It is now well established that gene expression profiling using DNA microarrays can provide novel information about various types of hematological malignancies, which may lead to identification of novel diagnostic markers. However, to successfully use microarrays for this purpose, the quality and reproducibility of the procedure need to be guaranteed. The quality of microarray analyses may be severely reduced if variable frequencies of nontarget cells are present in the starting material. To systematically investigate the influence of different types of impurity, we determined gene expression profiles of leukemic samples containing different percentages of nonleukemic leukocytes. Furthermore, we used computer simulations to study the effect of different kinds of impurity as an alternative to conducting hundreds of microarray experiments on samples with various levels of purity.

As expected, the percentage of erroneously identified genes rose with the increase of contaminating nontarget cells in the samples. The simulations demonstrated that a tumor load of less than 75% can lead to up to 25% erroneously identified genes. A tumor load of at least 90% leads to identification of at most 5% false-positive genes. We therefore propose that in order to draw well-founded conclusions, the percentage of target cells in microarray experiment samples should be at least 90%.


The recent introduction of microarrays has revolutionized the biological research field.1, 2, 3 Whereas previously many studies focused on the expression of one gene in one specific cell type, the introduction of microarray technology has given researchers the opportunity to study the expression of thousands of different genes (ie the gene expression profile) in various cell types within a short period of time.4, 5, 6, 7 Accordingly, microarrays can be used to distinguish different cell types based on their gene expression profile. In the field of oncology, the use of microarrays has led to new insights about etiology and pathophysiology of malignancies.6, 8, 9 For various types of cancer, this has led to the proposition of novel subtypes, which might have implications for prognosis and treatment.10, 11, 12 As a consequence, gene expression profiling using microarray analyses is increasingly seen as a possible diagnostic tool.13, 14, 15 In the field of hematological malignancies, several seminal papers have demonstrated that the use of microarrays indeed may provide novel information at diagnosis.16, 17 This type of information may lead to the development of novel diagnostic tools in the future.

One can envision two different ways in which microarray data can be used to identify novel markers with possible diagnostic relevance. First, comparisons between malignant and nonmalignant sample data may result in a list of differentially expressed genes, which are potential cancer-associated markers. These may then be used as diagnostically relevant markers. Second, the microarray data as a whole may be used directly to classify tissue samples as malignant or nonmalignant. In both cases, successful use of microarrays for studying malignant diseases will be dependent on the quality and reproducibility of the entire microarray analysis procedure.18 This means that nonspecific variation (both biological and technical) should be minimized as much as possible. This variation can be influenced by various parameters, for example, the procurement and storage of the clinical cancer sample, the purity of the collected material, the technique used to extract RNA, the procedure to label the RNA, the type of microarray used and the interpretation of the gene expression data.19, 20 Recently, guidelines (MIAME) have been proposed to assure the reproducibility, and hence comparability, of microarray experiments. These guidelines mainly concentrate on the experimental design, RNA extraction and labeling techniques, and the data mining process.

In most cases, investigators are interested in the expression profile of the tumor per se. It is possible that other cell types interacting with the tumor (eg a T-cell infiltrate in non-Hodgkin's lymphomas) provide clinically relevant information, in which case one may decide to forgo purification. In all other cases, however, the purity of the sample used is an important aspect influencing the quality of microarray experiments. Intuitively, many investigators have recognized this as an important parameter,21, 22, 23 but very few studies have examined this problem systematically. For leukemias, our previous study on diagnosis-relapse comparison in precursor B-ALL dealt with this issue only superficially.24 Here, we present a systematic investigation of the problem of differential tumor load between two groups of clinical samples to be compared within one study, for example, malignant and nonmalignant samples.

Theoretically, sample impurity can influence measured expression differences in three ways, depending on the type of impurity present in either group:

  • Random impurity: if each sample contains some amount of nontumor cells variable in nature, genes expressed in these cells will give rise to expression levels significantly higher than background. As this random impurity is independent of the signal of the expressed genes in the pure tumor sample, random impurity only increases the variance for those genes, effectively decreasing their signal-to-noise ratio. The statistical methods used to determine significant differential expression should be designed in such a way that they are robust enough to handle this type of noise. Therefore, a priori no significant differences between the average expression levels of the groups are to be expected for genes expressed in nontumor cells. The influence of this noise may at most decrease the number of significantly differentially expressed genes found between samples. This type of impurity usually occurs in any microarray experiment for which samples are collected from identical tissue types, at identical locations in the body, under similar circumstances, and constitutes normal biological variation.

  • Fixed impurity: if samples in both groups contain the same proportion of identical nontumor cells, this may alter the expression levels of genes consistently. Since this alteration is independent of the grouping, it decreases the differences between group means of the altered genes, that is, the signal. Effectively, this decreases only their signal-to-noise ratio. Consequently, at most the percentage of true positives is decreased (leaving the false positive and negative percentages unchanged). An example of this type of impurity is a constant percentage of contaminating T-lymphocytes in precursor-B-ALL.

  • Group-specific impurity: samples in both groups may contain different types of nontumor cells, or one group may contain nontarget cells whereas the other group does not. When the type of nontumor cell is specific to one group, this may lead to significant differences in expression between groups for certain genes. This form of impurity can occur in cases where samples are collected from different tissue types, at different locations in the body or in different circumstances (eg diagnosis/relapse). An example of this type of impurity is the case of different levels of nonleukemic cells in a leukemia, for example, normal immature myeloid cells in a precursor B-ALL sample. Another example is when precursor-B-ALL samples taken from peripheral blood (with contaminating T-lymphocytes) are compared with samples taken from bone marrow (containing contaminating myeloid cells).

Of course, in practice different types of impurity will be present concurrently, and so will have a combined influence on the analysis.
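The three impurity types can be illustrated with a small numerical sketch. It follows a single hypothetical gene that is expressed only in nontumor cells and reports its group mean and standard deviation under each type of impurity. The function names, the contamination levels (0.5, or uniform between 0 and 1) and the Gaussian measurement noise are assumptions made for illustration only; they are not taken from the study.

```python
import random
import statistics


def contaminant_signal(kind, group, rng, level=1.0):
    """Signal of a nontumor-specific gene in one sample (illustrative values)."""
    if kind == "random":
        # Nontumor fraction varies from sample to sample, independent of group.
        return level * rng.random()
    if kind == "fixed":
        # Same nontumor fraction in every sample of both groups.
        return level * 0.5
    if kind == "group-specific":
        # Nontumor cells are present in group B only.
        return level * 0.5 if group == "B" else 0.0
    raise ValueError(f"unknown impurity type: {kind}")


def group_summary(kind, n_per_group=2000, seed=0):
    """Return {group: (mean, standard deviation)} for the contaminant gene."""
    rng = random.Random(seed)
    summary = {}
    for group in ("A", "B"):
        values = [
            contaminant_signal(kind, group, rng) + rng.gauss(0.0, 0.1)
            for _ in range(n_per_group)
        ]
        summary[group] = (statistics.mean(values), statistics.stdev(values))
    return summary


if __name__ == "__main__":
    for kind in ("random", "fixed", "group-specific"):
        print(kind, group_summary(kind))
```

Under these assumptions, random impurity widens the spread without separating the group means, fixed impurity shifts both groups equally, and only group-specific impurity produces a difference between group means for a gene that is not expressed in the tumor at all. The sketch does not model the dilution of true tumor signal that fixed impurity causes.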

In this article we address different aspects of purification of cell samples for microarray experiments. First we analyzed expression profiles of samples in which we introduced group-specific impurity by contaminating tumor samples with increasing amounts of normal blood leukocytes. Next, we analyzed two leukemia samples, each with and without purification, demonstrating how purification influences microarray analysis. Additionally, we performed computer simulations (in silico experiments) on a number of two-group microarray experiments using different numbers of microarrays per group and different percentages of contaminating cells. We have used computer simulations as an alternative to conducting hundreds of microarray experiments on samples with different levels of purity, as this is costly and extremely labor-intensive. Based on our results we propose guidelines for purification of cell samples in microarray experiments, that aim to focus on the malignant cells only.

Materials and methods

Sample material

Tumor material of three patients was used in this study. One patient was diagnosed with chronic myeloid leukemia (CML); the tumor load was estimated to be 100% (ie over 98%) based on cell count. The samples of two patients with CD7+ T-cell acute lymphoblastic leukemia (T-ALL) had a tumor load of 62 and 89%, respectively (Figure 2a). Mononuclear cells of all samples were obtained by Ficoll-Paque™ (Amersham Technologies) centrifugation. Additionally, peripheral blood mononuclear cells of a healthy control were used. CML cells were artificially mixed with mononuclear cells of the healthy control. The percentage of tumor cells in the different artificial mixtures was 100, 75, 50 and 25% with a total of 10 × 10^6 cells per sample.

Figure 2

Purification of the T-ALL1 sample by means of CD7 expression and MACS beads. The effects of purification on gene expression profiling. (a) FACS profiles before purification (tumor load 89%) and after purification (tumor load 98%). (b) Comparison of gene expression profile of purified and nonpurified T-ALL samples of the same patient, for T-ALL1, 89% tumor load before purification and T-ALL2, 62% tumor load before purification. Both T-ALLs were more than 95% CD7+ after purification.

MACS purification and flow cytometric analysis

Half of each of the two CD7-positive T-ALL samples was enriched to obtain tumor loads of more than 95%, by staining with CD7-PE (Becton Dickinson Biosciences, San Jose, CA, USA), subsequently incubating with anti-PE Microbeads (Miltenyi Biotech, Gladbach, Germany) according to the manufacturer's protocol and finally separating with program POSSEL on the AutoMACS (Miltenyi Biotech) according to the manufacturer's protocol. Directly after purification the samples were analyzed with CD7-PE (Becton Dickinson Biosciences), CD5-APC (Becton Dickinson Biosciences) and CD2-FITC (Coulter Corporation, Miami, FL, USA) on the flow cytometer.

RNA extraction, labeling and hybridization to microarray

Total RNA was extracted with the use of the Qiagen RNeasy kit (Qiagen, Valencia, CA, USA); purity and quality were assessed by RNA 6000 Nano assay on the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). None of the samples showed degradation (ratio of 28S ribosomal RNA to 18S ribosomal RNA of at least 2) or contamination of DNA. In total, 1 μg of RNA was used for cDNA synthesis and cRNA synthesis. In total, 15 μg cRNA was hybridized to the HG-U133A GeneChip oligonucleotide microarray (Affymetrix, Santa Clara, CA, USA) according to the manufacturer's protocol (701025 Rev 5). Staining, washing and scanning procedures were carried out as described in the GeneChip Expression Analysis technical manual (Affymetrix).

Data analysis

In order to examine the quality of the different arrays, measured intensity values were analyzed using the GeneChip Operating Software (Affymetrix). The scaling factor (<3SD), percentage of probesets present (27.2–33.1%), noise (2620–2900), background (66.57–76.54) and ratio of GAPDH 3′ to 5′ (<1.4), all indicated high quality of samples and an overall comparability. Microarray data were quantile normalised25 and background signal was removed using robust multichip analysis (RMA).26 Arrays were compared based on the perfect match (PM) probe intensity levels only,26 by performing a per-probe set two-way analysis of variance (ANOVA, with factors ‘probe’ and ‘array’). This results both in average expression levels for each probe set and P-values. The latter were adjusted for multiple testing using Šidák step-up adjustment27 and all probe sets with adjusted P-values smaller than 0.05 were considered significantly differentially expressed. All raw microarray data are available at
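The quantile normalisation step mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not the software used in the study: the function name is hypothetical, ties are broken by sort order rather than averaged, and no background correction is included.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equally long lists of intensities, one list per microarray.
    Returns new lists in which the k-th smallest value of each array is
    replaced by the mean of the k-th smallest values across all arrays.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean of the sorted intensities, rank by rank.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After normalisation every array contains the same set of values, so differences between arrays reflect only the ranking of probes within each array.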

Computer simulations

To learn about the effect of impurity on the analysis of microarray datasets, we performed a number of simulations. Expression values were simulated for n microarrays, where n ∈ {10, 20, 40, 100}. Each microarray contained p=1000 genes. The microarrays fell into two equally sized groups, A and B (eg corresponding to ‘diagnosis’ and ‘relapse’). On arrays in group A, none of the genes were highly expressed; on arrays in group B, 50 genes were highly expressed, simulating differential expression between groups A and B in the target cell type. Levels of both random and group-specific impurities were simulated by expressing 50 (random and fixed, respectively) additional genes at f=0–100% of the expression of a truly expressed gene, leading to simulated target cell type loads ranging between 100% (target cell-specific genes differentially expressed at 100%, impurity-specific genes at f=0%) and 50% (target cell-specific genes differentially expressed at 100%, impurity-specific genes at f=100%). For details of the simulation, see Appendix A.
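A rough sketch of such a simulated data set is given below. The details in Appendix A are not reproduced here, so several choices are assumptions: Gaussian noise with a hypothetical standard deviation `sd`, a baseline expression of 0, the reading that random impurity affects a different random set of 50 genes on every array, and the mapping from f to target cell load as 1/(1+f), which matches the stated endpoints (100% at f=0, 50% at f=1) but is otherwise a guess.

```python
import random


def target_load(f):
    """Assumed mapping from impurity level f (0..1) to target cell load."""
    return 1.0 / (1.0 + f)


def simulate(n, f, kind, p=1000, n_target=50, n_impurity=50,
             high=1.0, sd=0.2, seed=0):
    """Simulate n arrays of p genes in two equally sized groups, A and B."""
    rng = random.Random(seed)
    target_genes = range(n_target)                           # high in group B
    fixed_impurity = range(n_target, n_target + n_impurity)  # group-specific
    non_target = list(range(n_target, p))                    # pool for random
    arrays, labels = [], []
    for i in range(n):
        group = "A" if i < n // 2 else "B"
        mean = [0.0] * p
        if group == "B":
            for g in target_genes:
                mean[g] = high
        if kind == "group-specific":
            # The same impurity genes, expressed in group B only.
            if group == "B":
                for g in fixed_impurity:
                    mean[g] = f * high
        elif kind == "random":
            # A different random set of impurity genes on every array.
            for g in rng.sample(non_target, n_impurity):
                mean[g] = f * high
        else:
            raise ValueError(f"unknown impurity type: {kind}")
        arrays.append([rng.gauss(m, sd) for m in mean])
        labels.append(group)
    return arrays, labels


if __name__ == "__main__":
    data, groups = simulate(n=20, f=0.5, kind="group-specific")
    print(len(data), len(data[0]), groups.count("A"), target_load(0.5))
```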

A number of well-known techniques for comparing and classifying microarray data were applied to the simulated data sets: the t-test,27 significance analysis of microarrays (SAM,28, 29) and prediction analysis of microarrays (PAM,30):

  • In the t-test analysis, an individual test was applied to each gene to find differential expression, yielding a t-statistic. Its null distribution was estimated, for each gene, by randomly permuting group labels, A and B, 1000 times and calculating t-statistics for that gene. The P-value was set to the relative number of permutations for which a larger t-statistic was found than for the original group labeling. Šidák step-up adjustment was then applied to the P-values found, and all genes with an adjusted P-value < 0.05 were counted as significantly differentially expressed.

  • For SAM, thresholds Δ were tried from the range [0.01, 0.02, …, 0.10, 0.20, …, 2.50, 2.75, …, 5.00]. The Δ giving the lowest positive false discovery rate (pFDR31) was chosen, and all genes with a difference larger than Δ in SAM statistics between the two groups were called significantly differentially expressed.

  • PAM was applied with the same range for the threshold Δ as used for SAM. No per-group scaling was used. The optimal Δ was found by 10-fold cross-validation. Besides a cross-validation error, PAM gives a list of genes useful for classification. Note that this does not entail that these genes need to be significantly differentially expressed. It also does not mean that PAM uses all genes that are significantly differentially expressed.
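The permutation-based t-test in the first item can be sketched for a single gene as follows. This is an illustration under stated assumptions rather than the authors' MATLAB implementation: it uses a pooled-variance t-statistic, compares absolute values (a two-sided test), and applies the simple single-step Šidák formula 1 − (1 − P)^m as a stand-in for the stepwise adjustment named in the text. All function names are hypothetical.

```python
import math
import random


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic (group B minus group A)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    se = math.sqrt(ss / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    return (mean_b - mean_a) / se if se > 0 else 0.0


def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Fraction of label permutations giving a larger |t| than observed."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    larger = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        if abs(t_statistic(pooled[:len(a)], pooled[len(a):])) > observed:
            larger += 1
    return larger / n_perm


def sidak_single_step(p_values):
    """Single-step Sidak adjustment for m tests (simplified stand-in)."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


if __name__ == "__main__":
    group_a = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.02, 0.1, -0.12]
    group_b = [x + 3.0 for x in group_a]
    print(permutation_p_value(group_a, group_b))
```

With 1000 permutations the smallest nonzero P-value is 0.001, so after adjustment over 1000 genes only genes for which no permutation exceeded the observed statistic can reach significance. This is one way to see why small sample sizes leave the t-test with little power in such a setup.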

All simulations were repeated 10 times, with mean and standard deviations indicated in figures. Simulations were performed in the MATLAB programming environment (The Mathworks, Natick, MA, USA) using software developed by the authors. This software can be downloaded from


Impurities introduced in a leukemic sample: CML mixed with normal mononuclear cells

To systematically investigate the influence of impurities in DNA microarray analyses, we used cells derived from a patient with newly diagnosed CML (target) mixed with blood mononuclear cells (nontarget) of a healthy control, introducing group-specific impurity. RNA was extracted from five samples with a tumor load of 100% (twice), 75%, 50% and 25%. We identified the expression profiles of the different samples and compared the profiles of contaminated samples to the samples with 100% tumor load.

Comparison between the two independently processed 100% CML samples (CML1 and CML2) showed no significantly differentially expressed probesets, that is, the technical variation between the two experimental procedures does not lead to erroneous findings. This is in accordance with previous experience concerning independent processing of identical samples (unpublished data from our and other laboratories).

As shown in Figure 1a, many probesets were significantly differentially expressed between the pure and contaminated samples. These could have led to false identification of genes involved in tumorigenesis in CML. For some genes the derivation from contaminating cells may be obvious (eg well-known lineage specific genes), yet many other genes are not known to be associated with nonmyeloid cell lineages and are therefore hard to assign as resulting from impurities. The full list of genes can be found in Supplementary Table 1.

Figure 1

The effect of contaminating nonleukemic cells on gene expression profiling of CML cells: (a) comparison of gene expression profile of 100% CML cells (CML1) to (a) a second 100% CML sample (CML2), (b) both CML1 and CML2 to a 75% pure CML sample, (c) both CML1 and CML2 to a 50% pure CML sample and (d) both CML1 and CML2 to a 25% pure CML sample. Plotted are average log2 expression level (A) and the difference in log2 expression level (M) for each gene. Colors indicate P-values for differential expression; circled genes are significantly differentially expressed. (b) The level of expression (arbitrary units) of identified genes generally increases with increasing degrees of impurity.

When the tumor load was lowered from 100% to 75%, 31 probesets were identified as significantly differentially expressed. Among these were erythrocyte specific genes (eg hemoglobins), genes encoding for growth factors (eg granulin) and genes involved in cell signaling (IL-8, monocyte chemotactic protein).

A tumor load of 50% showed 84 significantly differentially expressed probesets as compared to 100% CML. Genes up- or downregulated in this comparison contained genes encoding for B-cell specific proteins (CD83, CD48), NK-cell-specific proteins (killer cell lectin-like receptor) and MHC class II.

If the tumor cells only accounted for 25%, 134 probesets were up- or downregulated as compared to the 100% tumor sample. Genes found in this analysis included B-cell specific genes (CD83, CD48, immunoglobulins), erythrocyte-specific genes (hemoglobin, ferritin), genes specific for dendritic cells (dendritic cell-associated C-type lectin-1), NK-cell-specific genes (killer cell lectin-like receptor subfamily B) and genes encoding for cell signaling (small inducible cytokine A7, IL-8, monocyte chemotactic protein 3).

The sets of probesets found to be differentially expressed in the various comparisons overlap to a large degree (see Supplementary Figure 1): 24 of the 31 and 52 of the 84 probesets can be found in the 134. The lack of full overlap is caused by random fluctuations: some probesets that were only just significantly differentially expressed at low impurity levels were missed at higher impurity levels.

With increasing numbers of nontarget cells, the level of expression of the vast majority (108 of the 134) of the significantly differentially expressed probesets increases monotonically from CML (100%) to CML (25%) (Figure 1b). This clearly indicates that the identified probesets correspond to the impurity introduced.
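A monotonic rise of this kind is what a simple linear mixing model would predict for a gene expressed mainly in the nontarget cells. The sketch below assumes that the measured level of a gene in a mixture is the cell-fraction-weighted average of its levels in pure tumor and pure nontarget cells; this model and the example levels (100 and 1000 arbitrary units) are illustrative assumptions and were not fitted to the data. It also computes the A (average log2 level) and M (log2 difference) values of the kind plotted in Figure 1a.

```python
import math


def mixture_level(tumor_level, nontarget_level, tumor_load):
    """Expected expression in a mixture under an assumed linear mixing model."""
    if not 0.0 <= tumor_load <= 1.0:
        raise ValueError("tumor_load must lie between 0 and 1")
    return tumor_load * tumor_level + (1.0 - tumor_load) * nontarget_level


def ma_values(level_x, level_y):
    """A: average log2 expression; M: log2 difference (y relative to x)."""
    a = 0.5 * (math.log2(level_x) + math.log2(level_y))
    m = math.log2(level_y) - math.log2(level_x)
    return a, m


if __name__ == "__main__":
    pure = 100.0        # hypothetical level in 100% tumor cells
    nontarget = 1000.0  # hypothetical level in contaminating cells
    for load in (1.0, 0.75, 0.5, 0.25):
        mixed = mixture_level(pure, nontarget, load)
        a, m = ma_values(pure, mixed)
        print(f"tumor load {load:.0%}: A={a:.2f}, M={m:.2f}")
```

Under this model M is zero for the pure sample and grows steadily as the tumor load falls, which is the pattern expected for impurity-derived probesets.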

Naturally occurring impurities in leukemia: purified vs nonpurified T-ALL

To further investigate the influence of purification on expression profiles, we compared two T-ALL samples before purification and after purification (Figure 2a). For both individual T-ALLs, the gene expression profiles of the purified and nonpurified samples were compared. This resulted in a large number of probesets that differed significantly (Figure 2b). In T-ALL1 108 probesets (among which B-cell specific genes) were significantly differentially expressed between purified (more than 95%) and nonpurified (89%) tumor cells. In T-ALL2 the number of significantly differentially expressed probesets between the purified (more than 95%) and nonpurified (62%) samples was 178. Potentially, these could incorrectly be labeled as important in the tumorigenesis of T-ALL.

We realized that the use of antibodies could potentially activate the cells of interest, leading to an altered gene expression profile. In our hands, negative selection usually does not lead to the level of purity required (>90%, as we show later on in this report) to make appropriate statements on individual samples. Therefore, positive selection needs to be used. In order to avoid activation problems, the following precautions were taken: essentially all purification steps were done cold (4°C), which diminishes activation of cellular signaling pathways; antibodies against strongly activating antigens (such as CD2, CD3, TCR, BCR) were not used; and the samples to be compared (eg diagnosis vs relapse) were purified with the same antibody at the same time in parallel experiments. Together, these precautions should largely diminish effects of the antibody used on gene expression profiles, and if such effects occur, they are not expected to be detected as differential effects.

Computer simulations

Statistical analysis plays an important role in analysis of microarray data. Various types of algorithms and statistical tools are used to analyze the data. Here we employed three commonly used tools: (1) Student's t-test, which assumes a normal distribution of the data, where one sets a P-value as threshold for identification of statistically relevant genes. (2) SAM (significance analysis of microarrays), a gene selection method that identifies significantly differentially expressed genes based on permutation tests. (3) PAM (prediction analysis of microarrays), a classification tool with built-in predictive gene selection. In order to investigate the issue of impurity from a theoretical point of view, we simulated a number of microarray experiments in silico, considering both random impurities as seen in the processing of identical samples and group-specific impurities as seen in the comparison of malignancies with different fractions of tumor cells.

Computer simulations: random impurities

Figure 3 depicts simulation results for random impurities. It is clear that sample size n played a large role. For n=10, gene selection failed. The t-test selected no genes, whereas SAM and PAM selected a small number of genes, of which a large percentage consisted of false positives. For n=100, gene selection worked well. The t-test and SAM both found small numbers of false positives; SAM had the advantage for n=20 and 40, but for n=100 found slightly more false positives. PAM found no false positives for n=40 and 100, but selected a smaller number of genes on which to base classification. Of course, the specific n at which the sample size effects shown here occur depends on the number of genes on the array.

Figure 3

Effect of random impurity on a number of methods for gene selection and classification. Results of the Student's t-test, SAM and PAM for gene selection (a–c) and the classification performances of PAM (d), for sample sizes n of 10, 20, 40 and 100 (left to right) are shown. The bold lines indicate average performance, the shaded areas correspond to a single standard deviation.

It is in the medium sample size area (n=20) that the effect of random impurities is seen most clearly. Gene selection on pure samples here produced a relatively small number of true positives and a small number of false positives. As hypothesized, the main effect of random impurities on gene selection (t-test, SAM) is that, for large amounts of nontarget cells, the number of truly significantly differentially expressed genes found slightly decreased; however, the number of false positives found did not increase significantly. Thus, the methods are robust to noise. The same holds for the classification results of PAM.

Computer simulations: group-specific impurities

Simulation results for group-specific impurities are displayed in Figure 4. The same sample size effects as for random impurities were found here: n=10 was too small, n=40 gave reasonable results and n=100 gave good results on pure samples. In between these, the gene selection methods (t-test, SAM) found approximately the same numbers of true positives as were found with random impurities. However, as hypothesized, the number of false positives here increased with increasing group-specific impurity. For low impurity levels (down to 90% target cell type load), the effect was still relatively small. For higher levels, the effect was quite dramatic. At a purity of 50%, for n=100 as many false positives were found as true positives: that is, all simulated false positives were identified.

Figure 4

Effect of group-specific impurity on a number of methods for gene selection and classification. Results of Student's t-test, SAM and PAM for gene selection (a–c) and the classification performances of PAM (d), for sample sizes n of 10, 20, 40 and 100 (left to right) are shown. The bold lines indicate average performance, the shaded areas correspond to a single standard deviation.

The classification performances of PAM were good, even for relatively small numbers of microarrays per group (n=20). PAM does not suffer from group-specific impurity, because it selects genes based on classification performance only, regardless of whether they stem from target or nontarget cells. However, many of the genes PAM selected were false positives. As can be seen in Figure 4, a tumor load of less than 75% can lead to up to 25% erroneously identified genes. A tumor load of more than 90% leads to the identification of less than 5% false-positive genes.


We investigated the issue of differential tumor load and impurities both from an experimental and a computer simulation point of view. In order to conduct a meaningful microarray experiment, the population of interest needs to be accurately defined.32 In oncological studies, generally the population of interest will correspond to a single cell type, although circumstances exist in which nontumor cells provide additional information about tumor characteristics and patient response. For example, the presence of T-cells surrounding breast tumors may be indicative of the level of the immunological response and hence may have prognostic value. However, in most cases, nontumor cells are considered to be impurities, which may hamper the correct interpretation of the gene expression profiles.

The most obvious way to evade impurities in a microarray experiment is purification of the starting population. In leukemia, this is relatively easy to perform, since leukemic cells are clearly distinguishable from nonmalignant cells by their immunophenotype.33, 34, 35 Previously, we purified precursor-B-ALL samples on the expression of CD19 and CD34, which lowered the number of significantly differentially expressed probesets in a comparison between diagnosis and relapse.24 Here, we have specifically focused on the influence of purification on the outcome of microarray experiments, specifically in cases where two (or more) groups are present and gene selection or classification has to be performed. Three possible types of sample impurity can influence microarray data analysis. Random impurities pose a minor problem, as the methods employed in our simulations appeared to be sufficiently robust to handle the noise these impurities introduce. Fixed, identical impurities in all groups (not simulated) may lead to a decrease in true positives, but not to finding false positives. Finally, group-specific impurities, which are a problem in specific research settings, were shown to lead to unreliable results for low purity samples in our simulations: although the number of true positives remained unchanged with increasing purity, the number of false positives dramatically decreased. Our laboratory experiments verified this: group-specific impurities indeed influenced analysis results, and this influence increased with an increase in impurity. This may lead to the incorrect identification of potential diagnostic markers.

Our findings can in principle be applied to other microarray experiments as well, for instance to solid tumors. Although purification of solid tumors is hard to perform, attempts to decrease nontarget cells have been made. The best way to isolate target cells in solid tumors or solid tissue in general is morphological identification by experienced pathologists.36 A disadvantage of morphological identification is the inter- and intraindividual variation between different pathological investigations.37, 38, 39 This can partly be corrected for by immunohistochemistry and determination of additional markers by PCR or cytogenetic techniques.40, 41, 42, 43

Another aspect in purification of tumor cells from solid tissues is the direct contact with nontarget cells. In contrast to leukemic cells in blood and bone marrow, solid tumors are well attached to their surrounding nonmalignant cells by all kinds of cellular interactions.44, 45, 46 Since tumor cells are usually dispersed throughout the tissue of interest, pure cell samples are often difficult to obtain.

In order to overcome the problem of the presence of nontarget cells in solid tissues, several groups have tried to diminish the influence of impurities by manipulating the resulting data sets.47, 48, 49 In one of these studies, cells with different gene expression profiles were used in a linear model in order to predict their influence on the overall gene expression profile.49 Another study performed by Lu et al49 mainly focused on genes already known to have a function in cell cycle. However, these approaches are not readily suitable to discover new genes in specific circumstances or when the amount and nature of contaminating cells are unknown.

Based on our findings we propose the following guidelines to reduce the influence of contaminating cells in leukemia research as much as possible and to enable good comparisons between different microarray experiments. The simulation results indicate that the tumor load in leukemic samples should exceed 90%, provided the right statistical tools are used. At lower tumor loads, purification to more than 90% is needed, but should then be done for all samples under study, to make comparisons possible. Purification is especially essential if data from individual samples are analyzed in detail. Classification of samples is less susceptible to contamination problems, but here too, better results can be obtained with high tumor load, specifically when sample sizes are low (10–20) (Figures 3d and 4d). Purification should be based on universally approved biological markers (eg CD markers) and purity should always be checked, preferably by flow cytometry. Furthermore, in order to compare separate cell samples, they should preferably be collected from comparable tissue types, at identical locations in the body, and under comparable circumstances.

These guidelines can help to conduct microarray experiments in a meaningful way and to provide the clinical and research community with reliable results.


  1. Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW. Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci USA 1996; 93: 10614–10619.

  2. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M et al. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet 1996; 14: 457–460.

  3. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995; 270: 467–470.

  4. Wurmbach E, Gonzalez-Maeso J, Yuen T, Ebersole BJ, Mastaitis JW, Mobbs CV et al. Validated genomic approach to study differentially expressed genes in complex tissues. Neurochem Res 2002; 27: 1027–1033.

  5. Smith JL, Freebern WJ, Collins I, De Siervi A, Montano I, Haggerty CM et al. Kinetic profiles of p300 occupancy in vivo predict common features of promoter structure and coactivator recruitment. Proc Natl Acad Sci USA 2004; 101: 11554–11559.

  6. Southern E, Mir K, Shchepinov M. Molecular interactions on microarrays. Nat Genet 1999; 21 (1 Suppl): 5–9.

  7. Brown PO, Botstein D. Exploring the new world of the genome with DNA microarrays. Nat Genet 1999; 21 (1 Suppl): 33–37.

  8. Janoueix-Lerosey I, Novikov E, Monteiro M, Gruel N, Schleiermacher G, Loriod B et al. Gene expression profiling of 1p35–36 genes in neuroblastoma. Oncogene 2004; 23: 5912–5922.

  9. Guipaud O, Deriano L, Salin H, Vallat L, Sabatier L, Merle-Beral H et al. B-cell chronic lymphocytic leukaemia: a polymorphic family unified by genomic features. Lancet Oncol 2003; 4: 505–514.

  10. Hoefnagel JJ, Dijkman R, Basso K, Jansen PM, Hallermann C, Willemze R et al. Distinct types of primary cutaneous large B-cell lymphoma identified by gene expression profiling. Blood, prepublished online August 12, 2004; doi 10.1182/blood-2004-04-1594.

  11. Lossos IS, Czerwinski DK, Alizadeh AA, Wechser MA, Tibshirani R, Botstein D et al. Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Engl J Med 2004; 350: 1828–1837.

  12. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000; 406: 536–540.

  13. Finley DJ, Zhu B, Barden CB, Fahey III TJ. Discrimination of benign and malignant thyroid nodules by molecular profiling. Ann Surg 2004; 240: 425–436, discussion 427–436.

  14. Ferrando AA, Neuberg DS, Staunton J, Loh ML, Huard C, Raimondi SC et al. Gene expression signatures define novel oncogenic pathways in T cell acute lymphoblastic leukemia. Cancer Cell 2002; 1: 75–87.

  15. Ando T, Suguro M, Kobayashi T, Seto M, Honda H. Multiple fuzzy neural network system for outcome prediction and classification of 220 lymphoma patients on the basis of molecular profiling. Cancer Sci 2003; 94: 906–913.

  16. Holleman A, Cheok MH, den Boer ML, Yang W, Veerman AJ, Kazemier KM et al. Gene-expression patterns in drug-resistant acute lymphoblastic leukemia cells and response to treatment. N Engl J Med 2004; 351: 533–542.

  17. Valk PJ, Verhaak RG, Beijen MA, Erpelinck CA, Barjesteh van Waalwijk van Doorn-Khosrovani S, Boer JM et al. Prognostically useful gene-expression profiles in acute myeloid leukemia. N Engl J Med 2004; 350: 1617–1628.

  18. Holloway AJ, van Laar RK, Tothill RW, Bowtell DD. Options available – from start to finish – for obtaining data from DNA microarrays II. Nat Genet 2002; 32 (Suppl): 481–489.

  19. Li Y, Li T, Liu S, Qiu M, Han Z, Jiang Z et al. Systematic comparison of the fidelity of aRNA, mRNA and T-RNA on gene expression profiling using cDNA microarray. J Biotechnol 2004; 107: 19–28.

  20. Ojaniemi H, Evengard B, Lee DR, Unger ER, Vernon SD. Impact of RNA extraction from limited samples on microarray results. Biotechniques 2003; 35: 968–973.

  21. Mikulowska-Mennis A, Taylor TB, Vishnu P, Michie SA, Raja R, Horner N et al. High-quality RNA from cells isolated by laser capture microdissection. Biotechniques 2002; 33: 176–179.

  22. Nakamura T, Furukawa Y, Nakagawa H, Tsunoda T, Ohigashi H, Murata K et al. Genome-wide cDNA microarray analysis of gene expression profiles in pancreatic cancers using populations of tumor cells, normal ductal epithelial cells selected for purity by laser microdissection. Oncogene 2004; 23: 2385–2400.

  23. Zhu G, Reynolds L, Crnogorac-Jurcevic T, Gillett CE, Dublin EA, Marshall JF et al. Combination of microdissection and microarray analysis to identify gene expression changes between differentially located tumour cells in breast cancer. Oncogene 2003; 22: 3742–3748.

  24. Staal FJ, van der Burg M, Wessels LF, Barendregt BH, Baert MR, van den Burg CM et al. DNA microarrays for comparison of gene expression profiles between diagnosis and relapse in precursor-B acute lymphoblastic leukemia: choice of technique and purification influence the identification of potential diagnostic markers. Leukemia 2003; 17: 1324–1332.

  25. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics 2003; 19: 185–193.

  26. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization and summaries of high density nucleotide array probe level data. Biostatistics 2003; 4: 249–264.

  27. Ge Y, Dudoit S, Speed TP. Resampling-based Multiple Testing for Microarray Data Analysis. Department of Statistics, University of California: Berkeley, 2003.

  28. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001; 98: 5116–5121.

  29. Storey JD, Tibshirani R. SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In: Parmigiani G, Garrett ES, Irizarry RA, Zeger SL (eds). The Analysis of Gene Expression Data: Methods and Software. New York: Springer, 2003.

  30. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002; 99: 6567–6572.

  31. Storey JD. A direct approach to false discovery rates. J Roy Statist Soc Ser B 2002; 64: 479–498.

  32. Potter JD. Epidemiology, cancer genetics and microarrays: making correct inferences, using appropriate designs. Trends Genet 2003; 19: 690–695.

  33. Hrusak O, Porwit-MacDonald A. Antigen expression patterns reflecting genotype of acute leukemias. Leukemia 2002; 16: 1233–1258.

  34. Pui CH, Behm FG, Crist WM. Clinical and biologic relevance of immunologic marker studies in childhood acute lymphoblastic leukemia. Blood 1993; 82: 343–362.

  35. Allsup DJ, Cawley JC. The diagnosis and treatment of hairy-cell leukaemia. Blood Rev 2002; 16: 255–262.

  36. Yaziji H, Gown AM. Immunohistochemical analysis of gynecologic tumors. Int J Gynecol Pathol 2001; 20: 64–78.

  37. Llewellyn H. Observer variation, dysplasia grading, and HPV typing: a review. Am J Clin Pathol 2000; 114 (Suppl): S21–S35.

  38. Schlemper RJ, Kato Y, Stolte M. Review of histological classifications of gastrointestinal epithelial neoplasia: differences in diagnosis of early carcinomas between Japanese and Western pathologists. J Gastroenterol 2001; 36: 445–456.

  39. de Bree E, Koops W, Kroger R, van Ruth S, Witkamp AJ, Zoetmulder FA. Peritoneal carcinomatosis from colorectal or appendiceal origin: correlation of preoperative CT with intraoperative findings and evaluation of interobserver agreement. J Surg Oncol 2004; 86: 64–73.

  40. Elgamal AA, Holmes EH, Su SL, Tino WT, Simmons SJ, Peterson M et al. Prostate-specific membrane antigen (PSMA): current benefits and future value. Semin Surg Oncol 2000; 18: 10–16.

  41. Coindre JM. Immunohistochemistry in the diagnosis of soft tissue tumours. Histopathology 2003; 43: 1–16.

  42. Baker M, Gillanders WE, Mikhitarian K, Mitas M, Cole DJ. The molecular detection of micrometastatic breast cancer. Am J Surg 2003; 186: 351–358.

  43. Weber T, Klar E. Minimal residual disease in thyroid carcinoma. Semin Surg Oncol 2001; 20: 272–277.

  44. Hood JD, Cheresh DA. Role of integrins in cell invasion and migration. Nat Rev Cancer 2002; 2: 91–100.

  45. Orr FW, Wang HH, Lafrenie RM, Scherbarth S, Nance DM. Interactions between cancer cells and the endothelium in metastasis. J Pathol 2000; 190: 310–329.

  46. Malinda KM, Kleinman HK. The laminins. Int J Biochem Cell Biol 1996; 28: 957–959.

  47. Tureci O, Ding J, Hilton H, Bian H, Ohkawa H, Braxenthaler M et al. Computational dissection of tissue contamination for identification of colon cancer-specific expression profiles. FASEB J 2003; 17: 376–385.

  48. Lu P, Nakorchevskiy A, Marcotte EM. Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations. Proc Natl Acad Sci USA 2003; 100: 10370–10375.

  49. Stuart RO, Wachsman W, Berry CC, Wang-Rodriguez J, Wasserman L, Klacansky I et al. In silico dissection of cell-type-associated patterns of gene expression in prostate cancer. Proc Natl Acad Sci USA 2004; 101: 615–620.

  50. Mansmann U. Issues in planning and analysing microarray data studies, Proc Int Symp on Bioinformatics for Agricultural Biotechnology, Suwan, Korea, 2003.


We thank Dr E van Wering for providing T-ALL samples.

Author information



Corresponding author

Correspondence to F J T Staal.

Additional information

Supplementary Information

Supplementary Information accompanies the paper on the Leukemia website.

Supplementary information

Appendix A

Microarray data were simulated50 as follows. For $n \in \{10, 20, 40, 100\}$ microarrays, two groups of $n/2$ arrays each, A and B, were simulated. Each array contained 1000 genes, of which 50 were set to be truly expressed in B only. The base log2-expression of a gene $g$ in array $a$ was simulated as follows:

  • calculate an average expression over all arrays: $m_g = \log_2(m'_g)$ with $(m'_g)^{-1} \sim \Gamma(1,1)$

  • true differential expression: $d_g \in \{0,1\}$ with $P(d_g = 1) = 0.05$

  • sign of expression difference: $s_g \in \{-1,1\}$ with $P(s_g = 1) = 0.5$

  • amplitude of expression: $c_g \sim U(1.4, 1.5)$

  • expression: $\log_2(e_{a,g}) \sim N(m_g, 1)$ for $a \in A$, and $\log_2(e_{a,g}) \sim N(m_g + d_g s_g c_g, 1)$ for $a \in B$

where the subscript g indicates values for gene g over all arrays, subscript a,g denotes values for gene g in array a, P(x) indicates the probability of x occurring, Γ(α,θ) is the Gamma distribution, U(l,r) is the uniform distribution on [l,r], and N(μ,σ) is the Gaussian distribution with mean μ and standard deviation σ.
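The base simulation can be sketched in a few lines of NumPy. This is an illustrative reading of the steps above, not the original simulation code: the function name `simulate_base` is hypothetical, the mean expression is taken as the log2 of the inverse of a Γ(1,1) draw, and differential expression is drawn independently per gene with probability 0.05, so the number of truly expressed genes is about 50 rather than exactly 50.

```python
import numpy as np

def simulate_base(n_arrays, n_genes=1000, p_diff=0.05, rng=None):
    """Simulate log2 expression for groups A and B (n_arrays/2 arrays each)."""
    rng = np.random.default_rng() if rng is None else rng
    half = n_arrays // 2
    # (m'_g)^-1 ~ Gamma(1, 1), hence m_g = log2(m'_g) = -log2(draw)
    inv_m = rng.gamma(shape=1.0, scale=1.0, size=n_genes)
    m = -np.log2(inv_m)
    d = rng.random(n_genes) < p_diff           # d_g: truly expressed in B?
    s = rng.choice([-1.0, 1.0], size=n_genes)  # s_g: sign of the difference
    c = rng.uniform(1.4, 1.5, size=n_genes)    # c_g: amplitude
    log_a = rng.normal(m, 1.0, size=(half, n_genes))
    log_b = rng.normal(m + d * s * c, 1.0, size=(half, n_genes))
    return log_a, log_b, d

log_a, log_b, d = simulate_base(20, rng=np.random.default_rng(42))
print(log_a.shape, log_b.shape, int(d.sum()))
```

Rows are arrays and columns are genes, so per-gene statistics can be computed along axis 0.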

In addition to the 50 truly expressed genes, either 50 random impurity genes or 50 group-specific impurity genes were set to be differentially expressed in the arrays, at a fraction f of the level of a truly expressed gene. Random impurity genes g were added as follows:

  • presence of differential expression: $d_{a,g} \in \{0,1\}$ with $P(d_{a,g} = 1) = 0.05$

  • sign of expression difference: $s_{a,g} \in \{-1,1\}$ with $P(s_{a,g} = 1) = 0.5$

  • amplitude of expression difference: $c_{a,g} \sim U(1.4, 1.5)$

  • differential expression: $\log_2(e_{a,g}) = \log_2(e_{a,g}) + f\, d_{a,g}\, s_{a,g}\, c_{a,g}$ for $a \in B$
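The random impurity step can be sketched as follows. The sketch assumes that presence, sign and amplitude are drawn independently for every gene in every array of group B, so that each array carries roughly 50 impurity genes out of 1000; the function name `add_random_impurity` is hypothetical.

```python
import numpy as np

def add_random_impurity(log_b, f, p=0.05, rng=None):
    """Add random impurity to group B (rows: arrays, columns: genes)."""
    rng = np.random.default_rng() if rng is None else rng
    d = rng.random(log_b.shape) < p                # d_{a,g}
    s = rng.choice([-1.0, 1.0], size=log_b.shape)  # s_{a,g}
    c = rng.uniform(1.4, 1.5, size=log_b.shape)    # c_{a,g}
    return log_b + f * d * s * c

# Illustration on a flat baseline so that only the impurity is visible.
baseline = np.zeros((10, 1000))
impure = add_random_impurity(baseline, f=1.0, rng=np.random.default_rng(7))
print(np.count_nonzero(impure, axis=1))  # impurity genes per array
```

Because the affected genes differ from array to array, this kind of impurity mainly adds noise rather than a consistent group difference.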

Group-specific impurity genes g were added as follows:

  • presence of differential expression: $d_g \in \{0,1\}$ with $P(d_g = 1) = 0.05$

  • sign of expression difference: $s_g \in \{-1,1\}$ with $P(s_g = 1) = 0.5$

  • amplitude of expression difference: $c_{a,g} \sim U(1.4, 1.5)$

  • differential expression: $\log_2(e_{a,g}) = \log_2(e_{a,g}) + f\, d_g\, s_g\, c_{a,g}$ for $a \in B$

In the simulations, the impurity fraction f was varied between 0.0 and 1.0.
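A matching sketch for group-specific impurity, swept over several values of f, is given below. It assumes that presence and sign are drawn once per gene and shared by all arrays in B, that the sign takes values in {−1, 1} as for the other gene types, and that only the amplitude varies per array; the function name `add_group_impurity` and the five-point grid for f are hypothetical.

```python
import numpy as np

def add_group_impurity(log_b, f, p=0.05, rng=None):
    """Add group-specific impurity to group B; returns data and gene mask."""
    rng = np.random.default_rng() if rng is None else rng
    n_arrays, n_genes = log_b.shape
    d = rng.random(n_genes) < p                # d_g, shared by all arrays
    s = rng.choice([-1.0, 1.0], size=n_genes)  # s_g, assumed in {-1, 1}
    c = rng.uniform(1.4, 1.5, size=(n_arrays, n_genes))  # c_{a,g}
    return log_b + f * d * s * c, d

rng = np.random.default_rng(1)
baseline = np.zeros((10, 1000))
for f in np.linspace(0.0, 1.0, 5):
    shifted, mask = add_group_impurity(baseline, f, rng=rng)
    mean_shift = np.abs(shifted[:, mask]).mean() if mask.any() else 0.0
    print(f"f={f:.2f}  impurity genes={int(mask.sum())}  mean |shift|={mean_shift:.3f}")
```

Unlike random impurity, the same genes are shifted in every array of B, so at high f these genes become hard to distinguish from truly expressed genes.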

About this article

Cite this article

de Ridder, D., van der Linden, C., Schonewille, T. et al. Purity for clarity: the need for purification of tumor cells in DNA microarray studies. Leukemia 19, 618–627 (2005).


  • microarrays
  • T-ALL
  • CML
  • sample purification
  • gene selection
  • classification
