The properties of high-dimensional data spaces: implications for exploring gene and protein expression data

Key Points

  • The application of several high-throughput genomic and proteomic technologies to address questions in cancer diagnosis, prognosis and prediction generates high-dimensional data sets.

  • The multimodality of high-dimensional cancer data, for example, as a consequence of the heterogeneous and dynamic nature of cancer tissues, the concurrent expression of multiple biological processes and the diverse and often tissue-specific activities of single genes, can confound both simple mechanistic interpretations of cancer biology and the generation of complete or accurate gene signal transduction pathways or networks.

  • The mathematical and statistical properties of high-dimensional data spaces are often poorly understood or inadequately considered. This can be particularly challenging for the common scenario where the number of data points obtained for each specimen greatly exceeds the number of specimens.

  • Data are rarely randomly distributed in high dimensions and are highly correlated, often with spurious correlations.

  • The distances between a data point and its nearest and farthest neighbours can become nearly equal in high dimensions, potentially compromising the accuracy of some distance-based analysis tools.

  • Owing to the 'curse of dimensionality' phenomenon and its negative impact on generalization performance, for example, estimation instability, model overfitting and local convergence, the large estimation error from complex statistical models can easily compromise the prediction advantage provided by their greater representation power. Conversely, simpler statistical models may produce more reproducible predictions but their predictions may not always be adequate.

  • Some machine learning methods address the 'curse of dimensionality' in high-dimensional data analysis through feature selection and dimensionality reduction, leading to better data visualization and improved classification.

  • It is important to establish the generalization capability of classifiers derived by supervised learning methods from high-dimensional data before using them for cancer diagnosis, prognosis or prediction. Although this can be assessed initially through cross-validation methods, a more rigorous approach is needed: validating classifier performance using one or more blind validation data sets that were not used during supervised learning.
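
The Key Point on nearest and farthest neighbours can be illustrated with a small simulation. The sketch below is an illustration only, not the authors' analysis: it assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the function name relative_contrast is hypothetical. It reports (d_max - d_min) / d_min for one query point, a quantity that tends to shrink as the number of dimensions grows.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points uniform random points in [0, 1]^n_dims."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

if __name__ == "__main__":
    # Print the contrast for increasing dimensionality.
    for d in (2, 10, 100, 1000):
        print(d, round(relative_contrast(200, d), 3))
```

Real expression data are correlated and far from uniform, so the effect there may be weaker or stronger than in this idealized setting.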

Abstract

High-throughput genomic and proteomic technologies are widely used in cancer research to build better predictive models of diagnosis, prognosis and therapy, to identify and characterize key signalling networks and to find new targets for drug development. These technologies present investigators with the task of extracting meaningful statistical and biological information from high-dimensional data spaces, wherein each sample is defined by hundreds or thousands of measurements, usually concurrently obtained. The properties of high dimensionality are often poorly understood or overlooked in data modelling and analysis. From the perspective of translational science, this Review discusses the properties of high-dimensional data spaces that arise in genomic and proteomic studies and the challenges they can pose for data analysis and interpretation.
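
One consequence of having thousands of measurements per sample, the spurious correlations noted in the Key Points, can be sketched as follows. This is a minimal simulation under stated assumptions: the outcome and all features are independent Gaussian noise, so any correlation found is spurious by construction; the function names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def max_abs_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a random outcome and n_features random features."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best

if __name__ == "__main__":
    # Same 20 "specimens", increasing numbers of noise "genes".
    for p in (5, 50, 2000):
        print(p, round(max_abs_correlation(20, p), 3))
```

With a fixed seed the features are generated in the same order, so the maximum can only grow as more noise features are screened; with few specimens, some noise feature will usually look strongly associated with the outcome.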

Figure 1: Cluster separability in data space.
Figure 2: High dimensional expression data are multimodal.
Figure 3: Model fitting, dimensionality and the blessings of smoothness.
Figure 4: The curse of dimensionality and the bias or variance dilemma.
Figure 5: Dimensionality reduction.

Acknowledgements

We wish to thank D. J. Miller (Department of Electrical Engineering, The Pennsylvania State University) for critical reading of the manuscript. Some of the issues we discuss may appear overly simplified to experts. Several of the emerging concepts have yet to appear in the biomedical literature and publications might not be accessible through PubMed (but are often found at an author's or journal's homepage or at CiteSeer). Many of the engineering and computer science works published in 'proceedings' represent peer-reviewed publications. This work was supported in part by Public Health Service grants R01-CA096483, U54-CA100970, R33-EB000830, R33-CA109872, 1P30-CA51008, R03-CA119313, and a U.S. Department of Defense Breast Cancer Research Program award BC030280.

Author information

Corresponding author

Correspondence to Robert Clarke.

Related links

FURTHER INFORMATION

Robert Clarke's homepage

Yue Wang's homepage

Gene Ontology

Kyoto Encyclopedia of Genes and Genomes

Protein Information Resource

CiteSeer

Glossary

t-Test

A significance test for assessing hypotheses about population means, usually a test of the equality of means of two independent populations.
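
As a worked form of this definition, assume two independent samples with means \bar{x}_1 and \bar{x}_2, variances s_1^2 and s_2^2, and sizes n_1 and n_2 (notation introduced here for illustration). The unequal-variance (Welch) version of the statistic can then be written as:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

Under the null hypothesis of equal population means, t is compared against a t distribution to obtain a P value. The pooled-variance (Student) form differs only in how the denominator is estimated.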

False discovery rate

A statistical approach to multiple testing that controls the expected proportion of type I (false-positive) errors among the tests declared significant.
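
One widely used way to control the false discovery rate is the Benjamini-Hochberg step-up procedure. The sketch below is a minimal illustration of that procedure, assuming independent (or positively dependent) tests; the function name and the level q = 0.05 are choices made here, not taken from the text.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    by the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

For example, benjamini_hochberg([0.001, 0.008, 0.039, 0.041]) marks which of four tests survive correction; the input order is preserved in the output.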

Cluster

A cluster consists of a relatively high density of data points separated from other clusters by a relatively low density of points; patterns within a cluster are more similar to each other than to patterns belonging to different clusters.

Massively parallel

A large number of simultaneous processes; in biology, a living cell has multiple, concurrently active processes that are reflected in the proteome and its underlying transcriptome.

Trap of self-fulfilling prophecy

With thousands of measurements and the concurrent presence of multiple sub-phenotypes, intuitively logical but functionally incorrect associations may be implied between the perceived or known function of a signal (gene or protein) and a biological system or phenotype of interest.

Euclidean space

Any mathematical space that is a generalization of the two- and three-dimensional spaces described by the axioms and definitions of Euclidean geometry, for example, properties of angles of plane triangles and of straight and parallel lines.

k-Means clustering algorithm

A method of cluster analysis in which, from an initial partition of the observations into k clusters, each observation in turn is examined and reassigned, if appropriate, to a different cluster, in an attempt to optimize a predefined numerical criterion that measures, in some sense, the quality of the cluster solution.
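
A minimal sketch of the idea follows. It implements the common batch (Lloyd) variant, which reassigns all observations at once and then recomputes the cluster means, rather than examining each observation in turn as in the definition above. Squared Euclidean distance to the cluster mean is assumed as the criterion, and initial centroids are k randomly chosen observations; both are assumptions of this sketch.

```python
import math
import random

def k_means(points, k, n_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length tuples.
    Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k observations
    labels = [None] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no observation was reassigned: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels
```

The result depends on the initial partition, so in practice the procedure is usually repeated from several random starts and the best solution kept.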

Vector

A coordinate-based data structure in which the information is represented by a magnitude and a direction.

Hard clustering

Any clustering method that forces a data point to belong only to a single cluster.

Test pattern

Also known as the testing set, this is one or more data points that were not part of the training set; for example, in a leave-one-out approach the testing set is the sample that was left out during training.

Training set

The sample of observations from which a classification function is derived.

Null hypothesis

The hypothesis that there is no difference between the two groups for a variable that is being compared.

Family-wise error rate

The probability of making any error in a given family of inferences, rather than a per-comparison error rate or a per-experiment error rate.

Clustering algorithm

Procedure designed to find natural groupings or clusters in multidimensional data on the basis of measured or perceived similarities among the patterns.

Surrogate

A measure that substitutes for (and correlates with) a real endpoint but has no assured relationship; for example, tumour shrinkage in response to chemotherapy (surrogate) does not assure that the patient will live longer (endpoint).

Soft clustering

Any clustering method that allows a data point to be a member of more than one cluster.

Vector space

Space where data are represented by vectors that may be scaled and added as in linear algebra; two-dimensional Euclidean space is one form of vector space.

Metric space

A data space in which the distance between every pair of data points is specifically defined.

Hierarchical clustering

A method of cluster analysis that produces a nested series of partitions of a set of observations, where each partition results from merging (or splitting) clusters of another partition in the series.

Logistic regression

A method of analysis concerned with estimating the parameters in a postulated relationship between a response variable (binary for logistic regression) and one or more explanatory variables.

Hyperplane

A higher-dimensional generalization of the concepts of a plane in three-dimensional (or a line in two-dimensional) Euclidean geometry. A plane is a surface where, for any two points on the surface, the straight line that passes through those points also lies on the surface.

Kernel function

A mathematical transform operated upon one or multiple input variables; inner product or convolution is a popular form of kernel function.

Regression coefficient

A component of a statistical model in which a response variable is estimated by a number of explanatory variables, each combined with a regression coefficient that gives the estimated change in the response variable corresponding to a unit change in the appropriate explanatory variable.

Agglomerative hierarchical clustering

Methods of cluster analysis that begin with each individual in a separate cluster and then, in a series of steps, combine individuals, and later clusters, into new and larger clusters, until a final stage is reached where all individuals are members of a single group.

Missing value estimation

When an expected value is not reported for a specific gene or protein the missing value can be estimated and the estimated value used for data analysis.

Normally distributed

The values of a random variable follow a probability-density function that is completely specified by its mean and variance.

Parametric statistical model

A statistical model, the probability distribution of which is specified by a relatively small set of quantitative parameters.

Subcubical neighbourhood

A smaller cubical area of a larger hypercubical space.

Hypercubical neighbourhood

A higher-dimensional generalization of the concepts of a cubic neighbourhood in three-dimensional Euclidean geometry.

Bootstrap re-sampling

A statistical method that repeatedly draws samples, with replacement, from the original data set to estimate quantities such as the bias and variance of a classification algorithm.

Scale-free behaviour

The behaviour of an estimator is scale-free if it depends only on the ranks of the observations; for example, the estimator is equally accurate whether the logarithms of the observations or the values of the observations are used for analysis.
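
The glossary entries for the false discovery rate and the family-wise error rate can be contrasted with a short sketch. The code below is an illustration only: it uses the Benjamini-Hochberg step-up procedure as one common way of controlling the false discovery rate and the Bonferroni correction as one common way of controlling the family-wise error rate. Neither procedure is named in the glossary, and the function names, the threshold of 0.05 and the ten p-values are hypothetical values chosen for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Flag hypotheses rejected by the Benjamini-Hochberg step-up procedure.

    Returns a list of booleans aligned with p_values (True = rejected).
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Flag hypotheses rejected by the Bonferroni correction (p <= alpha / m)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values, e.g. from per-gene t-tests (illustrative numbers only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

fdr_calls = benjamini_hochberg(p_values)
fwer_calls = bonferroni(p_values)
print("Rejected with FDR control: ", sum(fdr_calls))
print("Rejected with FWER control:", sum(fwer_calls))
```

With these illustrative inputs the false discovery rate procedure rejects more hypotheses than the family-wise correction, which reflects the general trade-off: controlling the probability of any false positive is stricter than controlling the expected proportion of false positives among the findings.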

About this article

Cite this article

Clarke, R., Ressom, H., Wang, A. et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 8, 37–49 (2008). https://doi.org/10.1038/nrc2294
