The properties of high-dimensional data spaces: implications for exploring gene and protein expression data

Clarke, Robert; Ressom, Habtom W.; Wang, Antai; Xuan, Jianhua; Liu, Minetta C.; Gehan, Edmund A.; Wang, Yue

doi:10.1038/nrc2294

Review Article
Published: January 2008

Nature Reviews Cancer volume 8, pages 37–49 (2008)

Key Points

The application of several high-throughput genomic and proteomic technologies to address questions in cancer diagnosis, prognosis and prediction generates high-dimensional data sets.
The multimodality of high-dimensional cancer data can confound both simple mechanistic interpretations of cancer biology and the generation of complete or accurate gene signal transduction pathways or networks. This multimodality arises, for example, from the heterogeneous and dynamic nature of cancer tissues, the concurrent expression of multiple biological processes and the diverse and often tissue-specific activities of single genes.
The mathematical and statistical properties of high-dimensional data spaces are often poorly understood or inadequately considered. This can be particularly challenging for the common scenario where the number of data points obtained for each specimen greatly exceeds the number of specimens.
Data are rarely randomly distributed in high dimensions and are highly correlated, often with spurious correlations.
In high dimensions, the distances from a data point to its nearest and farthest neighbours can become almost equal, potentially compromising the accuracy of some distance-based analysis tools.
Owing to the 'curse of dimensionality' and its negative impact on generalization performance (for example, estimation instability, model overfitting and local convergence), the large estimation error of complex statistical models can easily cancel out the predictive advantage provided by their greater representation power. Conversely, simpler statistical models may produce more reproducible predictions, but those predictions may not always be adequate.
Some machine learning methods address the 'curse of dimensionality' in high-dimensional data analysis through feature selection and dimensionality reduction, leading to better data visualization and improved classification.
It is important to establish the generalization capability of classifiers derived by supervised learning methods from high-dimensional data before using them for cancer diagnosis, prognosis or prediction. Although this can be assessed initially through cross-validation, a more rigorous approach is needed: validating classifier performance on one or more blind validation data sets that were not used during supervised learning.
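The Key Point about nearest and farthest neighbours can be illustrated with a small simulation. This is a minimal sketch and is not taken from the Review: it assumes points drawn uniformly at random from a unit hypercube and Euclidean distance, and the function name `relative_contrast`, the sample size and the dimensions are hypothetical choices made for illustration. Real expression data are not uniformly distributed, so the size of the effect may differ.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in the unit hypercube.

    Assumption for illustration only: coordinates are independent and
    uniform on [0, 1).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


if __name__ == "__main__":
    # The contrast between farthest and nearest neighbour shrinks as the
    # number of dimensions grows.
    for n_dims in (2, 10, 100, 1000):
        print(n_dims, round(relative_contrast(200, n_dims), 3))
```

Under these assumptions the relative contrast is large in two dimensions and small in a thousand, which is one way to see why some distance-based tools can lose discriminating power as dimensionality increases.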

Abstract

High-throughput genomic and proteomic technologies are widely used in cancer research to build better predictive models of diagnosis, prognosis and therapy, to identify and characterize key signalling networks and to find new targets for drug development. These technologies present investigators with the task of extracting meaningful statistical and biological information from high-dimensional data spaces, wherein each sample is defined by hundreds or thousands of measurements, usually concurrently obtained. The properties of high dimensionality are often poorly understood or overlooked in data modelling and analysis. From the perspective of translational science, this Review discusses the properties of high-dimensional data spaces that arise in genomic and proteomic studies and the challenges they can pose for data analysis and interpretation.
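The Key Points note that high-dimensional data often contain spurious correlations, particularly when each specimen has far more measurements than there are specimens. The following minimal sketch, which is not from the Review, shows how this can arise by chance alone. It assumes independent standard normal noise for both the features and the outcome; the function names `pearson` and `max_abs_correlation` and the sizes used are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_correlation(n_samples: int, n_features: int, seed: int = 0) -> float:
    """Largest |r| between a pure-noise outcome and pure-noise features.

    Assumption for illustration only: every value is independent
    standard normal noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    # Ten specimens; compare a handful of features with thousands.
    print("5 features:   ", round(max_abs_correlation(10, 5), 3))
    print("2000 features:", round(max_abs_correlation(10, 2000), 3))
```

With only ten specimens, searching across thousands of unrelated features will typically turn up at least one that appears strongly correlated with the outcome, which is why apparent associations in such data need independent validation.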

**Figure 1: Cluster separability in data space.**

**Figure 2: High dimensional expression data are multimodal.**

**Figure 3: Model fitting, dimensionality and the blessings of smoothness.**

**Figure 4: The curse of dimensionality and the bias or variance dilemma.**

Acknowledgements

We wish to thank D. J. Miller (Department of Electrical Engineering, The Pennsylvania State University) for critical reading of the manuscript. Some of the issues we discuss may appear overly simplified to experts. Several of the emerging concepts have yet to appear in the biomedical literature and publications might not be accessible through PubMed (but are often found at an author's or journal's homepage or at CiteSeer). Many of the engineering and computer science works published in 'proceedings' represent peer-reviewed publications. This work was supported in part by Public Health Service grants R01-CA096483, U54-CA100970, R33-EB000830, R33-CA109872, 1P30-CA51008, R03-CA119313, and a U.S. Department of Defense Breast Cancer Research Program award BC030280.

Author information

Authors and Affiliations

Department of Oncology and Lombardi Comprehensive Cancer Center, Georgetown University School of Medicine, 3970 Reservoir Road NW, Washington, DC 20057, USA
Robert Clarke, Habtom W. Ressom & Minetta C. Liu
Department of Physiology and Biophysics, Georgetown University School of Medicine, 3970 Reservoir Road NW, Washington, DC 20057, USA
Robert Clarke
Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University School of Medicine, 3970 Reservoir Road NW, Washington, DC 20057, USA
Habtom W. Ressom, Antai Wang & Edmund A. Gehan
Bradley Department of Electrical and Computer Engineering, School of Engineering and Sciences, Virginia Polytechnic Institute and State University, 4300 Wilson Boulevard, Arlington, Virginia 22203, USA
Jianhua Xuan & Yue Wang

Corresponding author

Correspondence to Robert Clarke.

Glossary

t-Test: A significance test for assessing hypotheses about population means, usually a test of the equality of means of two independent populations.
False discovery rate: A univariate statistical method that controls the type I (false-positive) errors to correct for multiple testing.
Cluster: A relatively high density of data points separated from other clusters by a relatively low density of points; patterns within a cluster are more similar to each other than to patterns belonging to different clusters.
Massively parallel: A large number of simultaneous processes; in biology, a living cell has multiple, concurrently active processes that are reflected in the proteome and its underlying transcriptome.
Trap of self-fulfilling prophecy: With thousands of measurements and the concurrent presence of multiple sub-phenotypes, intuitively logical but functionally incorrect associations may be implied between the perceived or known function of a signal (gene or protein) and a biological system or phenotype of interest.
Euclidean space: Any mathematical space that is a generalization of the two- and three-dimensional spaces described by the axioms and definitions of Euclidean geometry, for example, properties of angles of plane triangles and of straight and parallel lines.
k-Means clustering algorithm: A method of cluster analysis in which, from an initial partition of the observations into k clusters, each observation in turn is examined and reassigned, if appropriate, to a different cluster, in an attempt to optimize a predefined numerical criterion that measures, in some sense, the quality of the cluster solution.
Vector: A coordinate-based data structure in which the information is represented by a magnitude and a direction.
Hard clustering: Any clustering method that forces a data point to belong only to a single cluster.
Test pattern: Also known as the testing set, this is one or more data points that were not part of the training set; for example, in a leave-one-out approach the testing set is the sample that was left out during training.
Training set: The sample of observations from which a classification function is derived.
Null hypothesis: The hypothesis that there is no difference between the two groups for a variable that is being compared.
Family-wise error rate: The probability of making at least one type I (false-positive) error in a given family of inferences, rather than a per-comparison error rate or a per-experiment error rate.
Clustering algorithm: Procedure designed to find natural groupings or clusters in multidimensional data on the basis of measured or perceived similarities among the patterns.
Surrogate: A measure that substitutes for (and correlates with) a real endpoint but has no assured relationship; for example, tumour shrinkage in response to chemotherapy (surrogate) does not assure that the patient will live longer (endpoint).
Soft clustering: Any clustering method that allows a data point to be a member of more than one cluster.
Vector space: Space where data are represented by vectors that may be scaled and added as in linear algebra; two-dimensional Euclidean space is one form of vector space.
Metric space: A data space in which the distance between every pair of data points is specifically defined.
Hierarchical clustering: A method of cluster analysis that produces a nested series of partitions of the observations, built either by successively merging smaller clusters (agglomerative) or by successively splitting larger ones (divisive).
Logistic regression: A method of analysis concerned with estimating the parameters in a postulated relationship between a response variable (binary for logistic regression) and one or more explanatory variables.
Hyperplane: A higher-dimensional generalization of the concept of a plane in three-dimensional (or a line in two-dimensional) Euclidean geometry. A plane is a surface where, for any two points on the surface, the straight line that passes through those points also lies on the surface.
Kernel function: A mathematical transform applied to one or more input variables; the inner product and convolution are popular forms of kernel function.
Regression coefficient: A component of a statistical model in which a response variable is estimated by a number of explanatory variables, each combined with a regression coefficient that gives the estimated change in the response variable corresponding to a unit change in the appropriate explanatory variable.
Agglomerative hierarchical clustering: Methods of cluster analysis that begin with each individual in a separate cluster and then, in a series of steps, combine individuals, and later clusters, into new and larger clusters, until a final stage is reached where all individuals are members of a single group.
Missing value estimation: When an expected value is not reported for a specific gene or protein, the missing value can be estimated and the estimated value used for data analysis.
Normally distributed: Describes a random variable whose values follow the Gaussian probability-density function, which is completely specified by the mean and variance.
Parametric statistical model: A statistical model, the probability distribution of which is specified by a relatively small set of quantitative parameters.
Subcubical neighbourhood: A smaller cubical area of a larger hypercubical space.
Hypercubical neighbourhood: A higher-dimensional generalization of the concept of a cubic neighbourhood in three-dimensional Euclidean geometry.
Bootstrap re-sampling: A statistical method that repeatedly draws samples, with replacement, from the original data set to estimate the bias and variance of an estimator, for example, a classification algorithm.
Scale-free behaviour: The behaviour of an estimator is scale-free if it depends only on the ranks of the observations, for example, the estimator is equally accurate whether the logarithms of the observations or the values of the observations are used for analysis.
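
As a rough illustration of how the false discovery rate and family-wise error rate entries differ in practice, the following Python sketch compares a Bonferroni correction (one common way to control the family-wise error rate) with the Benjamini-Hochberg step-up procedure (one common way to control the false discovery rate). Neither procedure is named in the glossary; both are assumptions chosen for illustration, as are the function names, the significance level of 0.05 and the ten p-values, which are invented and do not come from any data set discussed here.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Family-wise error rate control: reject p_i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """False discovery rate control (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values for ten genes (illustrative only, not from the article).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(p_values)))
```

With these illustrative inputs, the Bonferroni rule declares one gene significant and the Benjamini-Hochberg rule declares two, which reflects the general pattern that controlling the false discovery rate is less conservative than controlling the family-wise error rate.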

Clarke, R., Ressom, H., Wang, A. et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 8, 37–49 (2008). https://doi.org/10.1038/nrc2294

Issue Date: January 2008
DOI: https://doi.org/10.1038/nrc2294
