The application of several high-throughput genomic and proteomic technologies to address questions in cancer diagnosis, prognosis and prediction generates high-dimensional data sets.
The multimodality of high-dimensional cancer data, for example, as a consequence of the heterogeneous and dynamic nature of cancer tissues, the concurrent expression of multiple biological processes and the diverse and often tissue-specific activities of single genes, can confound both simple mechanistic interpretations of cancer biology and the generation of complete or accurate gene signal transduction pathways or networks.
The mathematical and statistical properties of high-dimensional data spaces are often poorly understood or inadequately considered. This can be particularly challenging for the common scenario where the number of data points obtained for each specimen greatly exceeds the number of specimens.
Data are rarely randomly distributed in high dimensions and are highly correlated, often with spurious correlations.
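The point about spurious correlations can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data are pure Gaussian noise, the function name `max_abs_correlation` and the sample and feature counts are hypothetical choices, and none of it comes from the Review itself. It reports the largest absolute Pearson correlation between a random outcome and any of a set of unrelated features.

```python
import numpy as np

def max_abs_correlation(n_samples, n_features, seed=0):
    """Largest |Pearson r| between a random outcome and independent noise features."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))  # features: pure noise
    y = rng.standard_normal(n_samples)                # outcome: unrelated to X
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of every column of X with y
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.max(np.abs(r)))

# Same number of specimens, very different numbers of measurements per specimen
few = max_abs_correlation(n_samples=20, n_features=10)
many = max_abs_correlation(n_samples=20, n_features=10000)
print(f"10 features:     max |r| = {few:.2f}")
print(f"10,000 features: max |r| = {many:.2f}")
```

With only 20 specimens, some of the 10,000 noise features would be expected to correlate strongly with the outcome by chance alone, which is why a high correlation in such data is weak evidence of a real association.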
The nearest and farthest neighbours of a data point can become almost equidistant from it in high dimensions, potentially compromising the accuracy of some distance-based analysis tools.
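A brief numerical sketch may help to show how distances concentrate. It assumes points drawn uniformly at random from a unit hypercube and Euclidean distance; the function name `relative_contrast` and the chosen dimensions are illustrative assumptions rather than anything specified in the Review.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(d_max - d_min) / d_min for Euclidean distances from one query point."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)
    d = np.linalg.norm(data - query, axis=1)
    return float((d.max() - d.min()) / d.min())

for dims in (2, 10, 100, 1000):
    print(f"{dims:>5} dimensions: relative contrast = {relative_contrast(500, dims):.3f}")
```

As the number of dimensions grows, the relative contrast shrinks towards zero, so the farthest neighbour is only marginally farther away than the nearest one.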
The 'curse of dimensionality' has a negative impact on generalization performance (for example, through estimation instability, model overfitting and local convergence). As a result, the large estimation error from complex statistical models can easily compromise the prediction advantage provided by their greater representation power. Conversely, simpler statistical models may produce more reproducible predictions but their predictions may not always be adequate.
Some machine learning methods address the 'curse of dimensionality' in high-dimensional data analysis through feature selection and dimensionality reduction, leading to better data visualization and improved classification.
It is important to establish the generalization capability of classifiers derived by supervised learning methods from high-dimensional data before using them for cancer diagnosis, prognosis or prediction. Although this can be assessed initially through cross-validation methods, a more rigorous approach is to validate classifier performance using one or more blind validation data sets that were not used during supervised learning.
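The need for blind validation can be sketched with simulated data in which the class labels carry no information at all. Everything in the example is an assumption made for illustration: the Gaussian noise data, the nearest-centroid classifier, the simple mean-difference feature ranking, and the helper names `top_features`, `centroid_predict` and `loo_accuracy`. It compares leave-one-out cross-validation with feature selection performed once on all samples, the same cross-validation with selection repeated inside each fold, and accuracy on a separate blind set.

```python
import numpy as np

def top_features(X, y, k):
    """Indices of the k features with the largest standardized class-mean difference."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(-np.abs(diff / (X.std(axis=0) + 1e-12)))[:k]

def centroid_predict(X_train, y_train, X_test):
    """Nearest-centroid classifier: assign each test row to the closer class mean."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy, selecting features inside each fold or once up front."""
    n = len(y)
    feats_all = top_features(X, y, k)  # uses every sample, including the one left out
    hits = 0
    for i in range(n):
        train = np.arange(n) != i
        feats = top_features(X[train], y[train], k) if select_inside_fold else feats_all
        pred = centroid_predict(X[train][:, feats], y[train], X[i:i + 1][:, feats])
        hits += int(pred[0] == y[i])
    return hits / n

rng = np.random.default_rng(0)
n, p, k = 40, 5000, 20
X = rng.standard_normal((n, p))        # training data: pure noise
y = np.repeat([0, 1], n // 2)          # labels unrelated to X
X_blind = rng.standard_normal((200, p))
y_blind = np.repeat([0, 1], 100)

biased_cv = loo_accuracy(X, y, k, select_inside_fold=False)
honest_cv = loo_accuracy(X, y, k, select_inside_fold=True)
feats = top_features(X, y, k)
blind = float(np.mean(centroid_predict(X[:, feats], y, X_blind[:, feats]) == y_blind))

print(f"CV, features selected on all samples: {biased_cv:.2f}")
print(f"CV, features selected inside folds:   {honest_cv:.2f}")
print(f"Blind validation set:                 {blind:.2f}")
```

Because the labels are random, any accuracy well above 0.5 is an artefact. Selecting features on all samples before cross-validation tends to produce exactly such an artefact, whereas the blind set gives an estimate close to chance.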
High-throughput genomic and proteomic technologies are widely used in cancer research to build better predictive models of diagnosis, prognosis and therapy, to identify and characterize key signalling networks and to find new targets for drug development. These technologies present investigators with the task of extracting meaningful statistical and biological information from high-dimensional data spaces, wherein each sample is defined by hundreds or thousands of measurements, usually concurrently obtained. The properties of high dimensionality are often poorly understood or overlooked in data modelling and analysis. From the perspective of translational science, this Review discusses the properties of high-dimensional data spaces that arise in genomic and proteomic studies and the challenges they can pose for data analysis and interpretation.
We wish to thank D. J. Miller (Department of Electrical Engineering, The Pennsylvania State University) for critical reading of the manuscript. Some of the issues we discuss may appear overly simplified to experts. Several of the emerging concepts have yet to appear in the biomedical literature and publications might not be accessible through PubMed (but are often found at an author's or journal's homepage or at CiteSeer). Many of the engineering and computer science works published in 'proceedings' represent peer-reviewed publications. This work was supported in part by Public Health Service grants R01-CA096483, U54-CA100970, R33-EB000830, R33-CA109872, 1P30-CA51008, R03-CA119313, and a U.S. Department of Defense Breast Cancer Research Program award BC030280.
- t-test
A significance test for assessing hypotheses about population means, usually a test of the equality of means of two independent populations.
- False discovery rate
A univariate statistical method that controls the type I (false-positive) errors to correct for multiple testing.
- Cluster
A cluster consists of a relatively high density of data points separated from other clusters by a relatively low density of points; patterns within a cluster are more similar to each other than patterns belonging to different clusters.
- Massively parallel
A large number of simultaneous processes; in biology, a living cell has multiple, concurrently active processes that are reflected in the proteome and its underlying transcriptome.
- Trap of self-fulfilling prophecy
With thousands of measurements and the concurrent presence of multiple sub-phenotypes, intuitively logical but functionally incorrect associations may be implied between the perceived or known function of a signal (gene or protein) and a biological system or phenotype of interest.
- Euclidean space
Any mathematical space that is a generalization of the two- and three-dimensional spaces described by the axioms and definitions of Euclidean geometry, for example, properties of angles of plane triangles and of straight and parallel lines.
- k-Means clustering algorithm
A method of cluster analysis in which, from an initial partition of the observations into k clusters, each observation in turn is examined and reassigned, if appropriate, to a different cluster, in an attempt to optimize a predefined numerical criterion that measures, in some sense, the quality of the cluster solution.
- Vector
A coordinate-based data structure in which the information is represented by a magnitude and a direction.
- Hard clustering
Any clustering method that forces a data point to belong only to a single cluster.
- Test pattern
Also known as the testing set, this is a data point(s) that was not part of the training set, for example, in a leave-one-out approach the testing set is the sample that was left out during training.
- Training set
The sample of observations from which a classification function is derived.
- Null hypothesis
The hypothesis that there is no difference between the two groups for a variable that is being compared.
- Family-wise error rate
The probability of making any error in a given family of inferences, rather than a per-comparison error rate or a per-experiment error rate.
- Clustering algorithm
Procedure designed to find natural groupings or clusters in multidimensional data on the basis of measured or perceived similarities among the patterns.
- Surrogate
A measure that substitutes for (and correlates with) a real endpoint but has no assured relationship; for example, tumour shrinkage in response to chemotherapy (surrogate) does not assure that the patient will live longer (endpoint).
- Soft clustering
Any clustering method that allows a data point to be a member of more than one cluster.
- Vector space
Space where data are represented by vectors that may be scaled and added as in linear algebra; two-dimensional Euclidean space is one form of vector space.
- Metric space
A data space where the distance between each pair of data points is specifically defined.
- Hierarchical clustering
A series of nested partitions of a set of observations, where each partition results from merging (or splitting) clusters of another partition in the series.
- Logistic regression
A method of analysis concerned with estimating the parameters in a postulated relationship between a response variable (binary for logistic regression) and one or more explanatory variables.
- Hyperplane
A higher-dimensional generalization of the concept of a plane in three-dimensional (or a line in two-dimensional) Euclidean geometry. A plane is a surface where, for any two points on the surface, the straight line that passes through those points also lies on the surface.
- Kernel function
A mathematical transform operated upon one or multiple input variables; inner product or convolution is a popular form of kernel function.
- Regression coefficient
A component of a statistical model in which a response variable is estimated by a number of explanatory variables, each combined with a regression coefficient that gives the estimated change in the response variable corresponding to a unit change in the appropriate explanatory variable.
- Agglomerative hierarchical clustering
Methods of cluster analysis that begin with each individual in a separate cluster and then, in a series of steps, combine individuals, and later clusters, into new and larger clusters, until a final stage is reached where all individuals are members of a single group.
- Missing value estimation
When an expected value is not reported for a specific gene or protein the missing value can be estimated and the estimated value used for data analysis.
- Normally distributed
The value of a random variable(s) follows a probability-density function completely specified by the mean and variance.
- Parametric statistical model
A statistical model, the probability distribution of which is specified by a relatively small set of quantitative parameters.
- Subcubical neighbourhood
A smaller cubical area of a larger hypercubical space.
- Hypercubical neighbourhood
A higher-dimensional generalization of the concepts of a cubic neighbourhood in three-dimensional Euclidean geometry.
- Bootstrap re-sampling
A statistical method that repeatedly resamples the original data set, with replacement, to estimate the bias and variance for a classification algorithm.
- Scale-free behaviour
The behaviour of an estimator is scale-free if it depends only on the ranks of the observations, for example, the estimator is equally accurate whether the logarithms of the observations or the values of the observations are used for analysis.
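The glossary entries for false discovery rate and family-wise error rate can be made concrete with a short worked example. The sketch below implements the Benjamini-Hochberg step-up procedure and compares it with a Bonferroni correction; the p-values, the 0.05 threshold and the function name `benjamini_hochberg` are illustrative assumptions and are not taken from the Review.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks p-values from smallest
    thresholds = alpha * np.arange(1, m + 1) / m   # rank-specific cut-offs
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its cut-off
        rejected[order[: k + 1]] = True            # reject that rank and all smaller ones
    return rejected

# Hypothetical p-values from ten univariate tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

fdr_hits = benjamini_hochberg(p_values, alpha=0.05)
bonferroni_hits = np.asarray(p_values) <= 0.05 / len(p_values)

print("Benjamini-Hochberg rejections:", int(fdr_hits.sum()))
print("Bonferroni rejections:        ", int(bonferroni_hits.sum()))
```

Controlling the false discovery rate is less conservative than controlling the family-wise error rate, which is one reason it is often preferred when thousands of genes or proteins are tested at once.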
About this article
Cite this article
Clarke, R., Ressom, H., Wang, A. et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 8, 37–49 (2008). https://doi.org/10.1038/nrc2294