Rules of evidence for cancer molecular-marker discovery and validation


According to some claims, molecular markers are set to revolutionize the process of evaluating prognosis and diagnosis for cancer. Research about cancer markers has, however, been characterized by inflated expectations, followed by disappointment when original results cannot be reproduced. Even now, disappointment might be expected, in part because rules of evidence to assess the validity of studies about diagnosis and prognosis are both underdeveloped and not routinely applied. What challenges are involved in assessing studies, and how might problems be avoided so as to realize the full potential of this emerging technology?

Figure 1: Method of dividing original sample to assess reproducibility and overfitting.


Thanks to many colleagues at the National Cancer Institute, The University of North Carolina at Chapel Hill and elsewhere for reviewing and commenting on earlier versions of the manuscript.

The author declares no competing financial interests.

Research about cancer molecular markers



Cross-validation. A technique used in multivariable analysis that is intended to reduce the possibility of overfitting and of non-reproducible results. The method involves sequentially leaving out parts of the original sample ('split-sample') and conducting a multivariable analysis; the process is repeated until the entire sample has been assessed. The results are combined into a final model that is the product of the training step.
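The leave-out-and-repeat procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any particular study: the function names, the single-marker threshold model, the choice of five folds and the toy data are all hypothetical. The sketch reports how well the held-out subjects are classified; it does not build the combined final model.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (training, held_out) index lists; each subject is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        training = [i for i in indices if i not in held]
        yield training, held_out


def fit_threshold(values, labels):
    """Toy model (assumption): cut-off halfway between the two group means."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    return (sum(cases) / len(cases) + sum(controls) / len(controls)) / 2


def cross_validated_accuracy(values, labels, k=5):
    """Fit on each training part, score only the subjects that were left out."""
    correct = 0
    for training, held_out in k_fold_splits(len(values), k):
        cutoff = fit_threshold([values[i] for i in training],
                               [labels[i] for i in training])
        for i in held_out:
            predicted = 1 if values[i] > cutoff else 0  # assumes cases run higher
            correct += predicted == labels[i]
    return correct / len(values)


# Hypothetical marker levels: 10 controls (label 0) followed by 10 cases (label 1).
marker_levels = [1.0 + 0.1 * i for i in range(10)] + [3.0 + 0.1 * i for i in range(10)]
has_cancer = [0] * 10 + [1] * 10
print(cross_validated_accuracy(marker_levels, has_cancer, k=5))
```

Because every subject is scored by a model that never saw that subject, the estimate is less optimistic than one computed on the same data used for fitting. It still comes from the original sample, so it does not replace independent validation.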


Discovery-based research. Research in which large amounts of data are examined, without prior hypothesis, to discover markers or patterns that might discriminate among groups of individuals.


Research in which large numbers of variables are analysed simultaneously. RNA expression analysis using microarrays simultaneously examines expression levels of tens of thousands of genes. Proteomic analysis of serum using mass spectrometry simultaneously examines thousands of peaks related to proteins and peptides.


Microarray. A solid surface on which thousands of specimens, such as synthetic oligonucleotides representing different genes, can be placed in separate locations and used to assess the status of genotype or gene expression for one individual.


Models that simultaneously consider how multiple variables — such as age, gender, co-morbidity, symptoms and gene expression — relate to an outcome such as diagnosis or prognosis.


Overfitting. Finding a discriminatory pattern by chance, which can happen when large numbers of variables are assessed for a small number of outcomes.
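A small simulation can make this chance effect concrete. The sketch below is an illustration under stated assumptions, not an analysis from the article: every marker is a pure coin flip (reduced to 'high' or 'low'), disease status is assigned at random, and the numbers of markers and subjects are arbitrary. The function name and parameters are hypothetical.

```python
import random


def simulate_chance_discovery(n_markers=2000, n_subjects=20,
                              n_new_subjects=1000, seed=1):
    """Return (apparent accuracy of the best noise marker, accuracy in new subjects)."""
    rng = random.Random(seed)
    # Disease status assigned at random: no marker carries any real signal.
    labels = [rng.randint(0, 1) for _ in range(n_subjects)]

    best_accuracy, best_flip = 0.0, False
    for _ in range(n_markers):
        marker = [rng.randint(0, 1) for _ in range(n_subjects)]
        agree = sum(m == y for m, y in zip(marker, labels)) / n_subjects
        # Accept the marker in whichever direction discriminates better.
        accuracy, flip = (agree, False) if agree >= 0.5 else (1 - agree, True)
        if accuracy > best_accuracy:
            best_accuracy, best_flip = accuracy, flip

    # Re-test the 'winning' marker in new subjects. Because the marker is
    # noise, its values in new subjects are simply fresh coin flips.
    hits = 0
    for _ in range(n_new_subjects):
        label = rng.randint(0, 1)
        value = rng.randint(0, 1)
        predicted = 1 - value if best_flip else value
        hits += predicted == label
    return best_accuracy, hits / n_new_subjects


apparent, independent = simulate_chance_discovery()
print(apparent, independent)
```

With thousands of candidate markers and only 20 subjects, the best marker looks strongly discriminatory in the original sample even though it is noise, and its performance falls back towards 50% when re-tested in new subjects.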


Polymerase chain reaction (PCR). A method to replicate or amplify small amounts of DNA into larger amounts that can be used in chemical analysis.


Rules of evidence. Rules that are used to evaluate the strength or validity of research results by considering problems such as heterogeneity, complexity, bias and 'generalizability'. Rules vary depending on the subject or purpose of the study: diagnosis, prognosis, therapy or aetiology.


Serial analysis of gene expression (SAGE). A method to estimate the numbers of copies of gene transcripts.


Single-nucleotide polymorphism (SNP). A variation in DNA sequence involving a single base.


Split sample validation is used, confusingly, to mean two different things. It can refer to the method in the training step by which the sample is divided during the process of cross-validation. It can also refer to the method used to divide the original sample of subjects into two groups for use in training and then in independent validation.
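The second meaning, dividing the original subjects once into a training group and an independent validation group, can be sketched as follows. This is a minimal illustration; the function name, the 50/50 split and the fixed random seed are assumptions made for the example, not details given in the article.

```python
import random


def split_training_validation(subject_ids, validation_fraction=0.5, seed=0):
    """Randomly divide subjects once into (training, validation) groups."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n_validation = round(len(ids) * validation_fraction)
    # The validation group is set aside and not touched during training.
    return ids[n_validation:], ids[:n_validation]


training_ids, validation_ids = split_training_validation(range(100))
print(len(training_ids), len(validation_ids))
```

In this sketch, any cross-validation (the first meaning) would be run only within `training_ids`; `validation_ids` would be used once, after the model is fixed.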


Validation. Refers in general to efforts that are made to confirm the accuracy, precision or effectiveness of results, including reproducibility.

Ransohoff, D. Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer 4, 309–314 (2004).

