Review Article | Published:

Validating, augmenting and refining genome-wide association signals

Nature Reviews Genetics volume 10, pages 318–329 (2009)


Studies using genome-wide platforms have yielded an unprecedented number of promising signals of association between genomic variants and human traits. This Review addresses the steps required to validate, augment and refine such signals to identify underlying causal variants for well-defined phenotypes. These steps include: large-scale exact replication across both similar and diverse populations; fine mapping and resequencing; determination of the most informative markers and multiple independent informative loci; incorporation of functional information; and improved phenotype mapping of the implicated genetic effects. Even in cases for which replication proves that an effect exists, confident localization of the causal variant often remains elusive.

Key points

  • Genome-wide association studies have yielded a large number of association signals with robust statistical support, but these are only markers of the true functional variants.

  • Reliable identification of the true functional variants can be notoriously difficult, but a series of methods could be helpful in this regard.

  • Large-scale exact replication to achieve robust statistical credibility of a marker should precede efforts at finding the causative variants.

  • Fine mapping and resequencing might help to identify more informative markers and multiple independent informative loci.

  • Functional information could fine tune the credibility of different variants for being the causative variant.

  • Additional insights might be obtained by more extensive phenotype mapping of proposed variants.





Acknowledgements

Scientific support for this project was provided through the Tufts Clinical and Translational Science Institute (Tufts CTSI) under funding from the National Institutes of Health/National Center for Research Resources (UL1 RR025752). Points of view or opinions in this paper are those of the authors and do not necessarily represent the official position or policies of the Tufts CTSI.

Author information


  1. Clinical and Molecular Epidemiology Unit, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine and Biomedical Research Institute, Foundation for Research and Technology — Hellas, Ioannina 45110, Greece.

    • John P. A. Ioannidis
  2. Center for Genetic Epidemiology and Modelling, Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, and Tufts Clinical and Translational Science Institute, Boston, Tufts University School of Medicine, Boston, Massachusetts 02111, USA.

    • John P. A. Ioannidis
  3. Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Department of Health and Human Services, Bethesda, Maryland 20892, USA.

    • Gilles Thomas
  4. Fondation Synergie, INSERM U590, Centre Léon Bérard, 28 Rue Laënnec, 69373 Lyon Cedex 08, France.

    • Gilles Thomas
  5. Center for Human Genetic Research, Massachusetts General Hospital, Richard B. Simches Research Center, Boston, Massachusetts 02114, USA.

    • Mark J. Daly
  6. The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA.

    • Mark J. Daly



Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to John P. A. Ioannidis.



Glossary

Pleiotropy

The effect of a gene on more than one phenotype or disease.


Meta-analysis

An analysis that combines the evidence from multiple data sets.

Odds ratio

A measurement of association that is commonly used in case–control studies. It is defined as the odds of exposure to the susceptible genetic variant in cases compared with the odds of exposure in controls. If the odds ratio is significantly greater than one, then the genetic variant is associated with the disease.
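
As a minimal sketch of the definition above, the following Python computes an odds ratio from a 2×2 table of variant carriers and non-carriers in cases and controls. The counts are hypothetical, and the 95% interval uses Woolf's log-scale approximation, which is an assumption of this sketch and not something the article specifies.

```python
import math

def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Return (odds ratio, lower, upper) with an approximate 95% interval."""
    odds_cases = cases_with / cases_without          # odds of exposure in cases
    odds_controls = controls_with / controls_without  # odds of exposure in controls
    estimate = odds_cases / odds_controls
    # Standard error of log(OR) from the four cell counts (Woolf's method).
    se_log = math.sqrt(1 / cases_with + 1 / cases_without
                       + 1 / controls_with + 1 / controls_without)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper

# Hypothetical counts: 300 of 1,000 cases and 200 of 1,000 controls carry the variant.
estimate, lower, upper = odds_ratio(300, 700, 200, 800)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes one is the usual informal reading of "significantly greater than one", although genome-wide studies apply far stricter thresholds than a 95% interval implies.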

Cochran–Armitage test

A genotype-based contingency table test for association that is well suited to the detection of trends across ordinal categories (in this case, genotypes).
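
The trend test described above can be sketched as follows. The sketch assumes additive genotype scores of 0, 1 and 2 (copies of the risk allele) and a normal approximation for the two-sided p value; the function name and the example counts are illustrative only.

```python
import math

def cochran_armitage(cases, controls, scores=(0, 1, 2)):
    """Trend test across ordered genotype categories; returns (z, two-sided p)."""
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    n_controls = sum(controls)
    score_sum_cases = sum(x * r for x, r in zip(scores, cases))
    score_sum_all = sum(x * m for x, m in zip(scores, totals))
    # Numerator: score-weighted excess of cases over the null expectation.
    t = n * score_sum_cases - n_cases * score_sum_all
    # Variance of t under the null hypothesis of no trend.
    var = (n_cases * n_controls / n) * (
        n * sum(x * x * m for x, m in zip(scores, totals)) - score_sum_all ** 2
    )
    z = t / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

# Hypothetical genotype counts ordered as (0, 1, 2) copies of the risk allele.
z, p = cochran_armitage(cases=(30, 100, 70), controls=(70, 100, 30))
print(f"z = {z:.2f}, p = {p:.2e}")
```

Other score choices (for example 0, 1, 1 for a dominant model) change which genetic model the test is most sensitive to.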


(Correlation coefficient). For linkage disequilibrium, it provides a measure of the strength and direction of a linear relationship between the genotypes of two variants expressed as a number of minor alleles.
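
A minimal sketch of the correlation described above, assuming each individual's genotype at each variant is coded as the number of minor alleles (0, 1 or 2). Because it works on unphased genotypes, it gives a genotype-level correlation, which only approximates a haplotype-based value; the genotype vectors are made up for illustration.

```python
import math

def genotype_r(g1, g2):
    """Pearson correlation between two vectors of minor-allele counts."""
    n = len(g1)
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    return cov / math.sqrt(var1 * var2)

# Hypothetical genotypes for eight individuals at two nearby variants.
variant_a = [0, 1, 2, 1, 0, 2, 1, 1]
variant_b = [0, 1, 2, 1, 0, 1, 1, 1]
r = genotype_r(variant_a, variant_b)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

The sign of r depends on which allele is counted at each variant, which is why the squared value is usually reported when only the strength of the correlation matters.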


A highly correlated DNA variant that is an adequate substitute in an association study.

Detection probability

For a two-stage design, this is the probability that a disease-associated SNP will have a p value among the lowest ranks of p values at stage 1 and, among those SNPs selected at stage 1, that a disease-associated SNP will also have a p value among the lowest ranks of p values at stage 2.

Hardy–Weinberg equilibrium

A theoretical description of the relationship between genotype and allele frequencies that is based on an expectation in a stable population undergoing random mating in the absence of selection, new mutations and gene flow. Under these conditions, and in the absence of linkage disequilibrium, the genotype frequencies are equal to the product of the allele frequencies.
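
To illustrate the last sentence of the definition above, the sketch below derives the expected genotype counts for a biallelic variant (p², 2pq and q² times the sample size) and compares them with observed counts using a one-degree-of-freedom chi-square test. The choice of a chi-square test (exact tests are often preferred for rare alleles) and the example counts are assumptions of this sketch.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (expected counts, chi-square statistic, p value) for one variant."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value

# Hypothetical genotype counts AA, AB, BB in 1,000 controls.
expected, chi2, p_value = hwe_chi_square(380, 440, 180)
print([round(e, 1) for e in expected], round(chi2, 2), round(p_value, 4))
```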

Imputation accuracy

This describes the different ways to treat missing genotypes in a data set. Imputed genotypes with less than a pre-specified accuracy can be considered missing or genotypes can be weighted in the calculations on the basis of the estimated imputation accuracy.

Population stratification

The situation that arises when a population contains several subpopulations that differ in their genetic characteristics.


A statistical approach for assessing whether a hypothesis is correct or an alternative should be adopted.

Markov chain Monte Carlo

An iterative computational approach for identifying the most likely model among many possible models.


The determination of the haplotype phase (the arrangement of alleles at two loci on homologous chromosomes) from genotype data using statistical methods.

Winner's curse

The inflation of effect sizes compared with the true effect size for associations that are discovered on the basis of passing specific statistical significance or other selection thresholds.
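
The inflation described above can be seen in a small simulation. This is only a sketch under simple assumptions: every study estimates the same true effect with the same standard error, estimates are normally distributed, and a "discovery" is any estimate whose z score exceeds a fixed threshold. All parameter values are hypothetical.

```python
import random
import statistics

def simulate_winners_curse(true_effect=0.10, se=0.10, z_threshold=1.96,
                           n_studies=20000, seed=1):
    """Return (mean of all estimates, mean of selected estimates, number selected)."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Keep only the estimates that pass the significance threshold.
    selected = [b for b in estimates if b / se > z_threshold]
    return statistics.mean(estimates), statistics.mean(selected), len(selected)

mean_all, mean_selected, n_selected = simulate_winners_curse()
print(f"all studies: {mean_all:.3f}; selected ({n_selected}): {mean_selected:.3f}")
```

The mean over all simulated studies stays close to the true effect, whereas the mean over the selected studies must exceed the threshold times the standard error, so it overstates the true effect whenever power is modest.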


I²

A metric of between-study heterogeneity taking values between 0 and 100%, which describes how much of the between-study heterogeneity is beyond chance.
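
A minimal sketch of how this metric is commonly computed, assuming per-study effect estimates (for example, log odds ratios) with standard errors and inverse-variance weights. It uses Cochran's Q statistic and the usual formula (Q − df)/Q truncated at zero; the input values are hypothetical.

```python
def heterogeneity(effects, std_errors):
    """Return (Cochran's Q, heterogeneity metric in percent)."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of study effects from the pooled effect.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return q, 0.0
    percent = max(0.0, (q - df) / q) * 100
    return q, percent

# Hypothetical log odds ratios and standard errors from four data sets.
q, percent = heterogeneity([0.10, 0.25, 0.18, 0.40], [0.05, 0.08, 0.06, 0.10])
print(f"Q = {q:.2f}, heterogeneity = {percent:.0f}%")
```

With few studies the estimate is very uncertain, so a value of 0% does not by itself demonstrate the absence of heterogeneity.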

Fixed effects model

A set of methods for combining data that assumes there is a common effect in all data sets and that observed effects only differ by chance.

Random effects model

A set of methods for combining data that assumes that genetic effects are different across different populations.
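
The contrast between the fixed effects and random effects definitions above can be sketched with inverse-variance pooling. The random effects part uses the DerSimonian–Laird moment estimator of the between-study variance, which is one common choice among several and is an assumption of this sketch. It needs at least two studies, and the inputs are hypothetical.

```python
import math

def fixed_effects(effects, std_errors):
    """Inverse-variance pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

def random_effects(effects, std_errors):
    """DerSimonian-Laird pooled estimate, standard error and tau^2."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled_fixed, _ = fixed_effects(effects, std_errors)
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at zero
    # Each study's variance is inflated by tau^2 before weighting.
    weights_re = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights_re, effects)) / sum(weights_re)
    return pooled, math.sqrt(1 / sum(weights_re)), tau2

effects = [0.10, 0.25, 0.18, 0.40]      # hypothetical log odds ratios
std_errors = [0.05, 0.08, 0.06, 0.10]
print(fixed_effects(effects, std_errors))
print(random_effects(effects, std_errors))
```

When the estimated between-study variance is zero the two models coincide; otherwise the random effects interval is wider and small studies receive relatively more weight.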

Phenotype misclassification

This describes the situation in which cases are classified as controls or controls are classified as cases for binary outcomes. The equivalent problem for continuous traits is measurement error.

Nested case–control

A design in which cases and controls are sampled from a pre-existing larger cohort.

Convenience sample

A sample of controls or of cases with a trait of interest that is available for another purpose and has not been collected for the purpose of the specific research project or with an explicit sampling scheme.

Principal components analysis

A statistical method used to simplify data sets by transforming a series of correlated variables into a smaller number of uncorrelated factors.

Copy number variant

A class of DNA sequence variants (including deletions and duplications) that lead to a departure from the expected diploid representation of DNA sequence.

Recombination hot spot

A small (usually one to a few kilobases) chromosomal region in which the frequency of meiotic recombination is much higher than average. Hot spots of recombination can be recognized by observing that all pairs of SNPs that encompass the region have a low D′ value.

Gene desert

A stretch of the genome that contains no known protein-coding gene.

Expression quantitative trait locus

A locus at which genetic allelic variation is associated with variation in gene expression.

Bayes factor

The ratio of the prior probabilities of the null hypothesis compared with the alternative hypotheses over the ratio of the posterior probabilities. This can be interpreted as the relative odds that the hypothesis is true before and after examining the data.
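
Following the definition above literally, the sketch below treats the Bayes factor as the prior odds of the null hypothesis divided by its posterior odds, so that values above one mean the data have shifted belief towards the alternative. The function names and the numerical prior are hypothetical, chosen only to show the arithmetic.

```python
def odds(probability):
    """Convert a probability into odds."""
    return probability / (1 - probability)

def bayes_factor(prior_null, posterior_null):
    """Prior odds of the null divided by posterior odds of the null."""
    return odds(prior_null) / odds(posterior_null)

def posterior_null(prior_null, bf):
    """Posterior probability of the null given a prior and a Bayes factor."""
    posterior_odds = odds(prior_null) / bf
    return posterior_odds / (1 + posterior_odds)

# Hypothetical example: a sceptical prior that a given variant is not associated.
prior = 0.9999
post = posterior_null(prior, bf=1e4)
print(f"posterior probability of the null: {post:.3f}")
```

Under a prior this sceptical, even a Bayes factor of 10,000 leaves the null and the alternative roughly equally credible, which illustrates why strong evidence is needed for any single variant in a genome-wide setting.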

Regression model

A model that evaluates the association between one or multiple variables with an outcome of interest.


Overfitting

In a regression model, the tendency to obtain better fit to the available data than to other independent data.

Bayesian method

Any approach that uses a combination of prior beliefs and observed data to generate posterior beliefs.


Endophenotype

A physiological or other trait that is related to a disease trait and is measured independently of the disease.
