  • Review Article

Statistical analysis strategies for association studies involving rare variants

Key Points

  • We review the motivation for exploring the role of rare variants in phenotypic expression.

  • There are several problems with capturing the effects of rare variants in association studies using current statistical analysis methods.

  • We discuss the concept and use of collapsing sets of rare variants into predictors of phenotypic expression, to aid statistical analyses of rare variant associations.

  • Functional annotations of specific variants and genomic regions can be used to define collapsed sets of rare variants.

  • A range of statistical analysis models and inference-making procedures could be exploited to assess the association between rare variants and phenotypic expression. We discuss the relative merits of these approaches.

  • We compare moving window and defined region approaches to the analysis of rare variant effects.

  • We discuss why rare variant analysis requires flexible statistical models and methods that can accommodate additional factors, including common variants, interactions between variants, beneficial and deleterious effects of variants, and environmental factors.
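The collapsing idea in the key points above can be illustrated with a minimal sketch. It assumes genotypes are coded as minor-allele counts (0, 1 or 2), collapses a set of rare variants into a single carrier indicator per individual, and compares carrier frequencies between cases and controls with a one-sided Fisher's exact test. The function names, the toy data and the choice of test are all illustrative assumptions, not the specific procedure recommended in the article. A small helper also shows how a moving window differs from a defined region: it simply generates overlapping index sets to which the same collapsed test could be applied.

```python
from math import comb


def collapse_carriers(genotypes, variant_idx):
    """Return 1 for each individual carrying a minor allele at any listed variant, else 0."""
    return [int(any(row[j] > 0 for j in variant_idx)) for row in genotypes]


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value for an excess of carriers among cases.

    2x2 table: a = case carriers, b = case non-carriers,
               c = control carriers, d = control non-carriers.
    """
    n_carriers, n_cases, n_total = a + c, a + b, a + b + c + d
    upper = min(n_carriers, n_cases)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    tail = sum(
        comb(n_carriers, k) * comb(n_total - n_carriers, n_cases - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_cases)


def collapsed_test(genotypes, is_case, variant_idx):
    """Collapse the chosen variants and test carrier status against case status."""
    carrier = collapse_carriers(genotypes, variant_idx)
    a = sum(1 for x, y in zip(carrier, is_case) if x and y)
    b = sum(1 for x, y in zip(carrier, is_case) if not x and y)
    c = sum(1 for x, y in zip(carrier, is_case) if x and not y)
    d = sum(1 for x, y in zip(carrier, is_case) if not x and not y)
    return (a, b, c, d), fisher_one_sided(a, b, c, d)


def moving_windows(n_variants, width, step):
    """Index sets for overlapping windows of `width` variants, advanced by `step`."""
    last_start = max(n_variants - width, 0)
    return [
        list(range(s, min(s + width, n_variants)))
        for s in range(0, last_start + 1, step)
    ]


# Hypothetical toy data: 4 cases followed by 4 controls, 3 rare variants.
GENOTYPES = [
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0],  # cases
    [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0],  # controls
]
IS_CASE = [1, 1, 1, 1, 0, 0, 0, 0]

# Defined-region analysis: collapse all three variants at once.
region_table, region_p = collapsed_test(GENOTYPES, IS_CASE, [0, 1, 2])

# Moving-window analysis: apply the same test to each overlapping window.
window_results = [
    (w, collapsed_test(GENOTYPES, IS_CASE, w)[1])
    for w in moving_windows(n_variants=3, width=2, step=1)
]
```

In this toy example no single variant distinguishes the groups on its own (each is seen in only one case), whereas the collapsed indicator pools their evidence. A moving-window scan multiplies the number of tests, so in practice the resulting p-values would need a multiple-testing correction, which is not shown here.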

Abstract

The limitations of genome-wide association (GWA) studies that focus on the phenotypic influence of common genetic variants have motivated human geneticists to consider the contribution of rare variants to phenotypic expression. The increasing availability of high-throughput sequencing technologies has enabled studies of rare variants, but these technologies alone will not be sufficient for such studies to succeed: appropriate analytical methods are also needed. We consider data analysis approaches to testing associations between a phenotype and collections of rare variants in a defined genomic region or set of regions. Ultimately, although a wide variety of analytical approaches exist, more work is needed to refine them and determine their properties and power in different contexts.

Figure 1: Sample size requirements and statistical power for variants of different frequencies.
Figure 2: Scenarios in which DNA sequence variants distinguish cases and controls.

References

  1. Manolio, T. A., Brooks, L. D. & Collins, F. S. A HapMap harvest of insights into the genetics of common disease. J. Clin. Invest. 118, 1590–1605 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  2. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009). This paper describes the motivation for considering alternative approaches to discovering the genes that influence common complex diseases. It essentially argues that current GWA study paradigms focusing on common variants have failed to identify the majority of genetic variants that influence particular phenotypes.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Pinto, D. et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466, 368–372 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Frazer, K. A., Murray, S. S., Schork, N. J. & Topol, E. J. Human genetic variation and its contribution to complex traits. Nature Rev. Genet. 10, 241–251 (2009).

    CAS  PubMed  Google Scholar 

  5. Tycko, B. Mapping allele-specific DNA methylation: a new tool for maximizing information from GWAS. Am. J. Hum. Genet. 86, 109–112 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  6. Kong, A. et al. Parental origin of sequence variants associated with complex diseases. Nature 462, 868–874 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  7. Eichler, E. E. et al. Completing the map of human genetic variation. Nature 447, 161–165 (2007).

    CAS  PubMed  PubMed Central  Google Scholar 

  8. Hunter, D. J. Gene–environment interactions in human diseases. Nature Rev. Genet. 6, 287–298 (2005).

    CAS  PubMed  Google Scholar 

  9. Cordell, H. J. Detecting gene–gene interactions that underlie human diseases. Nature Rev. Genet. 10, 392–404 (2009).

    CAS  PubMed  Google Scholar 

  10. Bodmer, W. & Bonilla, C. Common and rare variants in multifactorial susceptibility to common diseases. Nature Genet. 40, 695–701 (2008).

    CAS  PubMed  Google Scholar 

  11. Schork, N. J., Murray, S. S., Frazer, K. A. & Topol, E. J. Common vs. rare allele hypotheses for complex diseases. Curr. Opin. Genet. Dev. 19, 212–219 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  12. Cirulli, E. T. et al. Common genetic variation and performance on standardized cognitive tests. Eur. J. Hum. Genet. 18, 815–820 (2010).

    PubMed  PubMed Central  Google Scholar 

  13. Asimit, J. & Zeggini, E. Rare variant association analysis methods for complex traits. Annu. Rev. Genet. 44, 293–308 (2010).

    CAS  PubMed  Google Scholar 

  14. Gorlov, I. P., Gorlova, O. Y., Sunyaev, S. R., Spitz, M. R. & Amos, C. I. Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am. J. Hum. Genet. 82, 100–112 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  15. Pritchard, J. K. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69, 124–137 (2001).

    CAS  PubMed  PubMed Central  Google Scholar 

  16. Wood, L. D. et al. The genomic landscapes of human breast and colorectal cancers. Science 318, 1108–1113 (2007). This study suggests that many different mutations in key genes are likely to drive tumorigenesis so that, although patients might have unique mutations, these mutations are likely to be in genes that harbour mutations across many patients. This rare variant heterogeneity may also contribute to the inherited basis of many common chronic diseases.

    CAS  PubMed  Google Scholar 

  17. Lahiry, P., Torkamani, A., Schork, N. J. & Hegele, R. A. Kinase mutations in human disease: interpreting genotype-phenotype relationships. Nature Rev. Genet. 11, 60–74 (2010).

    CAS  PubMed  Google Scholar 

  18. Bobadilla, J. L., Macek, M. Jr, Fine, J. P. & Farrell, P. M. Cystic fibrosis: a worldwide analysis of CFTR mutations — correlation with incidence data and application to screening. Hum. Mutat. 19, 575–606 (2002).

    CAS  PubMed  Google Scholar 

  19. Easton, D. F. et al. A systematic genetic assessment of 1,433 sequence variants of unknown clinical significance in the BRCA1 and BRCA2 breast cancer-predisposition genes. Am. J. Hum. Genet. 81, 873–883 (2007).

    CAS  PubMed  PubMed Central  Google Scholar 

  20. Schork, N. J., Wessel, J. & Malo, N. DNA sequence-based phenotypic association analysis. Adv. Genet. 60, 195–217 (2008).

    CAS  PubMed  Google Scholar 

  21. Metzker, M. L. Sequencing technologies — the next generation. Nature Rev. Genet. 11, 31–46 (2010).

    CAS  PubMed  Google Scholar 

  22. Nejentsev, S., Walker, N., Riches, D., Egholm, M. & Todd, J. A. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 324, 387–389 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  23. Ng, S. B. et al. Exome sequencing identifies the cause of a Mendelian disorder. Nature Genet. 42, 30–35 (2010).

    CAS  PubMed  Google Scholar 

  24. Roach, J. C. et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636–639 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  25. Schork, N. J., Nath, S. K., Fallin, D. & Chakravarti, A. Linkage disequilibrium analysis of biallelic DNA markers, human quantitative trait loci, and threshold-defined case and control subjects. Am. J. Hum. Genet. 67, 1208–1218 (2000).

    CAS  PubMed  PubMed Central  Google Scholar 

  26. Lanktree, M. B., Hegele, R. A., Schork, N. J. & Spence, J. D. Extremes of unexplained variation as a phenotype: an efficient approach for genome-wide association studies of cardiovascular disease. Circ. Cardiovasc. Genet. 3, 215–221 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  27. Gilad, Y., Pritchard, J. K. & Thornton, K. Characterizing natural variation using next-generation sequencing technologies. Trends Genet. 25, 463–471 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  28. Li, B. & Leal, S. M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321 (2008). One of the first papers to comprehensively evaluate statistical methods for testing collapsed sets of rare variants to a trait. The paper discussed both distance-based and regression approaches.

    CAS  PubMed  PubMed Central  Google Scholar 

  29. Altshuler, D., Daly, M. J. & Lander, E. S. Genetic mapping in human disease. Science 322, 881–888 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  30. Morgenthaler, S. & Thilly, W. G. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat. Res. 615, 28–56 (2007). This paper introduced the notion of collapsing sets of variants into a single group whose collective frequency could be contrasted between groups.

    CAS  PubMed  Google Scholar 

  31. McClellan, J. & King, M. C. Genetic heterogeneity in human disease. Cell 141, 210–217 (2010).

    CAS  PubMed  Google Scholar 

  32. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).

    CAS  PubMed  PubMed Central  Google Scholar 

  33. Morris, A. P. & Zeggini, E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet. Epidemiol. 34, 188–193 (2010).

    PubMed  Google Scholar 

  34. Madsen, B. E. & Browning, S. R. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 5, e1000384 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  35. Price, A. L. et al. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 86, 832–838 (2010). This paper describes a method for explicitly incorporating information about the likely functional effect of specific rare variants into the formulation of an association statistic. However, the proposed method only considers coding variations.

    PubMed  PubMed Central  Google Scholar 

  36. Ng, S. B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272–276 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  37. Sebat, J., Levy, D. & McCarthy, S. E. Rare structural variants in schizophrenia: one disorder, multiple mutations; one mutation, multiple disorders. Trends Genet. 25, 528–535 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  38. Xiong, M., Zhao, J. & Boerwinkle, E. Generalized T2 test for genome association studies. Am. J. Hum. Genet. 70, 1257–1268 (2002).

    CAS  PubMed  PubMed Central  Google Scholar 

  39. Lehmann, E. L. Nonparametric Statistical Methods Based on Ranks (McGraw–Hill, New York, 1975).

    Google Scholar 

  40. Han, F. & Pan, W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum. Hered. 70, 42–54 (2010).

    PubMed  PubMed Central  Google Scholar 

  41. Hoh, J. & Ott, J. Scan statistics to scan markers for susceptibility genes. Proc. Natl Acad. Sci. USA 97, 9615–9617 (2000).

    CAS  PubMed  PubMed Central  Google Scholar 

  42. Pan, W., Han, F. & Shen, X. Test selection with application to detecting disease association with multiple SNPs. Hum. Hered. 69, 120–130 (2010).

    PubMed  Google Scholar 

  43. Fallin, D. et al. Genetic analysis of case/control data using estimated haplotype frequencies: application to APOE locus variation and Alzheimer's disease. Genome Res. 11, 143–151 (2001).

    CAS  PubMed  PubMed Central  Google Scholar 

  44. Zhao, J. H., Curtis, D. & Sham, P. C. Model-free analysis and permutation tests for allelic associations. Hum. Hered. 50, 133–139 (2000).

    CAS  PubMed  Google Scholar 

  45. Zhu, X., Fejerman, L., Luke, A., Adeyemo, A. & Cooper, R. S. Haplotypes produced from rare variants in the promoter and coding regions of angiotensinogen contribute to variation in angiotensinogen levels. Hum. Mol. Genet. 14, 639–643 (2005).

    CAS  PubMed  Google Scholar 

  46. Zhu, X., Feng, T., Li, Y., Lu, Q. & Elston, R. C. Detecting rare variants for complex traits using family and unrelated data. Genet. Epidemiol. 34, 171–187 (2010).

    PubMed  PubMed Central  Google Scholar 

  47. Hartl, D. L. & Clark, A. G. Principles of Population Genetics (Sinauer Associates, Sunderland, Massachusetts, 2007).

    Google Scholar 

  48. Holsinger, K. E. & Weir, B. S. Genetics in geographically structured populations: defining, estimating and interpreting FST . Nature Rev. Genet. 10, 639–650 (2009).

    CAS  PubMed  Google Scholar 

  49. Nei, M. Molecular Evolutionary Genetics (Columbia Univ. Press, New York, 1987).

  50. Jost, L. GST and its relatives do not measure differentiation. Mol. Ecol. 17, 4015–4026 (2008).

    PubMed  Google Scholar 

  51. Mount, D. W. Bioinformatics: Sequence and Genome Analysis (Cold Spring Harbor Laboratory Press, New York, 2001).

    Google Scholar 

  52. Qian, D. & Thomas, D. C. Genome scan of complex traits by haplotype sharing correlation. Genet. Epidemiol. 21 (Suppl. 1), S582–S587 (2001).

    PubMed  Google Scholar 

  53. Tzeng, J. Y., Devlin, B., Wasserman, L. & Roeder, K. On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am. J. Hum. Genet. 72, 891–902 (2003).

    CAS  PubMed  PubMed Central  Google Scholar 

  54. Wessel, J. & Schork, N. J. Generalized genomic distance-based regression methodology for multilocus association analysis. Am. J. Hum. Genet. 79, 792–806 (2006).

    CAS  PubMed  PubMed Central  Google Scholar 

  55. Mukhopadhyay, I., Feingold, E., Weeks, D. E. & Thalamuthu, A. Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet. Epidemiol. 34, 213–221 (2009).

    Google Scholar 

  56. Clayton, D., Chapman, J. & Cooper, J. Use of unphased multilocus genotype data in indirect association studies. Genet. Epidemiol. 27, 415–428 (2004).

    PubMed  Google Scholar 

  57. Tzeng, J. Y., Zhang, D., Chang, S. M., Thomas, D. C. & Davidian, M. Gene–trait similarity regression for multimarker-based association analysis. Biometrics 65, 822–832 (2009).

    PubMed  PubMed Central  Google Scholar 

  58. Lin, W. Y. & Schaid, D. J. Power comparisons between similarity-based multilocus association methods, logistic regression, and score tests for haplotypes. Genet. Epidemiol. 33, 183–197 (2009).

    PubMed  PubMed Central  Google Scholar 

  59. Ickstadt, K., Selinski, S. & Muller, T. D. in SFB 475 Komplexitatsreduktion in Multivariaten Datenstrukturen (Univ. Dortmund, Germany, 2005).

    Google Scholar 

  60. Templeton, A. R. et al. Tree scanning: a method for using haplotype trees in phenotype/genotype association studies. Genetics 169, 441–453 (2005).

    CAS  PubMed  PubMed Central  Google Scholar 

  61. Nair, R. P. et al. Localization of psoriasis-susceptibility locus PSORS1 to a 60-kb interval telomeric to HLA-C. Am. J. Hum. Genet. 66, 1833–1844 (2000).

    CAS  PubMed  PubMed Central  Google Scholar 

  62. Tachmazidou, I., Verzilli, C. J. & De Iorio, M. Genetic association mapping via evolution-based clustering of haplotypes. PLoS Genet. 3, e111 (2007).

    PubMed  PubMed Central  Google Scholar 

  63. Kowalski, J., Pagano, M. & DeGruttola, V. A nonparametric test of gene region heterogeneity associated with phenotype. J. Am. Stat. Assoc. 97, 398–408 (2002).

    Google Scholar 

  64. Gilbert, P. B., Novitsky, V. A., Montano, M. A. & Essex, M. An efficient test for comparing sequence diversity between two populations. J. Comput. Biol. 8, 123–139 (2001).

    CAS  PubMed  Google Scholar 

  65. Anderson, M. J. Distance-based tests for homogeneity of multivariate dispersions. Biometrics 62, 245–253 (2006).

    PubMed  Google Scholar 

  66. Bhatia, G. et al. A covering method for detecting genetic associations between rare variants and common phenotypes. PLoS Genet. (in the press).

  67. Kooperberg, C., Ruczinski, I., LeBlanc, M. L. & Hsu, L. Sequence analysis using logic regression. Genet. Epidemiol. 21 (Suppl. 1), S626–S631 (2001). One of the first papers to consider statistical methods for identifying optimal sets of predictors of a phenotype from sequence data based purely on the strength of statistical association. This paper proposed a novel regression method for this task.

    Google Scholar 

  68. Ott, J. Analysis of Human Genetic Linkage (Johns Hopkins Univ. Press, Baltimore, 1991).

    Google Scholar 

  69. Kruglyak, L., Daly, M. J., Reeve-Daly, M. P. & Lander, E. S. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58, 1347–1363 (1996).

    CAS  PubMed  PubMed Central  Google Scholar 

  70. Risch, N. & Merikangas, K. The future of genetic studies of complex human diseases. Science 273, 1516–1517 (1996).

    CAS  PubMed  Google Scholar 

  71. Oexle, K. A remark on rare variants. J. Hum. Genet. 55, 219–226 (2010).

    PubMed  Google Scholar 

  72. Haiman, C. A. et al. Multiple regions within 8q24 independently affect risk for prostate cancer. Nature Genet. 39, 638–644 (2007).

    CAS  PubMed  Google Scholar 

  73. Clarke, R. et al. Genetic variants associated with Lp(a) lipoprotein level and coronary disease. N. Engl. J. Med. 361, 2518–2528 (2009).

    CAS  PubMed  Google Scholar 

  74. Malo, N., Libiger, O. & Schork, N. J. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am. J. Hum. Genet. 82, 375–385 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  75. Hoggart, C. J., Whittaker, J. C., De Iorio, M. & Balding, D. J. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 4, e1000130 (2008). Refs 74 and 75 introduced regularized regression techniques for accommodating a large number of predictors in a genetic association study and to separate causally associated from non-causally associated variants.

    PubMed  PubMed Central  Google Scholar 

  76. Zhou, H., Sehl, M. E., Sinsheimer, J. S. & Lange, K. Association screening of common and rare genetic variants by penalized regression. Bioinformatics 6 Aug 2010 (doi:10.1093/bioinformatics/btq448).

    CAS  PubMed  PubMed Central  Google Scholar 

  77. Clark, T. G., De Iorio, M., Griffiths, R. C. & Farrall, M. Finding associations in dense genetic maps: a genetic algorithm approach. Hum. Hered. 60, 97–108 (2005).

    PubMed  Google Scholar 

  78. Guo, W. & Lin, S. Generalized linear modeling with regularization for detecting common disease rare haplotype association. Genet. Epidemiol. 33, 308–316 (2009).

    PubMed  PubMed Central  Google Scholar 

  79. Luan, Y. H. & Li, H. Z. Group additive regression models for genomic data analysis. Biostatistics 9, 100–113 (2008).

    PubMed  Google Scholar 

  80. Kwee, L. C., Liu, D. W., Lin, X. H., Ghosh, D. & Epstein, M. P. A powerful and flexible multilocus association test for quantitative traits. Am. J. Hum. Genet. 82, 386–397 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  81. Capanu, M. & Begg, C. B. Hierarchical modeling for estimating relative risks of rare genetic variants: properties of the pseudo-likelihood method. Biometrics 5 Aug 2010 (doi:10.1111/j.1541-0420.2010.01469.x).

    PubMed  PubMed Central  Google Scholar 

  82. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288 (1996).

    Google Scholar 

  83. Friedman, J. H. Fast sparse regression and classification. (Stanford Univ., California, 2008).

  84. van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner. Stat. Appl. Genet. Mol. Biol. 6, 25 (2007).

    Google Scholar 

  85. Dickson, S. P., Wang, K., Krantz, I., Hakonarson, H. & Goldstein, D. B. Rare variants create synthetic genome-wide associations. PLoS Biol. 8, e1000294 (2010).

    PubMed  PubMed Central  Google Scholar 

  86. Bansal, V., Libiger, O., Torkamani, A. & Schork, N. J. An application and empirical comparison of statistical analysis methods for associating rare variants to a complex phenotype. Pacific Symposium on Biocomputing Proceedings (in the press).

  87. Wessel, J., Schork, A. J., Tiwari, H. K. & Schork, N. J. Powerful designs for genetic association studies that consider twins and sibling pairs with discordant genotypes. Genet. Epidemiol. 31, 789–796 (2007).

    PubMed  Google Scholar 

  88. Nievergelt, C. M., Libiger, O. & Schork, N. J. Generalized analysis of molecular variance. PLoS Genet. 3, e51 (2007).

    PubMed  PubMed Central  Google Scholar 

  89. Moskvina, V., Craddock, N., Holmans, P., Owen, M. J. & O'Donovan, M. C. Effects of differential genotyping error rate on the type I error probability of case-control studies. Hum. Hered. 61, 55–64 (2006).

    PubMed  Google Scholar 

  90. Zschocke, J. Dominant versus recessive: molecular mechanisms in metabolic disease. J. Inherit. Metab. Dis. 31, 599–618 (2008).

    CAS  PubMed  Google Scholar 

  91. Andres, A. M. et al. Understanding the accuracy of statistical haplotype inference with sequence data of known phase. Genet. Epidemiol. 31, 659–671 (2007).

    PubMed  PubMed Central  Google Scholar 

  92. Kim, J. H., Waterman, M. S. & Li, L. M. Accuracy assessment of diploid consensus sequences. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 88–97 (2007).

    CAS  PubMed  Google Scholar 

  93. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).

    PubMed  PubMed Central  Google Scholar 

  94. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–909 (2006).

    CAS  PubMed  Google Scholar 

  95. Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nature Genet. 42, 348–354 (2010).

    CAS  PubMed  Google Scholar 

  96. Li, B. & Leal, S. M. Discovery of rare variants via sequencing: implications for the design of complex trait association studies. PLoS Genet. 5, e1000481 (2009).

    PubMed  PubMed Central  Google Scholar 

  97. Li, Y., Willer, C., Sanna, S. & Abecasis, G. Genotype imputation. Annu. Rev. Genomics Hum. Genet. 10, 387–406 (2009).

    CAS  PubMed  PubMed Central  Google Scholar 

  98. Wang, K. et al. Interpretation of association signals and identification of causal variants from genome-wide association studies. Am. J. Hum. Genet. 86, 730–742 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  99. Efron, B. Correlation and large-sclae simultaneous significance testing J. Am. Stat. Asso. 102, 92–103 (2007).

    Google Scholar 

  100. Sandelin, A., Wasserman, W. W. & Lenhard, B. ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res. 32, W249–W252 (2004).

    CAS  PubMed  PubMed Central  Google Scholar 

  101. Matys, V. et al. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).

    CAS  PubMed  Google Scholar 

  102. Visel, A., Minovitsky, S., Dubchak, I. & Pennacchio, L. A. VISTA Enhancer Browser — a database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–D92 (2007).

    CAS  PubMed  Google Scholar 

  103. Griffiths-Jones, S., Saini, H. K., van Dongen, S. & Enright, A. J. miRBase: tools for microRNA genomics. Nucleic Acids Res. 36, D154–D158 (2008).

    CAS  PubMed  Google Scholar 

  104. Lewis, B. P., Burge, C. B. & Bartel, D. P. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15–20 (2005).

    CAS  PubMed  Google Scholar 

  105. Yeo, G. & Burge, C. B. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol. 11, 377–394 (2004).

    CAS  PubMed  Google Scholar 

  106. Cartegni, L., Wang, J., Zhu, Z., Zhang, M. Q. & Krainer, A. R. ESEfinder: a web resource to identify exonic splicing enhancers. Nucleic Acids Res. 31, 3568–3571 (2003).

    CAS  PubMed  PubMed Central  Google Scholar 

  107. Fairbrother, W. G., Yeh, R. F., Sharp, P. A. & Burge, C. B. Predictive identification of exonic splicing enhancers in human genes. Science 297, 1007–1013 (2002).

    CAS  PubMed  Google Scholar 

  108. Sironi, M. et al. Silencer elements as possible inhibitors of pseudoexon splicing. Nucleic Acids Res. 32, 1783–1791 (2004).

    CAS  PubMed  PubMed Central  Google Scholar 

  109. Wang, Z. et al. Systematic identification and analysis of exonic splicing silencers. Cell 119, 831–845 (2004).

    CAS  PubMed  Google Scholar 

  110. Goren, A. et al. Comparative analysis identifies exonic splicing regulatory sequences-the complex definition of enhancers and silencers. Mol. Cell 22, 769–781 (2006).

    CAS  PubMed  Google Scholar 

  111. Zhang, L. et al. Functional allelic heterogeneity and pleiotropy of a repeat polymorphism in tyrosine hydroxylase: prediction of catecholamines and response to stress in twins. Physiol. Genomics 19, 277–291 (2004).

    PubMed  Google Scholar 

  112. Zhang, C., Li, W. H., Krainer, A. R. & Zhang, M. Q. RNA landscape of evolution for optimal exon and intron discrimination. Proc. Natl Acad. Sci. USA 105, 5797–5802 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  113. Birney, E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).

    CAS  PubMed  Google Scholar 

  114. Kuhn, R. M. et al. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 37, D755–D761 (2009).

    CAS  PubMed  Google Scholar 

  115. Matthews, L. et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 316, D16–D22 (2009).

    Google Scholar 

  116. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M. & Hirakawa, M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D35–D60 (2010).

    Google Scholar 

  117. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature Genet. 25, 25–29 (2000).

    CAS  PubMed  Google Scholar 

  118. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

    CAS  PubMed  PubMed Central  Google Scholar 

  119. Dahlquist, K. D., Salomonis, N., Vranizan, K., Lawlor, S. C. & Conklin, B. R. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genet. 31, 19–20 (2002).

    CAS  PubMed  Google Scholar 

  120. Dennis, G. Jr et al. DAVID: database for annotation, visualization, and integrated discovery. Genome Biol. 4, P3 (2003).

    PubMed  Google Scholar 

  121. Suderman, M. & Hallett, M. Tools for visually exploring biological networks. Bioinformatics 23, 2651–2659 (2007).

    CAS  PubMed  Google Scholar 

  122. Karchin, R. Next generation tools for the annotation of human SNPs. Brief. Bioinformatics 10, 35–52 (2009).

    CAS  PubMed  Google Scholar 

  123. Plumpton, M. & Barnes, M. R. in Bioinformatics for Geneticists (ed. Barnes, M. R.) (John Wiley and Sons, New York, 2007). An excellent review of the methods available for computationally assessing the functional impact of DNA sequence variants. It also provides lists of available tools.

    Google Scholar 

  124. Ng, P. C. & Henikoff, S. Predicting the effects of amino acid substitutions on protein function. Annu. Rev. Genomics Hum. Genet. 7, 61–80 (2006).

    CAS  PubMed  Google Scholar 

  125. Andersen, M. C. et al. In silico detection of sequence variations modifying transcriptional regulation. PLoS Comput. Biol. 4, e5 (2008).

    PubMed  PubMed Central  Google Scholar 

  126. Everitt, B. S. Cluster Analysis (John Wiley and Sons, New York, 2009).

    Google Scholar 

  127. Wong, K. M., Suchard, M. A. & Huelsenbeck, J. P. Alignment uncertainty and genomic analysis. Science 319, 473–476 (2008).

    CAS  PubMed  Google Scholar 

  128. Libiger, O., Nievergelt, C. M. & Schork, N. J. Comparison of genetic distance measures using human SNP genotype data. Hum. Biol. 81, 389–406 (2009).

    PubMed  Google Scholar 

  129. Hill, M. O. Diversity and evenness — unifying notation and its consequences. Ecology 54, 427–432 (1973).

    Google Scholar 

  130. Keylock, C. J. Simpson diversity and the Shannon–Wiener index as special cases of a generalized entropy. Oikos 109, 203–207 (2005).

    Google Scholar 

  131. Lande, R. Statistics and partitioning of species diversity, and similarity among multiple communities. Oikos 76, 5–13 (1996).

    Google Scholar 

  132. Jost, L. et al. Partitioning diversity for conservation analyses. Divers. Distrib. 16, 65–76 (2010).

    Google Scholar 

  133. Johansen, C. T. et al. Excess of rare variants in genes identified by genome-wide association study of hypertriglyceridemia. Nature Genet. 42, 684–687 (2010).

    CAS  PubMed  Google Scholar 

  134. Romeo, S. et al. Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. J. Clin. Invest. 119, 70–79 (2009).

    CAS  PubMed  Google Scholar 

  135. Slatter, T. L., Jones, G. T., Williams, M. J., van Rij, A. M. & McCormick, S. P. Novel rare mutations and promoter haplotypes in ABCA1 contribute to low-HDL-C levels. Clin. Genet. 73, 179–184 (2008).

    CAS  PubMed  Google Scholar 

  136. Marini, N. J. et al. The prevalence of folate-remedial MTHFR enzyme variants in humans. Proc. Natl Acad. Sci. USA 105, 8055–8060 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  137. Ji, W. et al. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nature Genet. 40, 592–599 (2008).

    CAS  PubMed  Google Scholar 

  138. Frikke-Schmidt, R., Sing, C. F., Nordestgaard, B. G., Steffensen, R. & Tybjaerg-Hansen, A. Subsets of SNPs define rare genotype classes that predict ischemic heart disease. Hum. Genet. 120, 865–877 (2007).

    CAS  PubMed  Google Scholar 

  139. Azzopardi, D. et al. Multiple rare nonsynonymous variants in the adenomatous polyposis coli gene predispose to colorectal adenomas. Cancer Res. 68, 358–363 (2008).

    CAS  PubMed  Google Scholar 

  140. Masson, E., Chen, J. M., Scotet, V., Le Marechal, C. & Ferec, C. Association of rare chymotrypsinogen C (CTRC) gene variations in patients with idiopathic chronic pancreatitis. Hum. Genet. 123, 83–91 (2008).

    CAS  PubMed  Google Scholar 

  141. Ma, X. et al. Full-exon resequencing reveals Toll-like receptor variants contribute to human susceptibility to tuberculosis disease. PLoS ONE 2, e1318 (2007).

    PubMed  PubMed Central  Google Scholar 

  142. Ahituv, N. et al. Medical sequencing at the extremes of human body mass. Am. J. Hum. Genet. 80, 779–791 (2007).

    CAS  PubMed  PubMed Central  Google Scholar 

  143. Wang, J. et al. Resequencing genomic DNA of patients with severe hypertriglyceridemia (MIM 144650). Arterioscler. Thromb. Vasc. Biol. 27, 2450–2455 (2007).

  144. Cohen, J. C., Boerwinkle, E., Mosley, T. H. Jr & Hobbs, H. H. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N. Engl. J. Med. 354, 1264–1272 (2006).

  145. Kotowski, I. K. et al. A spectrum of PCSK9 alleles contributes to plasma levels of low-density lipoprotein cholesterol. Am. J. Hum. Genet. 78, 410–422 (2006).

  146. Cohen, J. C. et al. Multiple rare variants in NPC1L1 associated with reduced sterol absorption and plasma low-density lipoprotein levels. Proc. Natl Acad. Sci. USA 103, 1810–1815 (2006).

  147. Cohen, J. et al. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nature Genet. 37, 161–165 (2005).

  148. Cohen, J. C. et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305, 869–872 (2004). One of the first papers to explicitly consider the association and effect of a collection of rare variants on a complex phenotype.

  149. Fearnhead, N. S. et al. Multiple rare variants in different genes account for multifactorial inherited susceptibility to colorectal adenomas. Proc. Natl Acad. Sci. USA 101, 15992–15997 (2004).

  150. Calvo, S. E. et al. High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nature Genet. 5 Sept 2010 (doi:10.1038/ng.659).

Acknowledgements

This work was supported in part by the following research grants: U19 AG023122-05, R01 MH078151-03, N01 MH22005, U01 DA024417-01, P50 MH081755-01, R01 AG030474-02, N01 MH022005, R01 HL089655-02, R01 MH080134-03, U54 CA143906-01, UL1 RR025774-03 as well as the Price Foundation and Scripps Genomic Medicine. O.L. is also supported by a grant from Charles University: GAUK 134,609. The authors would like to thank E. Topol, S. Murray, S. Levy and the entire team at The Scripps Translational Science Institute for support.

Author information

Corresponding author

Correspondence to Nicholas J. Schork.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Nicholas J. Schork is at The Scripps Translational Science Institute

BioCarta

Cytoscape

DAVID Bioinformatics Resource

FASTSNP

F-SNP

GeneGo

Gene Ontology

GenMAPP

Human Splicing Finder

Ingenuity Pathway Analysis

Kyoto Encyclopedia of Genes and Genomes

MutDB

Nature Reviews Genetics series on Genome-wide association studies

Nature Reviews Genetics series on Study Designs

PharmGKB

PolyDoms

PupaSuite

Reactome

SeattleSeq

Sequence Variant Analyzer

SNP@Domain

SNPeffect

SNP Functional Portal

TFsearch

Trait-o-matic

University of California, Santa Cruz (UCSC) genome browser

Glossary

Contingency table

A way of representing categorical data in a matrix that is often used to record and analyse the relationship between two or more categorical variables. Also referred to as cross-tabulation or a cross-tab table.
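
As an illustration only, the following minimal Python sketch cross-tabulates case or control status against carrier status for any rare variant in a region, then computes a Pearson chi-square statistic from the table. The function names and the toy data are hypothetical and are not taken from the article.

```python
from collections import Counter

def contingency_table(statuses, carriers):
    """Cross-tabulate case/control status against rare-variant carrier status."""
    counts = Counter(zip(statuses, carriers))
    # Rows: case, control. Columns: carrier, non-carrier.
    return [[counts[(s, c)] for c in (True, False)] for s in ("case", "control")]

def chi_square(table):
    """Pearson chi-square statistic for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical toy data: 6 cases and 6 controls.
statuses = ["case"] * 6 + ["control"] * 6
carriers = [True, True, True, True, False, False,
            True, False, False, False, False, False]
table = contingency_table(statuses, carriers)
statistic = chi_square(table)
```

With counts this small an exact test would usually be preferred to the chi-square approximation; the statistic is shown only to make the table's use concrete.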

Regression method

A statistical method for predicting or relating a variable (or set of variables) known as the dependent variable to another variable (or set of variables) known as the independent or predictor variable. The resulting relationship defines a regression function.

Compound heterozygosity

A situation in medical genetics in which two normally recessive alleles of a gene cause disease when they are located on different chromosome homologues in the same individual.

Population stratification

The phenomenon of an apparently homogeneous population that is actually composed of subgroups of individuals with distinct ancestral origins and differing allele frequencies at many loci. This leads to bias in assessing the significance of associations of a trait with particular loci.

Multiple testing

In statistics, multiple testing occurs when one considers more than one statistical inference from a single data set. Errors in inference are more likely to occur when one considers all the inferences as a whole.
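
A short Python sketch can make the inflation of error concrete. It assumes that the tests are independent and uses the Bonferroni adjustment as one simple correction; both are assumptions made for illustration, and the function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

# Twenty independent tests, each run at a nominal level of 0.05.
fwer = familywise_error_rate(0.05, 20)
per_test = bonferroni_threshold(0.05, 20)
```

Under these assumptions the chance of at least one false positive across the twenty tests is roughly 0.64, far above the nominal 0.05.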

Imputation

Based on the known linkage disequilibrium structure in fully genotyped individuals, the genotype of untyped variants can be inferred or imputed in individuals who are genotyped for a smaller number of variants.

Conditional test

In regression analysis, additional variables (or covariates) can be included in the model; that is, the model can be conditioned on the additional variables. A conditional test of the relationship between the primary independent variable and the dependent variable can then be performed.

Covariate effect

The influence of non-primary independent variables on the relationship between a primary independent variable and a dependent variable in a regression analysis setting.

Quantile

A point taken at regular intervals in the cumulative distribution function of a random variable. Quantiles are used to define discrete categories of the variable.
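
The sketch below, which is illustrative rather than a prescribed procedure, uses the Python standard library to compute quartile cut points for a hypothetical quantitative phenotype and to assign each individual to a discrete category. The helper name and the data are assumptions.

```python
from bisect import bisect_left
from statistics import quantiles

def quantile_categories(values, n_groups=4):
    """Assign each value to one of `n_groups` categories defined by sample quantiles."""
    # method="inclusive" treats the sample minimum and maximum as the
    # 0th and 100th percentiles of the distribution.
    cut_points = quantiles(values, n=n_groups, method="inclusive")
    # A value equal to a cut point is placed in the lower category.
    categories = [bisect_left(cut_points, v) for v in values]
    return categories, cut_points

# Hypothetical phenotype measurements for nine individuals.
phenotype = [1, 2, 3, 4, 5, 6, 7, 8, 9]
categories, cut_points = quantile_categories(phenotype)
```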

Stratified analysis

A data analysis that proceeds by dividing the units of observation into groups and analysing these groups independently.

Group summary information

Statistics that capture frequencies, counts and other measures that reflect information at the population or sample level, in contrast to measures reflecting information that is unique to each individual.

Moving window

A method for testing genetic associations in which a subregion of a larger region is defined. Variants within the defined region are tested for association, then the region is shifted to an adjacent region and the process is repeated until all the subregions have been assessed.
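
The windowing step can be sketched as follows. This is a minimal Python illustration in which the function name, the base-pair coordinates and the window and step sizes are all hypothetical; in practice, each window's variants would be passed to whichever association test is being used.

```python
def moving_windows(positions, window_size, step):
    """Yield (start, end, indices) for successive windows over variant positions.

    `indices` lists the variants whose position falls in [start, end).
    A step equal to the window size gives adjacent, non-overlapping windows;
    a smaller step gives overlapping windows.
    """
    if not positions:
        return
    ordered = sorted(range(len(positions)), key=lambda i: positions[i])
    start = positions[ordered[0]]
    last = positions[ordered[-1]]
    while start <= last:
        end = start + window_size
        indices = [i for i in ordered if start <= positions[i] < end]
        yield start, end, indices
        start += step

# Hypothetical variant positions (in base pairs) within a larger region.
variant_positions = [100, 150, 420, 460, 900]
windows = list(moving_windows(variant_positions, window_size=300, step=300))
```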

Type I error rate

The probability of a false-positive result from a statistical hypothesis test.

Permutation method

A strategy for assessing the probability of observing the value of a particular statistic. The probability is computed from a data set in which the data are randomly shuffled and the statistic is recomputed from the shuffled data many times and ultimately compared to the value of the statistic obtained with the non-shuffled data.
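
As a minimal sketch of this strategy, the Python code below shuffles case and control labels to assess a difference in mean rare-variant burden. The choice of statistic, the function names and the toy data are assumptions made for illustration, not the article's own procedure.

```python
import random

def burden_difference(burdens, is_case):
    """Mean rare-allele count in cases minus mean rare-allele count in controls."""
    cases = [b for b, c in zip(burdens, is_case) if c]
    controls = [b for b, c in zip(burdens, is_case) if not c]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

def permutation_p_value(burdens, is_case, n_permutations=2000, seed=0):
    """Two-sided permutation p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    observed = abs(burden_difference(burdens, is_case))
    labels = list(is_case)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        if abs(burden_difference(burdens, labels)) >= observed:
            extreme += 1
    # Adding one to numerator and denominator keeps the estimate above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical per-individual counts of rare alleles in a region.
burdens = [3, 2, 4, 3, 2, 3, 0, 1, 0, 1, 0, 0]
is_case = [True] * 6 + [False] * 6
p_value = permutation_p_value(burdens, is_case)
```

Shuffling labels rather than genotypes preserves the correlation structure among variants, which is one reason this kind of test is often used with collapsed rare-variant statistics.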

Phase information

The nucleotide content of each of the homologous chromosomes in a diploid individual.

Fst and Gst statistics

Two classical measures of population differentiation at the nucleotide level. Essentially, Fst and Gst capture and quantify the allele frequency differences between populations.
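
For a single biallelic variant in two populations, one common textbook formulation compares the expected heterozygosity of the pooled population with the average within-population heterozygosity. The Python sketch below assumes equal population sizes and known allele frequencies; it is one of several possible estimators, and the function name is hypothetical.

```python
def fst_biallelic(p1, p2):
    """Fst for one biallelic variant given its allele frequency in two populations.

    Assumes equally sized populations and uses Fst = (H_T - H_S) / H_T, where
    H_T is the expected heterozygosity of the pooled population and H_S is the
    mean expected heterozygosity within populations.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # The variant is absent or fixed in both populations.
        return 0.0
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total

identical = fst_biallelic(0.2, 0.2)      # no differentiation
diverged = fst_biallelic(0.1, 0.5)       # moderate differentiation
fixed = fst_biallelic(0.0, 1.0)          # complete differentiation
```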

Logic regression

A regression analysis procedure in which sets of independent variables are grouped together using logical operators such as 'AND' and 'OR.' These sets of independent variables, rather than the individual variables themselves, are tested for association with a dependent variable.

Non-parametric approach

A statistical analysis method that does not rely on specific distributional assumptions (for example, normality) for the variables being analysed.

Multicollinearity

The situation in which two or more predictors (or subsets of predictors) are strongly (but not perfectly) correlated with one another, making it difficult to interpret the strength of the effect of each predictor (or predictor subset). For example, it would be hard to detect a gene if its effect is 'absorbed' (or masked) by combinations of genetic background action or interaction parameters in the model.

Overfitting

A phenomenon in which the predictions of a dependent variable, based on a set of independent variables in a regression setting, are compromised because many more independent variables are used in the prediction than there are individuals who have been measured on these independent variables.

Regularization and shrinkage

A method for combating overfitting in regression models. Most independent variables are assumed to make only a small or non-existent contribution to the prediction of a dependent variable. Hence their impact is shrunk or regularized to be close to zero when estimating the parameters that govern the regression model.
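
Shrinkage can be illustrated with ridge regression on a single predictor, in which a penalty term is added to the denominator of the least-squares slope. Ridge is used here only as one familiar example of regularization; the function name and data in this Python sketch are hypothetical.

```python
def ridge_slope(x, y, penalty):
    """Slope of a ridge regression of y on one centred predictor x.

    With penalty = 0 this is the ordinary least-squares slope; larger
    penalties shrink the estimate towards zero.
    """
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / (sxx + penalty)

# Hypothetical data: allele count (x) and a quantitative phenotype (y).
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
unpenalized = ridge_slope(x, y, penalty=0)
shrunk = ridge_slope(x, y, penalty=10)
heavily_shrunk = ridge_slope(x, y, penalty=90)
```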

About this article

Cite this article

Bansal, V., Libiger, O., Torkamani, A. et al. Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet 11, 773–785 (2010). https://doi.org/10.1038/nrg2867

  • DOI: https://doi.org/10.1038/nrg2867
