The first phase of genome-wide association studies (GWAS) assessed the role of common variation in human disease. Advances optimizing and economizing high-throughput sequencing have enabled a second phase of association studies that assess the contribution of rare variation to complex disease in all protein-coding genes. Unlike the early microarray-based studies, sequencing-based studies catalogue the full range of genetic variation, including the evolutionarily youngest forms. Although the experience with common variants helped establish relevant standards for genome-wide studies, the analysis of rare variation introduces several challenges that require novel analysis approaches.
Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. Am. J. Hum. Genet. 90, 7–24 (2012).
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Goldstein, D. B. Common genetic variation and human traits. N. Engl. J. Med. 360, 1696–1698 (2009).
Cirulli, E. T. & Goldstein, D. B. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 11, 415–425 (2010).
Gibson, G. Rare and common variants: twenty arguments. Nat. Rev. Genet. 13, 135–145 (2012).
Need, A. C. et al. Clinical application of exome sequencing in undiagnosed genetic conditions. J. Med. Genet. 49, 353–361 (2012).
Zhu, X. et al. Whole-exome sequencing in undiagnosed genetic diseases: interpreting 119 trios. Genet. Med. 17, 774–781 (2015).
Yang, Y. et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA 312, 1870–1879 (2014).
Appenzeller, S. et al. De novo mutations in synaptic transmission genes including DNM1 cause epileptic encephalopathies. Am. J. Hum. Genet. 95, 360–370 (2014).
Homsy, J. et al. De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science 350, 1262–1266 (2015).
Fitzgerald, T. W. et al. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015).
Gilissen, C., Hoischen, A., Brunner, H. G. & Veltman, J. A. Unlocking Mendelian disease using exome sequencing. Genome Biol. 12, 228 (2011).
Cirulli, E. T. et al. Exome sequencing in amyotrophic lateral sclerosis identifies risk genes and pathways. Science 347, 1436–1441 (2015). Cirulli et al. present one of the first implementations of collapsing analyses in a case–control study of a complex disease, introducing the qualifying-variant framework, coverage correction and other methodological details.
Petrovski, S. et al. An exome sequencing study to assess the role of rare genetic variation in pulmonary fibrosis. Am. J. Respir. Crit. Care Med. 196, 82–93 (2017).
Allen, A. S. et al. Ultra-rare genetic variation in common epilepsies: a case–control sequencing study. Lancet Neurol. 16, 135–143 (2017). This study provides an implementation of collapsing analyses in epilepsy that explicitly evaluates signal as a function of MAF, showing that the association signal observed in epilepsy genes is concentrated amongst the rarest variants.
Traynelis, J. et al. Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation. Genome Res. 27, 1715–1729 (2017).
Hayeck, T. J. et al. Improved pathogenic variant localization via a hierarchical model of sub-regional intolerance. Am. J. Hum. Genet. 104, 299–309 (2019). This research uses a hierarchical model for regional intolerance that can jointly use genome-wide, genic and sub-region-level information.
Gussow, A. B., Petrovski, S., Wang, Q., Allen, A. S. & Goldstein, D. B. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biol. 17, 9 (2016). This paper and Samocha et al. (the following entry) introduce regional intolerance scoring.
Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).
Havrilla, J. M., Pedersen, B. S., Layer, R. M. & Quinlan, A. R. A map of constrained coding regions in the human genome. Nat. Genet. 51, 88–95 (2019).
Guo, M. H. et al. Determinants of power in gene-based burden testing for monogenic disorders. Am. J. Hum. Genet. 99, 527–539 (2016).
Skol, A. D., Scott, L. J., Abecasis, G. R. & Boehnke, M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38, 209–213 (2006).
Asimit, J. L., Day-Williams, A. G., Morris, A. P. & Zeggini, E. ARIEL and AMELIA: testing for an accumulation of rare variants using next-generation sequencing data. Hum. Hered. 73, 84–94 (2012).
Morris, A. P. & Zeggini, E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet. Epidemiol. 34, 188–193 (2010).
Li, B. & Leal, S. M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321 (2008). This study presents one of the early burden-testing methods for rare variants.
Madsen, B. E. & Browning, S. R. A groupwise association test for rare mutations using a weighted sum statistic. PLOS Genet. 5, e1000384 (2009).
Han, F. & Pan, W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum. Hered. 70, 42–54 (2010).
Liu, D. J. & Leal, S. M. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLOS Genet. 6, e1001156 (2010).
Ionita-Laza, I., Buxbaum, J. D., Laird, N. M. & Lange, C. A new testing strategy to identify rare variants with either risk or protective effect on disease. PLOS Genet. 7, e1001289 (2011).
Hoffmann, T. J., Marini, N. J. & Witte, J. S. Comprehensive approach to analyzing rare genetic variants. PLOS ONE 5, e13584 (2010).
Price, A. L. et al. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 86, 832–838 (2010).
Neale, B. M. et al. Testing for an unusual distribution of rare variants. PLOS Genet. 7, e1001322 (2011).
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011). This study introduces a score-based variance-component test (SKAT) that allows for modelling bidirectional effects.
Lee, S. et al. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet. 91, 224–237 (2012). SKAT-O is a unified test that combines burden tests with the non-burden sequence kernel association test.
Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).
Gao, F. et al. XWAS: a software toolset for genetic data analysis and association studies of the X chromosome. J. Hered. 106, 666–671 (2015).
Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371, 2477–2487 (2014).
Buscarlet, M. et al. DNMT3A and TET2 dominate clonal hematopoiesis and demonstrate benign phenotypes and different genetic predispositions. Blood 130, 753–762 (2017).
Xie, M. et al. Age-related mutations associated with clonal hematopoietic expansion and malignancies. Nat. Med. 20, 1472–1478 (2014).
Carlston, C. M. et al. Pathogenic ASXL1 somatic variants in reference databases complicate germline variant interpretation for Bohring-Opitz syndrome. Hum. Mutat. 38, 517–523 (2017).
Lippert, C. et al. Fast linear mixed models for genome-wide association studies. Nat. Methods 8, 833–835 (2011).
Oualkacha, K. et al. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet. Epidemiol. 37, 366–376 (2013).
Petrovski, S. & Goldstein, D. B. Unequal representation of genetic variation across ancestry groups creates healthcare inequality in the application of precision medicine. Genome Biol. 17, 157 (2016). This study emphasizes the importance of the geographic ancestry of controls for both collapsing analyses and identifying pathogenic mutations in patients.
Zhu, X. et al. A case-control collapsing analysis identifies epilepsy genes implicated in trio sequencing studies focused on de novo mutations. PLOS Genet. 13, e1007104 (2017). This report illustrates that collapsing analyses in a case–control design focused on the rarest variants can pick up the same variants as analyses of de novo mutations using trios.
Hu, Y.-J., Liao, P., Johnston, H. R., Allen, A. S. & Satten, G. A. Testing rare-variant association without calling genotypes allows for systematic differences in sequencing between cases and controls. PLOS Genet. 12, e1006040 (2016).
Guo, M. H., Plummer, L., Chan, Y.-M., Hirschhorn, J. N. & Lippincott, M. F. Burden testing of rare variants identified through exome sequencing via publicly available control data. Am. J. Hum. Genet. 103, 522–534 (2018).
Raghavan, N. S. et al. Whole-exome sequencing in 20,197 persons for rare variants in Alzheimer’s disease. Ann. Clin. Transl. Neurol. 5, 832–842 (2018).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Sim, N.-L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452–W457 (2012).
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).
Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161–1170 (2018). This study describes a deep neural network trained on hundreds of thousands of common variants from population sequencing of six non-human primate species that can identify pathogenic variants.
Gelfman, S. et al. Annotating pathogenic non-coding variants in genic regions. Nat. Commun. 8, 236 (2017).
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).
Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. Preprint at bioRxiv https://doi.org/10.1101/531210 (2019).
Ganna, A. et al. Quantifying the impact of rare and ultra-rare coding variation across the phenotypic spectrum. Am. J. Hum. Genet. 102, 1204–1211 (2018). Ganna et al. show that across multiple phenotypes, rarer PTVs are on average more deleterious, with the strongest signal coming from ultra-rare variants.
Cameron-Christie, S. et al. Exome-based rare-variant analyses in CKD. J. Am. Soc. Nephrol. 30, 1109–1122 (2019).
Cirulli, E. T. et al. Genome-wide rare variant analysis for thousands of phenotypes in 54,000 exomes. Preprint at bioRxiv https://doi.org/10.1101/692368 (2019). This analysis is the first to look for rare-variant associations in thousands of phenotypes across two large cohorts, including the UK Biobank data.
Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
Wang, X. Firth logistic regression for rare variant association tests. Front. Genet. 5, 187 (2014).
Firth, D. Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38 (1993).
Heinze, G. & Puhr, R. Bias-reduced and separation-proof conditional logistic regression with small or sparse data sets. Stat. Med. 29, 770–777 (2010).
Dey, R., Schmidt, E. M., Abecasis, G. R. & Lee, S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am. J. Hum. Genet. 101, 37–49 (2017).
Sham, P. C. & Purcell, S. M. Statistical power and significance testing in large-scale genetic studies. Nat. Rev. Genet. 15, 335–346 (2014).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Dewey, F. E. et al. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 354, aaf6814 (2016).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Preprint at bioRxiv https://doi.org/10.1101/563866 (2019).
Genovese, G. et al. Increased burden of ultra-rare protein-altering variants among 4,877 individuals with schizophrenia. Nat. Neurosci. 19, 1433–1441 (2016).
Gelfman, S. et al. A new approach for rare variation collapsing on functional protein domains implicates specific genic regions in ALS. Genome Res. 29, 809–818 (2019).
Petrovski, S. et al. The intolerance of regulatory sequence to genetic variation predicts gene dosage sensitivity. PLOS Genet. 11, e1005492 (2015).
Baulac, S. et al. Evidence for digenic inheritance in a family with both febrile convulsions and temporal lobe epilepsy implicating chromosomes 18qter and 1q25-q31. Ann. Neurol. 49, 786–792 (2001).
Ito, M. et al. Phenotypes and genotypes in epilepsy with febrile seizures plus. Epilepsy Res. 70, 199–205 (2006).
Fauser, S., Munz, M. & Besch, D. Further support for digenic inheritance in Bardet–Biedl syndrome. J. Med. Genet. 40, e104 (2003).
Katsanis, N. et al. Triallelic inheritance in Bardet–Biedl syndrome, a Mendelian recessive disorder. Science 293, 2256–2259 (2001).
Schäffer, A. A. Digenic inheritance in medical genetics. J. Med. Genet. 50, 641–652 (2013).
Glasscock, E., Qian, J., Yoo, J. W. & Noebels, J. L. Masking epilepsy by combining two epilepsy genes. Nat. Neurosci. 10, 1554–1558 (2007).
Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLOS Comput. Biol. 6, e1001025 (2010).
Ritchie, G. R. S., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).
Gussow, A. B. et al. Orion: detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLOS ONE 12, e0181604 (2017).
Wang, X. & Goldstein, D. B. Enhancer redundancy predicts gene pathogenicity and informs complex disease gene discovery. Preprint at bioRxiv https://doi.org/10.1101/459123 (2018).
An, J.-Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).
Fisher, R. A. Statistical Methods for Research Workers (Oliver and Boyd, 1932).
Stouffer, S. A., Suchman, E. A., Devinney, L. C., Star, S. A. & Williams Jr, R. M. The American soldier: Adjustment during army life. (Studies in social psychology in World War II) Vol. 1 (Princeton Univ. Press, 1949).
Liu, L. et al. Analysis of rare, exonic variation amongst subjects with autism spectrum disorders and population controls. PLOS Genet. 9, e1003443 (2013).
Tang, Z.-Z. & Lin, D.-Y. MASS: meta-analysis of score statistics for sequencing studies. Bioinformatics 29, 1803–1805 (2013).
Tang, Z.-Z. & Lin, D.-Y. Meta-analysis of sequencing studies with heterogeneous genetic associations. Genet. Epidemiol. 38, 389–401 (2014).
Feng, S., Liu, D., Zhan, X., Wing, M. K. & Abecasis, G. R. RAREMETAL: fast and powerful meta-analysis for rare variants. Bioinformatics 30, 2828–2829 (2014).
Liu, D. J. et al. Meta-analysis of gene-level tests for rare variant association. Nat. Genet. 46, 200–204 (2014).
Lee, S., Teslovich, T. M., Boehnke, M. & Lin, X. General framework for meta-analysis of rare variants in sequencing association studies. Am. J. Hum. Genet. 93, 42–53 (2013).
Bagnall, R. D. et al. Exome-based analysis of cardiac arrhythmia, respiratory control, and epilepsy genes in sudden unexpected death in epilepsy. Ann. Neurol. 79, 522–534 (2016).
Sanna-Cherchi, S. et al. Exome-wide association study identifies GREB1L mutations in congenital kidney malformations. Am. J. Hum. Genet. 101, 789–802 (2017).
Freischmidt, A. et al. Haploinsufficiency of TBK1 causes familial ALS and fronto-temporal dementia. Nat. Neurosci. 18, 631–636 (2015).
Farhan, S. M. K. et al. Enrichment of rare protein truncating variants in amyotrophic lateral sclerosis patients. Preprint at bioRxiv https://doi.org/10.1101/307835 (2018).
Christophersen, I. E. et al. Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat. Genet. 49, 946–952 (2017).
Bellenguez, C. et al. Contribution to Alzheimer’s disease risk of rare variants in TREM2, SORL1, and ABCA7 in 1779 cases and 1273 controls. Neurobiol. Aging 59, 220.e1–220.e9 (2017).
Singh, T. et al. The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nat. Genet. 49, 1167–1173 (2017).
Do, R. et al. Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction. Nature 518, 102–106 (2015).
Groopman, E. E. et al. Diagnostic utility of exome sequencing for kidney disease. N. Engl. J. Med. 380, 142–151 (2019).
Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl Acad. Sci. USA 113, 11901–11906 (2016).
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Med. 12, e1001779 (2015).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Van Hout, C. V. et al. Whole exome sequencing and characterization of coding variation in 49,960 individuals in the UK Biobank. Preprint at bioRxiv https://doi.org/10.1101/572347 (2019).
Zhang, D. et al. SEQSpark: a complete analysis tool for large-scale rare variant association studies using whole-genome and exome sequence data. Am. J. Hum. Genet. 101, 115–122 (2017).
Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).
The authors thank T. Hayeck for creating the figure in Box 2.
The authors declare no competing interests.
Peer review information
Nature Reviews Genetics thanks S. Lee, X. Lin and B. Neale for their contribution to the peer review of this work.
Digenic analysis tool: https://github.com/igm-team/Digenic
UK Biobank: https://www.ukbiobank.ac.uk/
- Penetrant alleles
Alleles highly associated with a trait; the more penetrant the allele, the higher the percentage of individuals with that allele who also express a disease phenotype.
- Deleterious variation
Genetic variation that is predicted to disrupt gene function and therefore lead to reduced fitness.
- Allelic heterogeneity
The presence of different pathogenic variants in the same gene or at the same chromosome locus that all lead to the same or to very similar phenotypes.
- Causal allele
A functional allele that increases disease risk.
- Haploinsufficient disease genes
Disease-associated genes for which a single functional copy is insufficient to maintain normal function. Therefore, loss-of-function alleles are pathogenic even when heterozygous.
- Background variation
Usually benign variants in the general population that are unconnected to the disease.
- Bidirectional effects
Effects within a given gene, wherein some variants increase risk of disease, while others reduce risk.
- Transition/transversion ratio
(Ti/Tv). Ratio of the number of transitions (interchanges of two-ring purines (A to G or vice versa) or of one-ring pyrimidines (C to T or vice versa)) to the number of transversions (interchanges of purine for pyrimidine bases).
- Index samples
Individual samples or patients who are the focus of a study.
- Consanguineous populations
Populations in which marriages between people who are second cousins or closer are common.
- Bottlenecked populations
Populations that have gone through a severe and abrupt reduction in their number of individuals, which often leads to reduced genetic diversity.
- Population stratification
Also known as population structure. Presence of a difference in allele frequencies due to systematic differences in ancestry between cases and controls.
- Phred quality
(QUAL). The Phred-scaled posterior probability that all samples in a call set consist of homozygous reference alleles.
- Genotype Phred quality
(GQ). Represents the Phred-scaled confidence that the genotype assignment is correct for a given sample.
- Quality by depth
(QD). The Phred quality (QUAL) score normalized by allele depth for a variant.
- Mapping quality
(MQ). Estimation of the overall mapping quality of reads supporting a variant call.
- Variant quality score log-odds
(VQSLOD). A score, produced by the Genome Analysis Toolkit’s variant quality score recalibration, that represents the log-odds ratio of a variant being true versus being false under the trained Gaussian mixture model.
- Trio sequencing
Procedure in which the index patient and both parents are sequenced in order to identify causative variants in the patient.
- In phase
Alleles that belong to the same parental haplotype and therefore affect the same copy of a gene; variants that are not in phase are on different haplotypes and therefore affect both copies of a gene.
- Compound heterozygous
Presence of two different mutant alleles in a particular gene that affect both copies of the gene because they are not in phase.
- Diagnostic yield
Rate of discovered diagnostic variants within a collection of cases being tested.
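Two of the quality-control quantities defined in the glossary above can be made concrete with a short calculation. The following is a minimal Python sketch, not code from the article: it assumes single-nucleotide variants are given as hypothetical `(ref, alt)` base pairs, and it uses the standard Phred convention Q = -10 log10(P), which is what "Phred-scaled" refers to in the QUAL and GQ entries. The function names are illustrative only.

```python
import math

# Two-ring purines and one-ring pyrimidines, as in the Ti/Tv glossary entry.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True if the SNV swaps purine for purine or pyrimidine for pyrimidine."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs) -> float:
    """Ratio of transitions to transversions over an iterable of (ref, alt) pairs."""
    snvs = list(snvs)
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("Ti/Tv is undefined when there are no transversions")
    return transitions / transversions


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled score Q to the probability P it encodes."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Convert a probability P in (0, 1] to a Phred-scaled score Q."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print("Ti/Tv:", ti_tv_ratio(example_snvs))
    print("P(genotype wrong) at GQ=20:", phred_to_prob(20))
```

Under this convention, a genotype quality (GQ) of 20 corresponds to a 1% probability that the genotype assignment is wrong, and each additional 10 points reduces that probability tenfold.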
Cite this article
Povysil, G., Petrovski, S., Hostyk, J. et al. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat Rev Genet 20, 747–759 (2019) doi:10.1038/s41576-019-0177-4