Since the completion of the Human Genome Project and the technological revolution that it launched, the challenges of producing high-quality whole-genome sequence (WGS) data have largely been met. Today, large-scale international studies are beginning to sequence the genomes of many thousands of individuals in order to understand the genetic etiology of common and rare, complex and Mendelian human traits and diseases.1

Recently, several smaller projects, based on both low-coverage whole-genome and high-coverage whole-exome sequence (WES) data, have developed strategies to overcome many of the technical challenges (eg, data compression, sequence alignment and genotype calling) associated with such large-scale projects.2, 3 Nevertheless, as large-scale WGS data are generated with the goal of discovering genetic associations that inform disease treatment and prevention, significant scientific and computational challenges remain. As projects such as the Precision Medicine Initiative are launched,1, 4 experiences from the era of large-scale WES can inform the design and analysis of large-scale WGS data.

For all studies, an important consideration is the statistical power to detect associations, which is a function of the genetic architecture (ie, the allele frequencies and effect sizes). Because the vast majority of variants within the genome are rare, with allele frequencies <1%,2, 3 a comprehensive search for genetic associations must account for rare variants. As the results from recent studies attest, many rare-variant associations display moderate to modest effect sizes (odds ratios <1.4 for binary traits and effect sizes <0.5 trait standard deviations for quantitative traits).5, 6, 7, 8

If these results hold true for rare variants generally, the power to detect rare-variant associations will be severely limited compared with common-variant association studies, and sufficiently powered studies may require even larger sample sizes than genome-wide association studies of common variants. Future studies focused on rare-variant associations should build on the progress that innovative statistical techniques for rare-variant association testing, as well as efficient study designs, have made in addressing the issue of limited statistical power.
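
To make the sample-size implications concrete, the minimal sketch below approximates power for a single-variant score test on a standardized quantitative trait, taking the non-centrality parameter as the sample size multiplied by the variance explained, 2p(1-p)b^2. The minor allele frequency of 0.5%, the effect size of 0.25 trait standard deviations and the significance threshold are illustrative assumptions, not values drawn from the studies cited above.

```python
# Minimal sketch: approximate power of a 1-df chi-square test for a single
# rare variant on a standardized quantitative trait under an additive model.
# The MAF, effect size and significance threshold below are illustrative.
from scipy import stats

def approx_power(n, maf, beta_sd, alpha=5e-8):
    # Non-centrality = sample size x variance explained by the variant.
    var_explained = 2 * maf * (1 - maf) * beta_sd ** 2
    ncp = n * var_explained
    crit = stats.chi2.ppf(1 - alpha, df=1)
    return stats.ncx2.sf(crit, df=1, nc=ncp)

for n in (10_000, 50_000, 100_000, 250_000):
    print(f"n = {n:>7,}: power ~ {approx_power(n, maf=0.005, beta_sd=0.25):.2f}")
```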

One study design that can be used to increase power to detect associations for quantitative traits, compared with random ascertainment, is the extreme trait sampling design.9 Briefly, an extreme trait design considers the entire distribution of a quantitative trait and selects samples from the tails or ‘extremes’ of the distribution. This technique can be generalized to case–control studies by sampling from the extremes of various risk factors, for example, early-onset cases and older, high-risk controls, which may enhance the power to detect associations. The power of an extreme trait design is driven by the size of the underlying population from which the samples are selected.10, 11 There are multiple examples of new rare-variant associations being discovered from extreme samples drawn from large, population-based studies.12, 13, 14 Though extreme trait sampling represents a powerful approach, it may not be suitable for every type of study. First, conclusions from extreme trait designs may be difficult to generalize because of differences in genetic architecture at the extremes of a quantitative trait. For instance, the stature of individuals in the extremes of height is typically driven by very few large-effect variants, whereas for individuals in the ‘middle’ of the distribution, height is typically determined by hundreds of loci of small effect.15 Second, unless specialized analytical methods are used that specifically acknowledge the sampling design, the analysis of secondary traits from an extreme trait design can lead to biased or false-positive findings.16
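
As a concrete illustration, the sketch below selects the tails of a quantitative trait from a population-based cohort; the cohort size, column names and 5% cutoffs are illustrative assumptions rather than recommended values.

```python
# Minimal sketch: select the upper and lower tails of a quantitative trait
# from a population-based cohort for sequencing. The column names and the
# 5% cutoffs are illustrative assumptions, not a prescribed protocol.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cohort = pd.DataFrame({
    "sample_id": np.arange(20_000),
    "trait": rng.normal(size=20_000),   # eg, an adjusted residual of the trait
})

lo, hi = cohort["trait"].quantile([0.05, 0.95])
extremes = cohort[(cohort["trait"] <= lo) | (cohort["trait"] >= hi)]
print(len(extremes), "samples selected from a cohort of", len(cohort))
```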

In addition to sample selection, large-scale sequencing projects need to weigh the trade-offs associated with sequencing depth and sample size. High-quality genotypes may be obtained from low-coverage sequencing (defined here as <10 ×) of whole genomes by using haplotype-aware genotype callers.3 However, this is not possible with WES data, as it is difficult to reconstruct accurate haplotypes from exonic variants alone. Though previous studies have shown that lower read depth may increase power by increasing the achievable sample size,17 WGS data are currently generated at 30 × coverage in most studies, as this is the standard protocol on the widely used Illumina HiSeq instrument (San Diego, CA, USA). For studies investigating structural variants and very rare or private variants, deep sequencing of at least 30 × remains the most reliable way to obtain high-quality variant calls.
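
The sketch below illustrates why low coverage alone is insufficient without haplotype-aware calling: it computes the probability that both alleles of a heterozygous site are observed at least once when read depth follows a Poisson distribution. The Poisson model and the depths shown are simplifying assumptions that ignore base-calling error and the linkage information exploited by haplotype-aware callers.

```python
# Minimal sketch: probability that both alleles of a heterozygous site are
# observed at least once, with read depth ~ Poisson(mean_depth) and reads
# drawn from either haplotype with probability 1/2. This ignores base errors
# and the haplotype information that rescues low-coverage WGS calling.
import numpy as np
from scipy import stats

def p_both_alleles_seen(mean_depth, max_depth=200):
    d = np.arange(max_depth + 1)
    p_depth = stats.poisson.pmf(d, mean_depth)
    p_both = np.where(d >= 2, 1.0 - 2.0 * 0.5 ** d, 0.0)  # 0 for depth 0 or 1
    return float(np.sum(p_depth * p_both))

for depth in (4, 10, 30):
    print(f"{depth}x coverage: P(both alleles seen) = {p_both_alleles_seen(depth):.3f}")
```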

An attractive alternative to direct sequencing is to impute sequence data into a set of samples with genotyping array data. The data from the 1000 Genomes Project have become a popular ‘reference panel’ for this type of genotype imputation. Newer sequence data sets, such as those from the UK10K project, have augmented the 1000 Genomes data to provide reference panels capable of accurately imputing variants of <1% allele frequency.18 The Haplotype Reference Consortium has gathered WGS data comprising 64 976 haplotypes and reports accurate imputation down to allele frequencies of 0.1%. With such rich, publicly available resources, genotype imputation provides a cost-effective strategy for investigating low-frequency and rare-variant associations.
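
The back-of-the-envelope calculation below shows why panel size governs how rare a variant can be imputed: the expected number of copies of an allele in the panel scales with the number of reference haplotypes. The 1000 Genomes haplotype count used here is an approximation, and the allele frequencies are illustrative.

```python
# Back-of-the-envelope sketch: expected copies of an allele in a reference
# panel as a function of panel size. The 1000 Genomes haplotype count is an
# approximation (~2,500 samples x 2); the allele frequencies are illustrative.
panels = {
    "1000 Genomes (approx.)": 5_000,
    "Haplotype Reference Consortium": 64_976,  # haplotype count cited in the text
}
for name, n_haplotypes in panels.items():
    for freq in (0.01, 0.001):
        print(f"{name}: ~{n_haplotypes * freq:.0f} copies at frequency {freq:.1%}")
```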

In addition to study design, statistical techniques can be leveraged to test for aggregate rare-variant associations, based on the premise that multiple rare variants within a gene or region contribute to an association. There are many points to consider when conducting rare-variant tests, and many models have been developed accordingly. The choice of a minor allele frequency cutoff for including variants in these tests has been debated since the first methods were proposed.19 Typically a 1% (or 0.5%) cutoff is enforced, though methods such as the Variable Threshold approach20 have been developed so that this cutoff is not defined arbitrarily and all possible cutoffs are considered. A related point is whether variants should be weighted according to minor allele frequency,21 or according to a prediction score for whether the variant is likely to be deleterious22 or to affect protein structure.23 Methods exist to test many different genetic models: there are fixed effect models such as the combined multivariate collapsing (CMC) approach, which tests a dominance model,19 and the GRANVIL approach (gene- or region-based analysis of variants of intermediate and low frequency), which tests an additive model;24 and random effect tests such as the Sequence Kernel Association Test (SKAT), which tests for heterogeneity of effect across variants (also referred to as a ‘variance component test’).25 There are adaptive tests that seek to estimate some of these parameters while simultaneously testing for association (eg, KBAC, VW-TOW),26, 27 as well as omnibus methods that test both fixed effect and variance component models simultaneously (eg, SKAT-O and MiST).28, 29 Lee et al30 present a comprehensive review of the statistical issues related to rare-variant association testing.
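
To make the fixed effect family concrete, the sketch below implements a simple weighted burden (collapsing) test on simulated data: rare variants within a gene are collapsed into a single weighted score per individual, which is then regressed on the phenotype. The simulated genotypes, the 1% MAF cutoff and the frequency-based weights are illustrative choices rather than a prescription, and none of the published methods named above is reproduced exactly.

```python
# Minimal sketch of a fixed effect burden test on a quantitative trait:
# rare variants (MAF < 1%) in a gene are collapsed into a single weighted
# score per individual, which is regressed on the phenotype. The simulated
# genotype matrix, MAF cutoff and weights are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 5_000, 40                                   # individuals, variants in the gene
true_mafs = rng.uniform(0.0005, 0.02, size=m)      # simulated allele frequencies
G = rng.binomial(2, true_mafs, size=(n, m)).astype(float)
y = rng.normal(size=n)                             # null phenotype for illustration

obs_maf = G.mean(axis=0) / 2.0                     # observed minor allele frequencies
rare = (obs_maf > 0) & (obs_maf < 0.01)            # MAF threshold (often 1% or 0.5%)
w = 1.0 / np.sqrt(obs_maf[rare] * (1.0 - obs_maf[rare]))  # up-weight rarer variants
burden = G[:, rare] @ w                            # one aggregate score per person

slope, intercept, r, p_value, se = stats.linregress(burden, y)
print(f"burden test p-value: {p_value:.3g}")
```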

Despite significant progress in the statistical literature on this topic, only a few empirical examples have emerged that demonstrate multiple rare variants within the same gene contributing to an association. Although there are examples of fixed effect tests (eg, CMC, GRANVIL) finding rare-variant associations that were undetected with random effect tests (eg, SKAT),5, 7 we are not aware of a single example where a random effect test has found a rare-variant association that was not also found by a fixed effect test.

As studies transition from WES to WGS, careful consideration of the proper unit of analysis is important. Although the exome offers a natural unit of analysis (ie, a gene) for aggregate rare-variant association methods, it is unclear how best to aggregate association signals outside of coding regions and whether genetic effects in enhancers, promoters and other elements related to gene regulation will be detectable by the same aggregate methods that are used for exomes. For WGS studies to uncover such aggregate signals, a better understanding of the non-coding genome will be crucial. However, aggregate rare-variant testing is ultimately a means of gaining statistical power to detect associations. With sample sizes in the hundreds of thousands, future studies may be well powered to detect individual rare-variant associations of modest effect, rendering moot the need to aggregate signal across genomic regions.
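
One plausible approach, sketched below, is to replace the gene with intervals from a regulatory annotation as the unit of analysis, grouping non-coding variants by the enhancer or promoter that contains them. The variant and interval tables here are hypothetical stand-ins for a parsed VCF and an annotation file; the region names are invented for illustration.

```python
# Minimal sketch: group variants into non-coding aggregation units defined by
# annotated regulatory intervals (eg, enhancers and promoters). The tables and
# region names are hypothetical; the point is only that the gene unit of WES
# analysis is replaced by annotated intervals.
import pandas as pd

variants = pd.DataFrame({            # stand-in for a parsed VCF
    "chrom": ["1", "1", "1"],
    "pos": [1_000_150, 1_000_900, 2_500_010],
    "variant_id": ["rsA", "rsB", "rsC"],
})
regions = pd.DataFrame({             # stand-in for enhancer/promoter intervals
    "chrom": ["1", "1"],
    "start": [1_000_000, 2_500_000],
    "end": [1_001_000, 2_501_000],
    "region_id": ["enhancer_1", "promoter_7"],
})

# Assign each variant to the interval(s) that contain it, then group by region.
merged = variants.merge(regions, on="chrom")
merged = merged[(merged["pos"] >= merged["start"]) & (merged["pos"] < merged["end"])]
units = merged.groupby("region_id")["variant_id"].apply(list)
print(units)
```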

As the data deluge from WGS technologies continues, new software tools capable of handling massive data volumes, high dimensionality and sophisticated statistical and computational analyses will be needed. In the past, researchers developed robust, accessible and user-friendly software tools for previous generations of genetic studies (eg, the GWAS and WES eras).31, 32 A crucial component of success moving forward will be the development of new computational tools and paradigms (eg, cloud-based computing) that scale with the size and complexity of WGS data from large-scale studies.

Finally, and perhaps most importantly, data sharing among researchers will be critical for future WGS studies. To obtain the sample sizes needed for well-powered analyses, the sharing of data and/or of summary statistics for meta-analysis is essential. The scientific networks that permit data sharing also lead to increased sharing of expertise, a priceless commodity in the field of human genetics, where the goals of biologists, statisticians, computer scientists and clinical practitioners align.