The contribution of rare and low-frequency variants to human traits is largely unexplored. Here we describe insights from sequencing whole genomes (low read depth, 7×) or exomes (high read depth, 80×) of nearly 10,000 individuals from population-based and disease collections. In extensively phenotyped cohorts we characterize over 24 million novel sequence variants, generate a highly accurate imputation reference panel and identify novel alleles associated with levels of triglycerides (APOB), adiponectin (ADIPOQ) and low-density lipoprotein cholesterol (LDLR and RGAG1) from single-marker and rare variant aggregation tests. We describe population structure and functional annotation of rare and low-frequency variants, use the data to estimate the benefits of sequencing for association studies, and summarize lessons from disease-specific collections. Finally, we make available an extensive resource, including individual-level genetic and phenotypic data and web-based tools to facilitate the exploration of association results.
Assessment of the contribution of rare genetic variation to many human traits is still largely incomplete. In common and complex diseases, a lack of empirical data has to date hampered the systematic assessment of the contribution of rare and low-frequency genetic variants (defined throughout this paper as minor allele frequency (MAF) <1% and 1–5%, respectively). Rare variants are incompletely represented in genome-wide association (GWA) studies1 and custom genotyping arrays2,3, and impute poorly with current reference panels. Rare and low-frequency variants also tend to be population- or sample-specific, requiring direct ascertainment through resequencing4,5. Recent exome-wide resequencing studies have begun to explore the contribution of rare coding variants to complex traits6, but comparatively little is known of the non-coding part of the genome where most complex trait-associated loci lie7. At the other end of the human disease spectrum, the widespread application of exome-wide sequencing is accelerating the rate at which genes and variants causal for rare diseases are being identified. Despite this, many Mendelian diseases still lack a genetic diagnosis and the penetrance of apparently disease-causing loci remains inadequately assessed.
The UK10K project was designed to characterize rare and low-frequency variation in the UK population, and study its contribution to a broad spectrum of biomedically relevant quantitative traits and diseases with different predicted genetic architectures. Here we describe the data and initial findings generated by the different arms of the UK10K project. In addition to this paper, UK10K companion papers describe the utility of this resource for imputation8, association discovery for bone mineral density9, thyroid function10 and circulating lipid levels11 and provide access to the study results through novel web tools12.
Study designs in the UK10K project
The UK10K project includes two main project arms (Table 1). The UK10K-cohorts arm aimed to assess the contribution of genome-wide genetic variation to a range of quantitative traits in 3,781 healthy individuals from two intensively studied British cohorts of European ancestry, namely the Avon Longitudinal Study of Parents and Children (ALSPAC)13 and TwinsUK14. A low read depth (average 7×) whole-genome sequencing (WGS) strategy was employed in order to maximize total variation detected for a given total sequence quantity15 while allowing interrogation of noncoding variation. Sixty-four different phenotypes were analysed, including traits of primary clinical relevance in 11 major phenotypic groups (obesity, diabetes, cardiovascular and blood biochemistry, blood pressure, dynamic measurements of ageing, birth, heart, lung, liver and renal function; Supplementary Table 1). Of these, 31 phenotypes were available in both studies (referred to as ‘core’ and reported in association analyses), 18 were unique to TwinsUK and 15 were unique to ALSPAC.
The UK10K-exomes arm aimed to identify causal mutations through high read depth (mean ∼80× across studies) whole-exome sequencing of approximately 6,000 individuals from three different collections: rare disease, severe obesity and neurodevelopmental disorders. The disorders studied in the UK10K-exomes arm have been shown to have a substantial genetic component at least partially driven by very rare, highly penetrant coding mutations. The rare disease collection includes 125 patients and family members in each of eight rare disease areas (Table 1). Disease types were selected with different degrees of locus heterogeneity, prior evidence for monogenic causation and likely modes of inheritance (for example, dominant or recessive). The obesity collection comprises samples with severe obesity phenotypes, including approximately 1,000 subjects from the Severe Childhood Onset Obesity Project (SCOOP)16, plus severely obese adults from several population cohorts. The neurodevelopmental collection comprises ∼3,000 individuals selected to study two related neuropsychiatric disorders (autism spectrum disorder and schizophrenia).
Discovery of 24 million novel genetic variants
In total, 3,781 individuals were successfully whole-genome sequenced in the UK10K-cohorts arm. After conservative quality control filtering (Extended Data Figs 1 and 2 and Supplementary Table 2), the final call set contained over 42M single nucleotide variants (SNVs, 34.2M rare and 2.2M low-frequency), ∼3.5M insertion/deletion polymorphisms (INDELs; 2,291,553 rare and 415,735 low-frequency) and 18,739 large deletions (median size 3.7 kilobase). Each individual on average contained 3,222,597 SNVs (5,073 private), 705,684 INDELs (295 private) and 215 large deletions (less than 1 private). Of 18,903 analysed protein-coding genes, 576 genes contained at least one homozygous or compound heterozygous variant predicted to result in the loss of function of a protein (LoF, Supplementary Information, 14,516 variants in total). As previously shown5,17, variants predicted to have the greatest phenotypic impact (LoF and missense variants, and variants mapping to conserved regions), were depleted at the common end of the derived allele spectrum (Extended Data Fig. 3). There were 495 homozygous LoF variants, a subset of which associated with phenotypic outliers (Supplementary Table 3).
We assessed sequence data quality by comparison with an exome sequencing data set (WES, ∼50× coverage)18 and in 22 pairs of monozygotic twins (Extended Data Fig. 1). The non-reference discordance (NRD, or the fraction of discordant genotypes for non-reference homozygous or heterozygous alleles) was 0.6% for common variants and 3.2% (range 0.1–3.3%; Extended Data Fig. 1) for low-frequency and rare variants. False discovery rates (FDR) were comparable between newly discovered sites and sites previously reported in the 1000 Genomes Project phase 1 (1000GP) data set5.
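As an illustration of the NRD metric, the following minimal sketch compares two genotype vectors for the same sample. It assumes genotypes are coded as alternate-allele counts (0, 1, 2), that sites where both call sets are homozygous reference are excluded from the denominator, and that missing calls are skipped; the function name and toy genotypes are hypothetical and are not taken from the UK10K pipeline.

```python
from typing import Optional, Sequence


def non_reference_discordance(
    calls_a: Sequence[Optional[int]],
    calls_b: Sequence[Optional[int]],
) -> float:
    """NRD between two genotype vectors coded as alternate-allele counts.

    Assumptions of this sketch: None marks a missing call and is skipped;
    sites where both call sets are homozygous reference (0, 0) do not
    count towards the denominator.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("genotype vectors must have the same length")
    compared = 0
    discordant = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # missing in at least one call set
        if a == 0 and b == 0:
            continue  # concordant homozygous reference: not informative
        compared += 1
        if a != b:
            discordant += 1
    if compared == 0:
        raise ValueError("no non-reference genotypes to compare")
    return discordant / compared


# Hypothetical calls for one individual at eight sites.
wgs_calls = [0, 1, 2, 1, 0, None, 2, 0]
wes_calls = [0, 1, 2, 2, 1, 1, 2, 0]
nrd = non_reference_discordance(wgs_calls, wes_calls)
print(f"NRD = {nrd:.2f}")  # 2 discordant out of 5 compared sites
```

Excluding concordant homozygous-reference sites keeps the metric from being dominated by the large number of sites at which most individuals carry no alternate allele.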
When compared to two large-scale European sequencing repositories, 1000GP and the Genome of the Netherlands (GoNL, 12× read depth19), UK10K-cohorts discovered over 24M novel SNVs. Overall, 96.5% of variants with MAF > 1% were shared, reflecting a common reservoir within Europe (Fig. 1 and Extended Data Fig. 2). Conversely, 94.7% of singleton (allele count (AC) = 1) and 55.0% of rare (AC > 1 and MAF < 1%) SNVs were study-specific. In a similar comparison, 64.4% (AC = 1) and 15.8% of variants (AC > 1 and MAF < 1%) found in GoNL were found to be study-specific compared to 1.2% of variants above 1% MAF.
This deeper characterization of European genetic and haplotype diversity will benefit future studies by creating a novel genotype imputation panel with substantially increased coverage and accuracy compared to the 1000GP reference panel8 (see ref. 9 and the next section for its application). It further informs a detailed empirical assessment of the geographical structure of rare variation in the UK where we detected geographical structure for very rare alleles (AC = 2–7) in Northern and Western UK regions, although this did not show evidence of substantial correlation with variation in phenotype (Box 1).
Findings from single-marker association tests
A main aim of the UK10K-cohorts project was to assess associations of low-frequency and rare variants under different analytical strategies (Fig. 2). We used a unified analysis strategy for the parallel evaluation of all quantitative traits (Supplementary Information, Supplementary Table 4). Here we describe results for the 31 core traits shared in ALSPAC and TwinsUK, with other results reported elsewhere12.
We first carried out single-marker association tests, as in standard genome-wide association studies of common variants20. Assuming an additive genetic model, we used standard approaches to model relationships between standardised traits, residualized for relevant covariates, and allele dosages of 13,074,236 SNVs, 1,122,542 biallelic INDELs (MAF ≥ 0.1%) and 18,739 large deletions in whole-genome sequenced samples (‘WGS sample’). We further assessed associations in an independent study sample of genome-wide genotyped individuals (‘GWA’ sample) including up to 6,557 ALSPAC and 2,575 TwinsUK participants who were not part of UK10K (actual numbers per trait are given in Supplementary Table 1). In the GWA sample, genotypes were imputed from genome-wide single nucleotide polymorphism (SNP) data using the UK10K haplotype reference panel, described in a companion manuscript8. The combined WGS+GWA sample had 80% power to detect associations of low-frequency and rare SNVs down to MAF ∼0.5%, for a per-allele trait change (the regression beta coefficient or Beta) of ∼1.2 standard deviations or greater (Fig. 3). To combine WGS and GWA data we carried out a fixed effect meta-analysis using the inverse variance method, which showed no evidence of inflation of summary statistics at the traits investigated (GC lambda ≈ 1). We used a conservative stepwise procedure for reporting loci from single-variant analysis (Supplementary Table 5), and we discuss elsewhere replication and technical validation of associations of rare variants not supported in the combined WGS+GWA sample (Supplementary Information, Supplementary Table 6).
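The inverse-variance fixed-effect combination of the WGS and GWA estimates can be sketched as follows. This is a generic textbook formulation rather than the project's own code; the function name and the per-study effect sizes and standard errors are hypothetical, and the P value uses a large-sample normal approximation.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(
    betas: Sequence[float], ses: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Inverse-variance weighted fixed-effect meta-analysis.

    Each study contributes its per-allele effect (beta) weighted by
    1 / se^2. Returns (pooled beta, pooled se, z score, two-sided P).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided normal tail; erfc stays accurate for very small P values.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value


# Hypothetical estimates for one variant: (WGS sample, GWA sample).
beta, se, z, p = fixed_effect_meta(betas=[0.30, 0.20], ses=[0.10, 0.05])
print(f"beta = {beta:.3f}, se = {se:.4f}, z = {z:.2f}, P = {p:.2e}")
```

Because the weights are the reciprocal variances, the larger and more precisely estimated sample dominates the pooled effect, which is the intended behaviour when an imputed GWA sample is several times larger than the sequenced one.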
Overall, across the 31 traits, 27 independent loci reached our experiment-wide significance threshold21 (P value ≤ 4.62 × 10⁻¹⁰) in the combined WGS+GWA sample (Fig. 3 and Supplementary Table 5). Two associations have been newly discovered by this project, and were conditionally independent of other variants previously reported at the same loci. The first was a low-frequency intronic variant in ADIPOQ associated with decreased adiponectin levels (rs74577862-A, effect allele frequency (EAF) = 2.6%, P value = 3.04 × 10⁻⁶⁴). The second was a rare splice variant (rs138326449) in APOC3 described in advance of this manuscript11,22,23. The remaining 25 loci reaching experiment-wide significance in the combined WGS+GWA sample included common, low-frequency and rare variants tagging known associations with adiponectin levels (CDH13 and ADIPOQ), lipid traits (APOB, APOC3-APOA1, APOE, CETP, LIPC, LPL, PCSK9, SORT1-PSRC1-CELSR2), C-reactive protein (LEPR), haemoglobin levels (HFE) and fasting glycaemic traits (G6PC2-ABCB11, Supplementary Table 5). In contrast to previous projections24, from this analysis of a wide range of biomedical traits there was no evidence of low-frequency alleles with large effects upon traits (Fig. 3)25, with classical lipid alleles identifying extremes of single-variant genetic contributions for these traits. This suggests that few, if any, low-frequency variants with stronger effects than those we see are likely to be detected in the general European population for the wide range of traits that we considered.
Increasing sample size may identify additional moderate effect variants, or variants with rarer frequency. We therefore sought to assess the extent to which the more accurate imputation offered by the UK10K reference panel, applied to larger study samples, could discover additional associations. A restricted maximum likelihood (REML)26 analysis suggested that using the UK10K data could increase the estimated variance explained, compared to the sparser HapMap2, HapMap3 and 1000GP data sets (Extended Data Table 1). We tested four lipid traits (high-density and low-density lipoprotein cholesterol, total cholesterol and triglycerides) in up to 22,082 additional samples from 14 cohorts imputed to the combined UK10K+1000GP phase I panel (Supplementary Table 7).
This effort identified two novel associations with low-density lipoprotein cholesterol (Fig. 3, Supplementary Table 8), which we further replicated in an independent imputation data set of 15,586 samples from 8 cohorts and through genotyping in 95,067 samples from the Copenhagen General Population Study (CGPS27). The first was a rare intronic variant in LDLR (rs72658867-A, c.2140+5G>A; EAF = 0.01, combined sample P value = 1.27 × 10⁻⁴⁶); per allele effect Beta (s.e.m.) = −0.23 mmol l⁻¹ (0.02), P value = 7.63 × 10⁻³⁰ (CGPS, n = 95,079). The second was a common, X-linked variant near RGAG1 (rs5985471-T, EAF = 0.403, P value = 1.53 × 10⁻¹²); per allele effect Beta (s.e.m.) = −0.02 mmol l⁻¹ (0.004), P value = 1.8 × 10⁻⁵ (CGPS, n = 93,639). The LDLR variant was previously classified to be of uncertain impact in ClinVar, and reported to have no effect on plasma cholesterol levels in a small sample of familial hypercholesterolaemia patients28. The LDLR-A allele is almost perfectly imputed in our sample (info = 0.96), but absent in previous imputation panels29; the RGAG1-T allele is common but was missed in previous studies, which focused predominantly on autosomal variation29. Within CGPS, these variants were weakly associated with ischaemic heart disease (odds ratio (OR) = 0.77 (0.66, 0.92), P = 0.003 for rs72658867; 0.96 (0.94, 0.99), P = 0.005 for rs5985471) and rs72658867 with myocardial infarction (OR = 0.65 (0.49, 0.87), P = 0.003; Supplementary Table 8). These results demonstrate the value of our expanded haplotype reference panel for discovery of trait associations driven by low-frequency and rare variants, as also shown in refs 9, 10.
Findings from rare variant association tests
Single-marker association tests are typically underpowered for rare variants30. Many questions remain regarding the optimal choice of test, owing to the unknown allelic architecture of rare variant contribution to traits, in particular outside protein-coding regions. We first evaluated associations by considering genes (GENCODE v15) as functional units of analysis using three separate variant selection strategies. Naive tests considered all variants in exons, untranslated regions (UTRs) and essential splice sites, weighted equally. Functional tests considered missense and LoF variants, the latter defined as being predicted to cause essential splice site changes, stop codon gains or frameshifts. Loss-of-function tests considered LoF variants only. For each scenario we applied two separate statistical models with different properties, sequence kernel association tests (SKAT) and burden tests implemented in SKAT and SKAT-O31,32, to rare variants (MAF < 1%).
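To make the idea of a burden test concrete, the sketch below collapses the rare variants of one region into a single per-individual score and regresses the trait on that score. It is a deliberately simplified stand-in, not the SKAT or SKAT-O implementation used in the study: it assumes genotypes are coded as minor-allele counts, weights all variants equally, ignores covariates and relatedness, and uses a normal approximation for the P value. The function name and toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def burden_test(
    genotypes: Sequence[Sequence[int]],
    trait: Sequence[float],
    maf_max: float = 0.01,
) -> Tuple[List[int], float, float, float]:
    """Equal-weight burden test for one region.

    genotypes: one row per individual, one minor-allele count (0/1/2)
               per variant in the region (coding is an assumption).
    trait:     standardised, covariate-adjusted trait per individual.
    Returns (indices of variants kept, beta, standard error, two-sided P).
    """
    n = len(genotypes)
    n_variants = len(genotypes[0])
    # Keep polymorphic variants below the rare-variant MAF threshold.
    kept = []
    for j in range(n_variants):
        maf = sum(row[j] for row in genotypes) / (2.0 * n)
        if 0.0 < maf < maf_max:
            kept.append(j)
    # Burden score: number of rare minor alleles carried in the region.
    burden = [sum(row[j] for j in kept) for row in genotypes]

    # Simple linear regression of trait on burden score.
    mean_x = sum(burden) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in burden)
    if sxx == 0.0:
        raise ValueError("no rare-allele carriers in this region")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(burden, trait))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(burden, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2.0))
    return kept, beta, se, p_value


# Hypothetical region: 200 individuals, two rare variants, one common.
n = 200
genotypes = [[0, 0, 1 if i % 4 == 0 else 0] for i in range(n)]
genotypes[0][0] = 1
genotypes[1][0] = 1
genotypes[2][1] = 1
trait = [((i * 37) % 11 - 5) / 5.0 for i in range(n)]
for carrier in (0, 1, 2):
    trait[carrier] += 3.0  # carriers are shifted upwards in this toy data

kept, beta, se, p = burden_test(genotypes, trait)
print(kept, round(beta, 2), f"{p:.1e}")
```

A burden score of this kind is most powerful when the rare alleles in a region shift the trait in the same direction; kernel-based tests such as SKAT are designed to remain informative when effect directions are mixed.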
Overall, there was an excess of test statistics with P values ≤10⁻⁴ for functional and loss-of-function tests (Extended Data Figs 4 and 5), with a total of 9, 70 and 196 genes associated with the 31 core traits with the LoF, functional and naive tests, respectively (Supplementary Table 9). A signal driven by loss-of-function variants in the APOB gene (encoding apolipoprotein B) achieved our threshold for experiment-wide significance (P value ≤1.97 × 10⁻⁷), in a burden-type test (min P value for TG = 7.02 × 10⁻⁹). Overall, 3 singleton LoF variants were responsible for this signal, of which two were not previously reported (rs141422999 and Chr2:21260958). Examples of novel rare variants in complex trait-associated loci (for example, G6PC2 associated with fasting glucose) were also seen for genes reaching suggestive levels of association (P value ≤10⁻⁴). Lastly, we tested the value of a genome-wide naive approach to explore associations outside protein-coding genes by combining variants across ∼1.8 million genome-wide tiled windows of 3 kb in size (median 37 SNVs per window, MAF < 1%, assigning an equal weight to all variants in the window). Overall association statistics appeared underpowered to detect true signals, apart from an association signal for adiponectin driven by a known rare intronic variant at the CDH13 locus (rs12051272, EAF = 0.09%, P value = 6.52 × 10⁻¹²; Supplementary Table 10)33,34. As previously shown for single-variant tests, in this study adiponectin and lipid traits yielded the greatest evidence for associations for region-based tests.
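The window-based grouping of rare variants can be illustrated with a short sketch. It assumes 1-based positions on a single chromosome and non-overlapping 3 kb windows; the text does not state how the windows were stepped, so the non-overlapping tiling, the function name and the toy variants are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tile_rare_variants(
    variants: Iterable[Tuple[int, float]],
    window_bp: int = 3000,
    maf_max: float = 0.01,
) -> Dict[int, List[int]]:
    """Assign rare variants to fixed-size windows on one chromosome.

    variants: (1-based position, MAF) pairs.
    Returns a mapping from window start (1-based) to the positions of the
    rare variants inside it. Windows do not overlap in this sketch.
    """
    windows: Dict[int, List[int]] = defaultdict(list)
    for position, maf in variants:
        if maf >= maf_max:
            continue  # only rare variants enter the region-based test
        start = ((position - 1) // window_bp) * window_bp + 1
        windows[start].append(position)
    return dict(windows)


# Hypothetical variants: (position, MAF).
toy_variants = [
    (100, 0.004),
    (2999, 0.002),
    (3000, 0.003),
    (3001, 0.200),  # common, excluded
    (4500, 0.001),
    (6500, 0.0005),
]
tiled = tile_rare_variants(toy_variants)
print(tiled)
```

Each resulting list of variants would then be passed to a region-based test with equal weights, in the same way as the gene-based units.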
Informing studies of low-frequency and rare variants
The UK10K-cohorts data allow an empirical evaluation of the relative importance of increasing sample size, genotyping accuracy or variant coverage for increasing power of genetic discoveries across the allele frequency spectrum. In a companion paper8 we show that common variants are exhaustively and accurately imputed using current haplotype reference panels, so increasing sample size is likely to be the single most beneficial approach for discovering novel loci driven by common variants. We further show that the UK10K haplotype reference panel, with tenfold more European samples compared to 1000GP, yields substantial improvements in imputation accuracy and coverage for low-frequency and rare variants. To obtain realistic estimates of the power benefit due to imputation with 1000GP+UK10K compared to 1000GP alone, we averaged the smallest value of Beta (the magnitude of a per-allele effect measured in standard deviations) detectable at 80% power, across variants imputable from both reference panels on chromosome 20. Fig. 4a shows sizable reductions in the magnitude of the effect sizes that can be identified at any sample size through use of the UK10K reference panel, compared to the 1000GP panel alone. For instance, for a variant of MAF = 0.3% we have equivalent power when imputing from UK10K+1000GP into a sample of 3,621 individuals as we have when using the 1000GP imputation panel alone with 10,000 samples.
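The relationship between sample size, imputation accuracy and the smallest detectable effect can be sketched with a standard large-sample approximation. The sketch assumes an additive model on a standardised trait, a small effect (residual variance close to 1), and that imperfect imputation acts as an effective sample size of n × r²; these are textbook simplifications rather than the calculation behind Fig. 4a, and the r² values used below are hypothetical.

```python
import math
from statistics import NormalDist


def min_detectable_beta(
    n: int,
    maf: float,
    r2: float = 1.0,
    alpha: float = 5e-8,
    power: float = 0.8,
) -> float:
    """Smallest per-allele effect (in s.d. units) detectable at given power.

    Uses a two-sided z-test approximation in which the information about
    the effect is n * r2 * 2 * maf * (1 - maf).
    """
    standard_normal = NormalDist()
    z_alpha = -standard_normal.inv_cdf(alpha / 2.0)
    z_power = standard_normal.inv_cdf(power)
    information = n * r2 * 2.0 * maf * (1.0 - maf)
    return (z_alpha + z_power) / math.sqrt(information)


ALPHA = 4.62e-10  # experiment-wide threshold quoted in the text

# Hypothetical imputation r2 values chosen so that n * r2 is equal.
beta_dense_panel = min_detectable_beta(3621, maf=0.003, r2=0.8, alpha=ALPHA)
beta_sparse_panel = min_detectable_beta(
    10000, maf=0.003, r2=0.8 * 0.3621, alpha=ALPHA
)
print(round(beta_dense_panel, 3), round(beta_sparse_panel, 3))
```

Under this simplified model, two designs have equal power whenever n × r² is equal, so equivalence between 3,621 and 10,000 samples would correspond to roughly a 2.8-fold difference in imputation r². That ratio is a property of the sketch, not a figure reported in the text.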
Similar, although weaker, increases in power were seen for region-based tests of rare variants. Using the WGS autosome data from UK10K, we used simulation to introduce genotype errors into 220 randomly selected regions of 30 variants each. For each variant, errors were simulated to match the MAF and the observed r² values between imputation and sequencing, and between whole-exome and whole-genome sequencing (Supplementary Table 11). We modified the SKAT power calculator35 to estimate power both for the true genotypes in a region and the data containing error, and averaged results across the 220 regions (see Supplementary Information). Although absolute power in Fig. 4b is generally poor, we can also see demonstrable power improvements when data are better imputed or are directly sequenced (Fig. 4c).
Tests involving non-coding rare variants may further benefit from aggregation strategies driven by biological annotation that takes into consideration the context- and trait-specific impact of non-coding variation36,37,38. Exploiting the denser sequence ascertainment of the UK10K-cohorts, we developed a robust approach to quantify fold-enrichment statistics for different categories of non-coding variants compared to null sets matched for minor allele frequency, local linkage disequilibrium and gene density (Supplementary Information). We used this approach to assess the relative contribution of low-frequency and common variants to associations with five exemplar lipid measures (the study did not have sufficient signal for rarer variants). We considered twelve different functional annotation domains, five in or near protein-coding regions and seven main chromatin segmentation states, defined using data from a cell line informative for lipid traits (HepG2; Supplementary Table 12). Low-frequency variants in exonic regions displayed the strongest degree of enrichment (25-fold, compared to fivefold for common variants, Fig. 5), compatible with the effect of purifying selection39. Importantly, however, we showed nearly as strong levels of functional enrichment at both sets of variants for several non-coding domains (∼10- to 20-fold for transcription start sites, DNase I hotspots and 3′ UTRs of genes), confirming the important contribution of non-coding low-frequency alleles to phenotypic trait variance.
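A fold-enrichment statistic of the kind described above can be sketched as the ratio between the number of associated variants falling in an annotation and the average number seen in matched null sets. The sketch assumes the null sets have already been matched for minor allele frequency, local linkage disequilibrium and gene density upstream; the function name, the empirical P value with a +1 correction and the toy counts are illustrative assumptions, not the method detailed in the Supplementary Information.

```python
from typing import Sequence, Tuple


def fold_enrichment(
    observed: int, null_counts: Sequence[int]
) -> Tuple[float, float]:
    """Fold enrichment of an annotation against matched null sets.

    observed:    number of trait-associated variants in the annotation.
    null_counts: the same count in each matched null variant set.
    Returns (fold enrichment, one-sided empirical P value).
    """
    if not null_counts:
        raise ValueError("at least one null set is required")
    mean_null = sum(null_counts) / len(null_counts)
    if mean_null == 0:
        raise ValueError("null sets never overlap the annotation")
    fold = observed / mean_null
    # Empirical P: share of null sets at least as extreme as observed,
    # with a +1 correction so the estimate is never exactly zero.
    as_extreme = sum(1 for count in null_counts if count >= observed)
    p_value = (1 + as_extreme) / (1 + len(null_counts))
    return fold, p_value


# Hypothetical counts for one annotation and five matched null sets.
fold, p_emp = fold_enrichment(observed=50, null_counts=[2, 1, 3, 2, 2])
print(f"fold = {fold:.1f}, empirical P = {p_emp:.3f}")
```

With only a handful of null sets the empirical P value cannot fall below 1 / (number of null sets + 1), so in practice many matched sets would be needed to resolve strong enrichments.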
Findings from the exome arm of UK10K
In the UK10K-exomes arm studies (see Supplementary Table 13), 5,182 individuals passed sequencing quality control with an average read depth of 80× in the bait regions. We analysed variation discovered in 3,463 disease-affected, unrelated, European-ancestry samples (Supplementary Information). We discovered 842,646 SNVs (of which 1.6% were multiallelic) and 6,067 INDELs. Both variant types were dominated by very rare variants, with more than 60% observed in only one individual (Extended Data Fig. 6). When compared to European-American samples from the NHLBI Exome Sequencing Project (ESP)39, we found near-complete overlap at sites with MAF ≥ 1%: 99% of SNVs that are well covered by both projects and pass quality control are present in both data sets. By contrast, 72% of well-covered SNVs seen only once or twice in UK10K are present in ESP. To inform the functional annotation of these variants, we used the Illumina Body Map to determine if the frequency of LoF and functional variants changed when transcripts are selected based on their expression level (Extended Data Fig. 7). When only consequences from highly expressed transcripts and especially those highly expressed in all the Body Map tissues were considered, LoF and functional changes declined. This demonstrates that the choice of transcript can affect the predicted consequence, and this should be taken into account when annotating patient exomes.
The rare disease collection studied 1,000 exomes, or ∼125 from each of eight rare diseases. Thus far, 25 novel genetic causes have been identified for five of the eight diseases: ciliopathies (n = 14), neuromuscular disorders (n = 7), eye malformations (n = 2), congenital heart defects (n = 1) and intellectual disability (n = 1; Supplementary Table 14). Notably, there was marked variation in our ability to identify causal variants based on familial recurrence risk, with the primary factors appearing to be: (1) the proportion of patients with a monogenic cause, (2) the strength of prior information about the mode of inheritance (for example, dominant, recessive), and (3) the extent of prior knowledge of the relevant functional pathways. In contrast with our success identifying single-diagnostic variants in these rare diseases, our analysis of three complex diseases (obesity, autism spectrum disorder and schizophrenia) on their own did not yield replicating disease-associated loci. This is perhaps unsurprising given expected locus and allelic heterogeneity, and modest sample size40. We therefore engaged in a collaborative meta-analysis as part of the Autism Sequencing Consortium41 which identified 13 associated genes (FDR < 0.01), many of which have been previously shown to cause intellectual disability or developmental disorders. This suggests that rare variation in single genes can have a large role causing a subset of autism spectrum disorder, but these effects only become apparent when large numbers of individuals are studied.
We also used the UK10K-exomes sequence data to explore the occurrence of incidental findings. We focused on disease-specific genes identified in current guidelines for the analysis of exome/whole-genome data by the American College of Medical Genetics and Genomics (ACMG)42, and used objective criteria described in the Supplementary Information. We identified a total of 29 distinct reportable variants affecting a total of 2.3% of the UK10K cases considered in this analysis (42 out of 1,805 individuals), a number similar to previous estimates (2% estimate in adults of European ancestry43). The incidental findings were predominantly associated with cardiovascular disorders (Supplementary Table 15).
Two main challenges of reporting incidental findings from whole-exome surveys emerge. The need for clinical expertise, the difficulty of interpreting a fraction of variants, and the lack of completeness of the ClinVar database44 all highlighted the need to further consolidate knowledge from the community into freely accessible and more exhaustive databases. Furthermore, for some disorders, the frequency of carriers is likely to be too high compared to the disease frequency, despite our strict assessment criteria. This suggests that reported estimates of the penetrance of recognized variants for specific disorders are too high. Given these challenges, we suggest that, in the absence of additional evidence, scientific publications describing proposed penetrant associations for rare variants need to be complemented by accurate estimates of population frequencies.
In summary, we have generated a high-quality whole-genome sequence data repository including 24 million novel variants from nearly 4,000 European-ancestry individuals. We showed that the UK10K haplotype reference panel greatly increases accuracy and coverage of low-frequency and rare variants compared to existing panels such as the 1000GP phase 1 panel. We carried out a large-scale empirical exploration of association testing of common, low-frequency and rare genetic variants with a large variety of biomedically important quantitative traits. For each of the different association scenarios tested, we report first examples of novel alleles associated with lipid and adiponectin traits. This provides proof-of-principle evidence on the value of large-scale sequencing data for complex traits, while also indicating that there are few low-frequency large effect ‘quick wins’ that make substantial contributions to population trait variation and that can be discovered from sequencing studies of a few thousand individuals. Our power calculations, informed by the sequence data, provide realistic estimates of the benefit of sequencing versus imputation in future association studies. Finally, rare variation tests showed limited evidence for confounding owing to population stratification at the traits investigated, likely to be due to a weakening of historical patterns of population structure in the current general UK population45.
Overall, this effort has given us both new genomic tools12 and insights into the role of low-frequency and rare variation on human complex traits, and will inform strategies for future association studies. Our exploration of non-coding variants supports the need for incorporating functional genome information in association tests of rare variants outside protein-coding regions. Improved study power through larger numbers, and a better understanding of the observed heterogeneity in allelic architecture between different loci, are likely to provide the best route forward to describe the contribution of rare variants to phenotypic variance in health and disease, and for assessing their utility in healthcare.
Data access form is available at http://www.uk10k.org/data_access.html, raw and processed data files at https://www.ebi.ac.uk/ega/, imputation panel at https://www.ebi.ac.uk/ega/, UK10K Genome Browser at http://www.uk10k.org/dalliance.html, single-marker loci navigator at http://fathmm.biocompute.org.uk/UK10K_Browser/ and dynamic power calculator at http://fathmm.biocompute.org.uk/UK10K_Browser/Power.htm. All sequence and phenotype data were deposited to the European Genome-Phenome archive (EGA, https://www.ebi.ac.uk/ega/), with accession numbers EGAD00001000740, EGAD00001000789, EGAD00001000741, EGAD00001000790, EGAD00001000776, EGAD00001000433, EGAD00001000434, EGAD00001000435, EGAD00001000436, EGAD00001000613, EGAD00001000614, EGAD00001000437, EGAD00001000438, EGAD00001000615, EGAD00001000439, EGAD00001000440, EGAD00001000441, EGAD00001000442, EGAD00001000443, EGAD00001000430, EGAD00001000431, EGAD00001000432, EGAD00001000429, EGAD00001000413, EGAD00001000414, EGAD00001000415, EGAD00001000416, EGAD00001000417, EGAD00001000418, EGAD00001000419 and EGAD00001000420. A breakdown of studies is given in Supplementary Table 13. All study participants provided informed consent. Details of REC approvals are given in Supplementary Table 17.
This study makes use of data generated by the UK10K Consortium. The Wellcome Trust provided funding for UK10K (WT091310). Additional grant support and acknowledgements can be found in the Supplementary Information.
This file contains Supplementary Tables 1-19.