We report ∼17.6 million genetic variants from whole-genome sequencing of 2,120 Sardinians; 22% are absent from previous sequencing-based compilations and are enriched for predicted functional consequences. Furthermore, ∼76,000 variants common in our sample (frequency >5%) are rare elsewhere (<0.5% in the 1000 Genomes Project). We assessed the impact of these variants on circulating lipid levels and five inflammatory biomarkers. We observe 14 signals, including 2 major new loci, for lipid levels and 19 signals, including 2 new loci, for inflammatory markers. The new associations would have been missed in analyses based on 1000 Genomes Project data, underlining the advantages of large-scale sequencing in this founder population.
Studies of common genetic variants have provided entry points to analyze the mechanisms underlying many complex traits and diseases1,2,3,4. Extension of these studies to the large reservoir of rare and population-specific variants could accelerate the translation of genetic information into biological understanding but has not thus far been systematically applied5,6. Rare variants can be discovered and genotyped with rapidly improving DNA sequencing techniques, but designing studies in which enough copies of each variant can be observed to detect genetic associations is challenging5,6,7. Studies of families and founder populations, where variants that are rare or absent elsewhere can occur at moderate frequencies, help overcome these limitations8. Here we use genome sequencing in the Sardinian founder population to systematically assess the contribution of genetic variation to quantitative traits, using as examples the levels of blood lipids and inflammatory markers. Discovery of variants associated with these traits could further elucidate causal mechanisms and pathways for cardiovascular diseases and other complex disorders9,10,11. In addition to confirming signals from studies of common variants12,13, our results identify new genetic variants and associations that would be missed using sequence-based reference panels derived from more cosmopolitan populations.
Sequencing and rare variant yield
We generated whole-genome shotgun sequence data for 2,120 Sardinian individuals, either living in the Lanusei valley and participating in a cohort study of quantitative traits (the SardiNIA study14; 1,122 individuals; 52.8% female and average age of 49.4 years) or from across the island and participating in case-control studies of multiple sclerosis15 and type 1 diabetes16 (referred to here as the 'island-wide sample'; 998 individuals; 48.5% female and average age of 41.6 years). Among these individuals, we sequenced 1,190 parent-offspring pairs distributed across 695 nuclear families to facilitate the high-quality estimation of haplotypes and genotypes17. For each individual, we generated an average of 10.7 × 10⁹ mapped bases of high-quality sequence (∼4-fold coverage of the genome), corresponding to a total of 22.7 × 10¹² bases across all individuals. We implemented quality control, alignment, variant calling and genotyping protocols that efficiently handled a sample of this size18 (see URLs and Online Methods).
For each sequenced individual, we identified an average of 3.4 million variants (17.6 million variants overall; Table 1). To assess quality, we sequenced one parent-offspring trio to >65× coverage for each individual. Comparing our initial low-coverage analysis with the results of deep sequencing for this family trio, we estimate an average genotyping error rate of <0.7% at heterozygous sites. As expected19, this error rate was lower at sites with minor allele frequency (MAF) >5%, averaging 0.5%, and higher at sites with MAF <5%, averaging ∼2% (Supplementary Table 1). Comparing sequence and array genotyping results for 1,068 individuals, we estimate that we have discovered and genotyped >99% of the variants with frequency >0.5% in our sample and ∼70% of the variants with frequency <0.5% (Supplementary Table 2).
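The error-rate comparison described above can be illustrated with a minimal sketch. It assumes that "error rate at heterozygous sites" means the fraction of sites called heterozygous in the deep-coverage data at which the low-coverage call differs; the exact definition used in the study is given in the Supplementary material, and the function name and toy genotypes below are hypothetical.

```python
def het_discordance(truth, calls):
    """Fraction of truth-heterozygous sites where the low-coverage call differs.

    truth, calls: sequences of genotypes coded as alternate-allele counts
    (0, 1 or 2). None marks a missing low-coverage call and is skipped.
    ASSUMPTION: this definition is illustrative, not the study's exact metric.
    """
    compared = 0
    mismatched = 0
    for t, c in zip(truth, calls):
        if t != 1 or c is None:
            continue  # only heterozygous truth sites with a call are compared
        compared += 1
        if c != t:
            mismatched += 1
    if compared == 0:
        raise ValueError("no heterozygous sites to compare")
    return mismatched / compared


# Toy genotypes, invented for illustration only.
deep = [0, 1, 1, 2, 1, 1, 0, 1]       # deep-coverage (>65x) calls
low = [0, 1, 0, 2, 1, None, 0, 1]     # low-coverage calls at the same sites
print(het_discordance(deep, low))     # 1 mismatch out of 4 compared sites -> 0.25
```

Stratifying the same calculation by minor allele frequency would give the per-frequency-bin error rates quoted in the text.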
Of the 17.6 million variants discovered, 172,988 (0.98%) overlapped protein-coding sequences20 (Table 1). Of these variants, 84,312 were nonsynonymous coding changes, 2,504 were variants altering essential splice sites and 2,013 were nonsense variants. Consistent with the hypothesis that natural selection makes variants with strong biological impact more likely to be rare and/or geographically restricted, we observed that 59% of the nonsynonymous variants, 53% of the splice-site altering variants and 70% of the nonsense variants had frequency <0.5% (in comparison to 48% of variants across the genome). We also observed that 12% of the nonsynonymous variants, 22% of the splice-site variants and 22% of the nonsense variants were absent from previous sequencing studies (as compared to 22% of all variants, using dbSNP 142 and the Exome Aggregation Consortium (ExAC) databases (see URLs) as surrogates for the results of previous studies21).
Because of genetic drift and, to a lesser extent, natural selection following the settlement of Sardinia, many genetic variants that are rare elsewhere in Europe have now reached higher frequencies in Sardinia22,23. The consequences of this genetic differentiation include a relatively large fraction of population-specific low-frequency variants and long haplotypes shared among present-day carriers of these variants24. For example, 98% of the variants present at a frequency of ∼1.0% (and 99.7% of the variants present at a frequency of ∼5.0%) in a sample of ∼2,500 individuals from the UK are also present in Phase 1 of the 1000 Genomes Project25. By contrast, only 77% of the variants with a frequency of ∼1.0% (and 99.3% of the variants with a frequency of ∼5.0%) in our sample are present in Phase 1 of the 1000 Genomes Project25. Overall, we estimate that 76,286 variants that are very rare (frequency <0.5%) or absent in the 1000 Genomes Project Phase 3 panel reach frequencies >5% in our sample. We used a machine learning–based scoring algorithm to summarize the deleteriousness of each variant in a Combined Annotation-Dependent Depletion (CADD) score26. Coding variants that are unique to Sardinia appear to be significantly more deleterious than variants of the same frequency that are also observed in the 1000 Genomes Project Phase 3 panel (P = 0.02) (Supplementary Fig. 1).
The differentiation of allele frequencies in the Sardinian sample from those in other European populations is also evident in assessments using the FST differentiation statistic as well as in a principal-component analysis (PCA) of common variants27,28 (Supplementary Figs. 2 and 3). Whereas FST between non-Sardinian European populations in the POPRES reference sample averaged 0.001 (range of 0.000–0.004), FST between the island-wide sample of Sardinians and POPRES European populations averaged 0.006 (range of 0.003–0.010), and the difference was even greater between the Lanusei valley cohort and the POPRES European populations (database of Genotypes and Phenotypes (dbGaP), phs000145.v4.p2) (average of 0.009, range of 0.006–0.013) (Supplementary Fig. 2). The geographical structure is even more evident when considering less frequent alleles: for rare sites, allele sharing between mainland populations and Sardinians is particularly depressed relative to sharing within mainland populations (such as the 1000 Genomes Project CEU (Utah residents of Northern and Western European ancestry) and TSI (Tuscans from Italy) populations)29,30 (Fig. 1). The patterns of differentiation are again clear in the long identical haplotypes surrounding rare f2 variants (variants that are observed on exactly two chromosomes from distinct individuals)25,31 (Fig. 2). Of note, both Sardinian samples showed similar lengths for haplotypes flanking f2 variants that are shared with populations outside Sardinia, consistent with a common ancient demography. The greater relative isolation of the two Sardinian samples was evident when we examined the lengths of the haplotypes flanking the f2 variants present within each sample. For variants shared by individuals in the Lanusei valley, the flanking haplotypes averaged 3,570 kb, with the length dropping to 735 kb when first- and second-degree relatives were excluded from consideration. The haplotypes averaged 580 kb when the variant was shared by a Lanusei valley resident and an individual elsewhere in Sardinia; ∼382 kb when the variant was shared with a European sequenced in the 1000 Genomes Project Phase 3 panel; and ∼264 kb when the variant was shared with an individual elsewhere in the world in the full set of 1000 Genomes Project data (Fig. 2). These differences in haplotype length were less marked around variants with higher frequencies and hence shared by more than two heterozygous individuals (f3, f4, etc.). This was evident even when comparing samples from the Lanusei valley and elsewhere on the island (Supplementary Table 3).
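The definition of an f2 variant given above (an allele observed on exactly two chromosomes from distinct individuals) can be made concrete with a short sketch. The data layout, function name and toy haplotypes are hypothetical, and the sketch only identifies f2 variants and their carrier pairs; measuring the length of the shared flanking haplotype is a separate step not shown here.

```python
def find_f2_variants(haplotypes):
    """Return {variant_index: (individual_a, individual_b)} for f2 variants.

    haplotypes: dict mapping individual ID -> (hap1, hap2), where each
    haplotype is a sequence of 0/1 alleles (1 = alternate allele).
    A variant is f2 when the alternate allele occurs on exactly two
    chromosomes and those chromosomes belong to different individuals.
    """
    n_variants = len(next(iter(haplotypes.values()))[0])
    f2 = {}
    for v in range(n_variants):
        carriers = []
        for ind, (h1, h2) in haplotypes.items():
            # one entry per chromosome carrying the alternate allele
            carriers.extend([ind] * (h1[v] + h2[v]))
        if len(carriers) == 2 and carriers[0] != carriers[1]:
            f2[v] = (carriers[0], carriers[1])
    return f2


# Toy phased data, invented for illustration only.
toy = {
    "A": ([1, 0, 1, 0], [0, 0, 1, 0]),
    "B": ([1, 0, 0, 0], [0, 1, 0, 0]),
    "C": ([0, 0, 0, 0], [0, 1, 0, 1]),
}
# Variant 2 is excluded (both copies in individual A); variant 3 is a singleton.
print(find_f2_variants(toy))  # {0: ('A', 'B'), 1: ('B', 'C')}
```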
Relatedness and imputation for the Lanusei valley samples
Participants in the SardiNIA study all live in four towns in the Lanusei valley. The population in this region is relatively stable: all four grandparents were born in the Lanusei valley for at least three-quarters of the study participants14. A total of 6,602 individuals from the SardiNIA study were genotyped with 4 Illumina arrays (OmniExpress, ExomeChip, Metabochip and Immunochip), providing a scaffold of 890,542 unique SNPs across the genome. Because participants share long stretches of DNA, the genetic information obtained for any individual can be propagated ('imputed') to close relatives genotyped with the scaffold of markers32,33. To increase the power of genetic association analyses and to sample genetic diversity in the valley, we sequenced individuals distributed across different families (Supplementary Table 4). We then searched for chromosome stretches shared by the sequenced individuals and the remaining study participants, which allowed us to impute both common and rare variants exceedingly well. The imputation accuracy, measured as the squared correlation between the imputed and laboratory-generated genotypes, was r2 = 0.98 for variants with frequency >5% and 0.89 for variants with frequency 0.5–1.0% (Supplementary Fig. 4). This accuracy improved markedly in comparison to the imputation results based on the 1000 Genomes Project Phase 3 panel, which includes individuals representing genetic diversity across Europe and elsewhere in the world (r2 = 0.92 and 0.62 for variants with MAF >5% and 0.5–1.0%, respectively; Supplementary Fig. 4). The shared stretches of chromosome used to fill in missing data for each SardiNIA individual originated in other individuals from the valley ∼87% of the time and also strongly correlated with the number of their grandparents born in the area (r2 = 0.67; Supplementary Fig. 5).
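The accuracy measure used above, the squared correlation between imputed and laboratory-generated genotypes, can be sketched as follows. This is a minimal per-variant illustration that assumes imputed genotypes are expressed as expected alternate-allele dosages between 0 and 2; the function name and toy values are hypothetical.

```python
def squared_correlation(imputed, observed):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(imputed)
    if n != len(observed) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(imputed) / n
    mean_y = sum(observed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, observed))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in observed)
    if sxx == 0 or syy == 0:
        raise ValueError("r2 is undefined for a monomorphic site")
    return (sxy * sxy) / (sxx * syy)


# Toy values for one variant, invented for illustration only.
dosages = [0.1, 0.9, 1.8, 0.0, 1.1]   # imputed expected allele counts
genotypes = [0, 1, 2, 0, 1]           # array or sequence genotypes
print(round(squared_correlation(dosages, genotypes), 3))
```

Averaging this quantity over variants within a frequency bin would give summary values of the kind reported for the >5% and 0.5–1.0% bins.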
Impact on association of lipid and inflammatory markers
We focused on the levels of four blood lipids (low-density lipoprotein (LDL) cholesterol, total cholesterol, triglycerides and high-density lipoprotein (HDL) cholesterol) to assess how sequence information might be used to discover the effects of population-specific and low-frequency variation on extensively studied traits (Supplementary Table 4)12. Imputing variants from the sequencing effort on the scaffold of genotyped SNPs expanded the spectrum of variants for association testing in the sample from the Lanusei valley to ∼13.6 million (selected with high imputation quality; Online Methods)34. Overall, we identified 14 variants, distributed across 11 loci, that were independently associated with lipid levels at the classical genome-wide significance threshold of P = 5 × 10⁻⁸, either in analyses including all individuals or in sex-restricted analyses including only males or only females (Table 2 and Supplementary Figs. 6 and 7). These included ten variants with moderate effect tagging signals in LIPC, SORT1, PCSK9, CILP2, CETP and APOA5 (one signal each) and LPL and APOE (two signals each), loci that have been extensively described in previous genome-wide association studies (GWAS) and other association studies. Other signals at known loci were detected at lower levels of significance (Supplementary Table 5). To declare new genome-wide signals, we used a threshold of P = 6.9 × 10⁻⁹, which was calculated by empirically estimating the number of independent tests in a Sardinian genome (Online Methods and Supplementary Table 6).
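The relationship between the two thresholds can be illustrated with a short calculation. It assumes that the study-specific threshold is a Bonferroni-style correction, that is, a family-wise error rate of 0.05 divided by the estimated number of independent tests; the text states only that the number of independent tests was estimated empirically, so both the 0.05 level and the function names here are assumptions.

```python
ALPHA = 0.05  # ASSUMPTION: family-wise error rate, not stated in the text


def bonferroni_threshold(n_independent_tests, alpha=ALPHA):
    """Per-test significance threshold for a given number of independent tests."""
    return alpha / n_independent_tests


def implied_independent_tests(threshold, alpha=ALPHA):
    """Number of independent tests implied by a per-test threshold."""
    return alpha / threshold


print(f"{implied_independent_tests(5e-8):,.0f}")    # classical GWAS threshold
print(f"{implied_independent_tests(6.9e-9):,.0f}")  # study-specific threshold
```

Under these assumptions, the classical threshold corresponds to one million independent tests and the stricter threshold to roughly 7.2 million, compared with the ∼13.6 million correlated variants actually tested.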
The results implicate three variants that are rare or absent elsewhere in the world and were missed in studies of European-ancestry samples that included >100,000 individuals12. We previously identified one of these loci through a Sanger sequencing–based effort35: the variant in LDLR (frequency = 0.5%) encoding p.Val578Ala (Supplementary Table 5) was associated with LDL cholesterol and total cholesterol levels and was independent of the known variant, rs73015013 (frequency = 14%; effect = –5.2 mg/dl; P = 6.4 × 10⁻⁸; r2 < 0.001). Here we report a new association for triglyceride levels with a missense variant in APOA5 (frequency = 3% in Sardinians; effect = –20.7 mg/dl; P = 1.2 × 10⁻¹²) (Table 2). The variant, encoding a p.Arg282Ser substitution, was genotyped and included on the ExomeChip array after it was discovered in our sequencing effort, and, thus far, it has been found on only 2 chromosomes in >30,000 Europeans characterized by the Exome Aggregation Consortium. Of note, this variant was the strongest variant modulating triglyceride levels in Sardinia, explaining almost 1% of the phenotypic variance, and was also independent of the known common variant in the locus, rs10750097 (frequency = 17%; effect = 11.9 mg/dl; P = 4.6 × 10⁻⁹; r2 = 0.002) (Table 2). These two examples illustrate the coexistence in the same locus of population-specific low-frequency variants and previously detected and independently associated common variants from cosmopolitan populations (Fig. 3). The third genetic variant was a stop-gain mutation in the HBB gene encoding p.Gln40*, better known as the β⁰39 mutation because the corresponding codon was numbered 39 before the last update of standard protein nomenclature. This variant illustrates how variants that are unusually frequent in Sardinia can provide insights about biology. In Sardinia, this mutation is the common cause of autosomal recessive β-thalassemia36. In our sample, in agreement with earlier epidemiological findings37,38, the heterozygous state was associated with 13.9 mg/dl lower LDL cholesterol levels (P = 1.2 × 10⁻²⁰) and 16.9 mg/dl lower total cholesterol levels (P = 1.2 × 10⁻²²). Of note, the analysis after 1000 Genomes Project Phase 3 imputation pointed only to an intergenic marker (rs76053862) 122 kb away from the β⁰39 variant corresponding to the second most associated SNP using the Sardinian reference panel, with a much weaker association signal (P = 1.4 × 10⁻¹³) (Fig. 3). Finally, two additional signals were observed for total cholesterol levels at rs115048493 near the TMEM33 and DCAF4L1 genes (P = 6.94 × 10⁻⁹) and for HDL cholesterol levels at rs8092903 near TGIF1 in females (P = 4.49 × 10⁻⁸) (Table 2 and Supplementary Table 7), although the biological bases for these associations are presently unclear. Because these signals do not reach our adjusted genome-wide significance threshold of P = 6.9 × 10⁻⁹, these findings remain tentative.
We were interested in determining whether analyses based on the 1000 Genomes Project and HapMap reference panels would also miss important loci for other traits. As a second example of a class of especially interesting traits, we focused on the levels of five inflammatory markers. In a previous study, assessing ∼2 million genotyped and HapMap-imputed SNPs in the SardiNIA cohort, we found 16 variants associated with at least one of the 4 inflammatory markers measured: interleukin-6 (IL-6) concentration, erythrocyte sedimentation rate (ESR), monocyte chemotactic protein-1 (MCP-1) levels and high-sensitivity C-reactive protein (hsCRP) levels13. A fifth inflammatory marker, adiponectin (ADPN), showed no significant association in our previous analyses (E.P., S.N., S.S., D.S. and F.C., unpublished data). Nevertheless, with the extended spectrum of variants assessed here, we identified another seven variants associated with MCP-1, hsCRP, ESR or ADPN, at the classical significance threshold of P = 5 × 10⁻⁸, with five variants in four previously undetected loci as well as two signals at coding variants in known loci (Table 3 and Supplementary Figs. 6–8). Among the newly identified signals, three remained significant even with the more stringent threshold of P = 6.9 × 10⁻⁹. In comparison to analyses based on HapMap or 1000 Genomes Project imputation, we also identified more strongly associated lead variants at three known loci (APOE, HBB and RHCE) (Table 3). These signals may point to causative variants, as supported by biological evidence, expression quantitative trait locus (eQTL) data and Encyclopedia of DNA Elements (ENCODE) annotation.
In detail, we found a striking new signal associated with both hsCRP (rs183233091, P = 1.1 × 10⁻²⁸) and ESR (12:125406340, P = 4.4 × 10⁻²³) on chromosome 12, in a stretch of rare variants encompassing several genes (Fig. 4). The lead variants were not the same but were partially in linkage disequilibrium (LD) (r2 = 0.19, D′ = 0.79), and the association with hsCRP disappeared when conditioning on the lead variant for ESR and vice versa. This implies that the two signals are likely due to the same variant(s), an inference that is also consistent with the biological correlation of these two traits. The rare alleles at the lead variants increased the levels of both inflammatory traits, with effects that appeared to be stronger in males (Supplementary Table 8). The extended associated region spanned 5.4 Mb and included 22 noncoding variants with association P value <1 × 10⁻¹⁵ (Supplementary Fig. 9). The majority, to our knowledge, are specific to Sardinians, as only ten were found in either the 1000 Genomes Project Phase 3 panel or the Genomes of the Netherlands (GoNL) project database39 (four with MAF 0.1–1% and the other six with MAF >1% in Europeans). The association of the latter 6 variants with hsCRP was tested for replication in 7,689 European individuals from 8 GWAS cohorts, but no signal was seen (Supplementary Table 9), whereas nominal association was detected in a subset of 3,505 southern European individuals for the top variant (one-tailed P value = 0.04). These results allow us to exclude these SNPs as causal and indicate that the association is instead primarily driven by a variant among those that are extremely rare or absent outside Sardinia.
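The two LD measures quoted above can diverge substantially when the allele frequencies at the two sites differ, which is typical of rare variants. The sketch below computes both from phased haplotypes using the standard definitions (D is the haplotype frequency minus the product of the allele frequencies; r2 normalizes D squared by the product of the four allele frequencies; D′ normalizes the absolute value of D by its maximum attainable value). The function name and toy haplotypes are hypothetical and are not the study data.

```python
def ld_stats(haplotypes):
    """Return (r2, d_prime) for two biallelic sites.

    haplotypes: iterable of (a, b) pairs, one per phased chromosome, where
    a and b are the 0/1 alleles at the first and second site.
    d_prime is returned as an absolute value between 0 and 1.
    """
    haps = list(haplotypes)
    n = len(haps)
    p_a = sum(a for a, _ in haps) / n
    p_b = sum(b for _, b in haps) / n
    p_ab = sum(1 for a, b in haps if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("LD is undefined when a site is monomorphic")
    r2 = d * d / denom
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max


# Toy example: a rarer allele (2/20) always found on the background of a
# more frequent allele (8/20). Invented for illustration only.
toy_haps = [(1, 1)] * 2 + [(0, 1)] * 6 + [(0, 0)] * 12
r2, d_prime = ld_stats(toy_haps)
print(round(r2, 3), round(d_prime, 3))
```

In this toy case D′ is at its maximum while r2 is low, the same qualitative pattern as the partial LD reported between the two lead variants.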
For hsCRP, we detected additional candidate signals. There was one signal near PDGFRL, a gene previously implicated in inflammatory and autoimmune processes40,41 (Supplementary Fig. 8), which we again failed to confirm in the replication sample set. Currently, there is no other evidence that this signal is genuine, and further studies will be required to assess it. Two additional new signals reached the classical significance threshold of P = 5 × 10⁻⁸ but not the more stringent threshold for new findings: one for ADPN at 13:108884835 near the ABHD13 gene (P = 3.3 × 10⁻⁸) and another for MCP-1 at rs76135610 (P = 1.8 × 10⁻⁸) in a region encompassing the CBLN1 and N4BP1 genes, which was associated in females only (Table 3 and Supplementary Fig. 8).
We uncovered two new independent variants for MCP-1 that result in non-conservative, likely functional amino acid changes. The p.Arg89Cys substitution (rs34599082) in DARC causes the FYB-weak phenotype of reduced antigen expression and decreased ability to bind chemokines42, and the p.Met249Lys substitution in the transmembrane domain of CCR2 is expected to affect molecular interactions and thereby alter the downstream signal transduction of bound ligand43.
Finally, better lead variants were found at three known loci. For hsCRP, the known association signal near the APOE gene was mapped to the known nonsynonymous causal variant, encoding p.Cys130Arg. This SNP has been associated with Alzheimer's disease and directly with C-reactive protein (CRP) levels, both by candidate gene studies and very recently by exome sequencing–based GWAS44,45; it also coincides with the independent signal for LDL cholesterol, linking lipid levels to inflammatory marker regulation. Two new lead variants were found for ESR. One again points to the mutation in HBB encoding p.Gln40* (Supplementary Fig. 8), consistent with the effect of this mutation on red blood cell counts (as shown above for LDL cholesterol), which are in turn inversely correlated with ESR values. This association is thus relevant when interpreting ESR values in these individuals. Finally, a previously reported association on chromosome 1 in an intron of the TMEM57 gene13,46 was refined to intron 3 of the nearby RHCE gene. This gene encodes the Rh blood group antigens, and ESR levels are higher in Rh-positive than in Rh-negative healthy adults, making RHCE a plausible candidate gene (Table 3). The lead SNP at this locus alters several regulatory motifs (according to ENCODE annotation at the UCSC Genome Browser; see URLs) and is strongly correlated (r2 = 0.80) with a nearby eQTL variant (rs11802413 in TMEM57) that affects expression of TMEM57 as well as RHCE in liver47.
We also performed gene-based rare variant tests using combined multivariate and collapsing (CMC) and variable threshold (VT) tests. Six loci passed the Bonferroni threshold of P = 5 × 10⁻⁶ for significance (Online Methods), but after conditional analysis only two were not driven by nearby associations detected in our single-variant GWAS analysis. A particularly strong association was observed between STAB1 and ADPN levels (P = 4.7 × 10⁻¹⁰), and another was observed between PTPRH and ESR (P = 8.3 × 10⁻⁷). These signals, however, were not further investigated (Table 4 and Supplementary Table 10), as information on these traits was not available in the replication cohorts.
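The idea behind collapsing-based gene tests can be sketched in a simplified form. The code below implements only a basic collapsing step (each individual is scored as carrier or non-carrier of any rare allele in a gene) followed by a Welch comparison of trait means with a normal-approximation P value. It is not the full CMC or VT procedure used in the study, which also handles common variants, covariates and relatedness; the MAF cut-off, function names and toy data are all hypothetical.

```python
import math
from statistics import mean, variance


def collapse_rare(genotypes, mafs, maf_cutoff=0.01):
    """Per-individual 0/1 indicator: carries any rare alternate allele.

    genotypes: list of rows (one per individual) of alternate-allele counts.
    mafs: minor allele frequency of each variant (same order as columns).
    maf_cutoff: ASSUMED rarity threshold, chosen here for illustration.
    """
    rare = [j for j, f in enumerate(mafs) if f < maf_cutoff]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


def welch_test(trait, carrier):
    """Welch t statistic and two-sided normal-approximation P value."""
    x = [t for t, c in zip(trait, carrier) if c == 1]
    y = [t for t, c in zip(trait, carrier) if c == 0]
    if len(x) < 2 or len(y) < 2:
        raise ValueError("need at least two carriers and two non-carriers")
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    t = (mean(x) - mean(y)) / se
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided, standard normal
    return t, p


# Toy gene with three variants and six individuals (invented values).
toy_genotypes = [[1, 0, 0], [0, 1, 0], [0, 2, 1], [0, 0, 0], [0, 1, 0], [1, 1, 0]]
toy_mafs = [0.004, 0.2, 0.008]            # the second variant is common
toy_trait = [1.9, 0.2, 2.3, -0.1, 0.4, 1.6]

carriers = collapse_rare(toy_genotypes, toy_mafs)
t_stat, p_value = welch_test(toy_trait, carriers)
print(carriers, round(t_stat, 2), p_value)
```

Pooling carriers in this way is what gives gene-based tests power when individual rare variants are observed too few times to be tested one at a time.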
All newly associated variants for both blood lipid levels and inflammatory markers were validated by Sanger sequencing (Online Methods and Supplementary Table 11). Using 1000 Genomes Project imputation, no other signals were identified and all the new signals were either mislocalized (as in the case of the signal corresponding to p.Gln40*, which pointed to other nearby variants) or completely missed (Figs. 3 and 4, Supplementary Fig. 8 and Supplementary Tables 12 and 13).
Further illustrating the high resolving power of the sequence-based association analyses, CADD assessment showed that all five new genome-wide signals as well as the two new independent signals had the highest CADD scores in their regions in comparison with signals in high or moderate LD (r2 >0.5), supporting their potential causative role in trait variation (Supplementary Tables 14 and 15). By contrast, only 6 signals among 23 at known loci for the lipids and inflammatory markers, typically driven by common variants, had top CADD scores, suggesting that the observation for the 7 new signals reflects advantages of studying rare or population-specific variation.
Finally, we used variance-component methods to estimate the combined contribution to lipid levels and inflammatory markers of all the variants we discovered by sequencing48. Together, the variants identified in our sequencing study and successfully imputed explain about half of the heritability for the traits under analysis, with the sole exception of hsCRP levels, for which they explain almost all of the observed trait heritability (Supplementary Fig. 10 and Supplementary Tables 16 and 17). The missing heritability that could not be explained by sequenced variants might be attributable to variants not assessed here, including very rare variants that were not discovered or were poorly imputed or structural variants that were not considered in the present study.
Our findings, besides elucidating at an unprecedentedly deep level of resolution the genetic structure and substructure of Sardinians, demonstrate the value of whole-genome sequencing–based association studies in this founder population. In Sardinia, variants that are extremely rare in the rest of the world can reach high enough frequencies to provide clear and, in some cases, unexpected biological insights49. For example, we found that the β⁰39 stop-gain mutation causing autosomal recessive β-thalassemia36 accounts for a large fraction of variability in LDL cholesterol levels in Sardinia, second only to the APOE variants. The variant is known to be associated with enhanced erythropoiesis (ref. 36 and the companion paper50), with heterozygous carriers having red blood cell counts 23% higher on average (P < 1 × 10⁻³⁰⁰). This provides a likely explanation for the decreased lipid levels in carriers: large amounts of cholesterol are required for the replenishment and regeneration of cell membranes and intracellular structures in circulating cells and their bone marrow precursors. Although this stop-gain mutation reaches a frequency of 5.0% in our sample, it is not included on standard genotyping arrays and cannot be easily imputed from the HapMap or 1000 Genomes Project panels because it is very rare outside Sardinia (1000 Genomes Project frequency <0.1%). Likewise, by cross-population exclusion mapping, we show that the new strong association signal with both hsCRP and ESR on chromosome 12 is most likely driven by a variant that is extremely rare or absent outside Sardinia.
Furthermore, coding variants that are unique to Sardinia appear to be significantly more deleterious than variants of the same frequency that are also observed in more cosmopolitan collections of samples (Supplementary Fig. 1). This suggests that part of the reservoir of variants that have drifted to higher frequencies in Sardinia and were lost or are extremely rare elsewhere could be especially informative for genetic association and functional studies. The results presented here show a few clear examples.
At the same time, our observations also illustrate the difficulties that will be encountered when attempting to replicate founder variant association results, with the new signals we identified being typically due to variants that are extremely rare or absent elsewhere in the world. In our view, when the variant is present in other populations, evidence for association there could be used to confirm the signal and lack of association could be used to exclude variants as being causal. However, when rare and founder variants are not shared, as will often be the case, confirming the validity of results will require accumulating additional samples in the population initially studied or may depend increasingly on additional criteria such as examination of association at other variants in the same genomic region or the use of more stringent significance levels. Our study demonstrates the benefits of combining high-throughput sequencing and genotyping technologies with imputation methods and customized study designs; we obtained high-quality information on the genomes of >6,000 individuals for an investment that, using conventional deep whole-genome sequencing strategies, would have allowed deep sequencing of only 160–180 genomes. This cost-effective approach increases power in genetic analysis (ref. 19 and the companion papers50,51) and creates the bases for larger research and personalized medicine programs.
To survey genetic variation across Sardinia, we selected individuals participating in the SardiNIA longitudinal study of aging14 or in case-control studies of multiple sclerosis15 and type 1 diabetes16. All participants gave informed consent to study protocols, which were approved by the Sardinian local research ethics committees (Comitato Etico di Azienda Sanitaria Locale 8, Lanusei (2009/0016600) and Comitato Etico di Azienda Sanitaria Locale 1, Sassari (2171/CE)) and by the NIH Office of Human Subject Research as governed by Italian institutional review board approval.
The SardiNIA project includes 6,921 individuals, representing >60% of the adult population of 4 villages in the Lanusei valley on Sardinia. Details of phenotype assessments for these samples have been published previously13,14. In particular, LDL cholesterol levels were estimated using the Friedewald formula. Individuals with triglyceride concentrations >400 mg/dl or those taking lipid-lowering medications were excluded from the LDL cholesterol analysis, and those on medication were also excluded from analyses of all lipid traits. Summary statistics for individuals considered for the GWAS analyses are reported in Supplementary Table 3.
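The LDL cholesterol estimate and exclusion rules described above can be sketched as follows. The Friedewald formula estimates LDL cholesterol as total cholesterol minus HDL cholesterol minus triglycerides divided by 5, with all values in mg/dl. The function name and the use of None to mark excluded individuals are choices made for this illustration, not the study's implementation.

```python
TG_LIMIT_MG_DL = 400.0  # exclusion cut-off quoted in the text


def friedewald_ldl(total_chol, hdl, triglycerides, on_lipid_medication=False):
    """Estimate LDL cholesterol (mg/dl) with the Friedewald formula.

    Returns None for individuals excluded from the LDL analysis
    (triglycerides above the cut-off or on lipid-lowering medication).
    """
    if on_lipid_medication or triglycerides > TG_LIMIT_MG_DL:
        return None
    return total_chol - hdl - triglycerides / 5.0


print(friedewald_ldl(200.0, 50.0, 150.0))   # 200 - 50 - 30 = 120.0
print(friedewald_ldl(240.0, 45.0, 450.0))   # excluded: triglycerides > 400
print(friedewald_ldl(210.0, 55.0, 120.0, on_lipid_medication=True))  # excluded
```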
When array genotype data are available, sequencing a subset of individuals in a family allows for missing genotypes to be imputed in the remaining individuals by tracking haplotype segregation through the family32,53. We used known family relationships among SardiNIA study participants and the ExomePicks program (see URLs) to prioritize individuals for sequencing. For each family, the program identifies subsets of individuals whose haplotypes can be estimated very accurately (for example, parent-offspring trios) and estimates the fraction of the genome for each additional family member that can be imputed using these haplotypes.
Our ongoing case-control studies of type 1 diabetes and multiple sclerosis include 10,106 individuals and 1,109 nuclear families, each with an affected child and 2 unaffected parents. Participants were recruited through regional clinics and hospitals distributed throughout Sardinia, with the majority of participants recruited in Cagliari (in the south of Sardinia) or Sassari (in the north). Again, we favored sequencing of parent-offspring trios to improve the accuracy of the resulting haplotypes17. Part of the sequencing data used in this study are available through dbGaP, under the SardiNIA Medical Sequencing Discovery Project, study accession phs000313.v3.p2.
All SardiNIA study samples were genotyped with four different Illumina Infinium arrays: one high-density array, OmniExpress, which surveyed common variation across the genome, and three low-density targeted arrays that provided improved coverage of regions associated with cardiovascular and metabolic disease on the Cardio-Metabochip54, immune disorders on the Immunochip55 and coding variation on the ExomeChip (see URLs). Genotyping was carried out according to the manufacturer's protocols at the SardiNIA Project Laboratory (Lanusei, Italy), at the Technological Centre–Porto Conte Ricerche (Alghero, Italy) and at the National Institute on Aging Intramural Research Program Laboratory (Baltimore, Maryland, USA). Genotypes were called using GenomeStudio (version 1.9.4) and refined using zCall (version 3)56. We applied standard per-sample quality control filters to remove samples with low call rates or where the reported relationships and/or sex disagreed with genetic data57. We also applied per-marker quality control filters to remove markers with low call rates, deviations from Hardy-Weinberg equilibrium, excess discordance among duplicates or identical twin genotypes, excess Mendelian inconsistencies or MAF = 0. Altogether, 890,542 autosomal markers and 16,325 X-linked markers that were unique were genotyped across the SardiNIA study samples. Among the autosomal markers passing quality control, 809,193 are array specific (60,966 from ExomeChip, 112,717 from Immunochip, 100,554 from Metabochip and 534,956 from OmniExpress) and 972 SNPs were typed on all 4 arrays. The remaining 80,377 SNPs were typed on 2 or 3 arrays. For the 870,108,399 genotypes assayed on >1 array, the genotype concordance rate was >99.99%. Our analyses include the 6,602 individuals that were successfully genotyped with all 4 arrays.
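As a consistency check on the marker counts above, the array-specific, shared and total autosomal counts can be tallied directly. All numbers are taken from the text; the variable names are illustrative.

```python
# Autosomal markers passing quality control, as quoted in the text.
array_specific = {
    "ExomeChip": 60_966,
    "Immunochip": 112_717,
    "Metabochip": 100_554,
    "OmniExpress": 534_956,
}
on_all_four = 972          # SNPs typed on all 4 arrays
on_two_or_three = 80_377   # SNPs typed on 2 or 3 arrays

n_array_specific = sum(array_specific.values())
n_unique_autosomal = n_array_specific + on_all_four + on_two_or_three

print(n_array_specific)     # 809193
print(n_unique_autosomal)   # 890542
```

Both totals agree with the figures reported for array-specific markers and for the scaffold of unique autosomal SNPs.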
Sequence data were generated at the Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna (CRS4) and at the University of Michigan Medical School Core Sequencing Laboratory. Libraries were generated from 3–5 μg of genomic DNA using sample preparation kits from Illumina and New England BioLabs. Paired-end sequence reads (typically 100 to 120 bp in length) were generated with Illumina Genome Analyzer IIx, Illumina HiSeq 2000 and Illumina HiSeq 2500 instruments. Samples were sequenced to an average depth of 4.16×. A single nuclear family (two parents and one child) was sequenced to an average depth >65× per individual to facilitate the assessment of genotyping error rates.
Reads were aligned to the human reference genome (GRCh37 assembly with decoy sequences, as available from the 1000 Genomes Project ftp site; see URLs) using BWA-0.5.9 (ref. 58), trimming read tails with average base quality <15. After alignment, base qualities were recalibrated, and duplicate reads were flagged and excluded from analysis. We reviewed the summary metrics generated using QPLOT59 and verifyBamId60 for each aligned sample, to remove samples with low sequencing depth, poor coverage of regions with high or low GC content, or evidence for sample contamination.
Variant calling and genotyping were carried out using our GotCloud pipeline (see URLs). Briefly, GotCloud organizes large sequence analysis jobs into many small jobs that can be distributed across a high-performance computing cluster. GotCloud previously contributed to variant calls for the 1000 Genomes Project and the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project. The approach examines all samples jointly to identify an initial variant list, improving the ability to detect low-frequency variants with low-coverage data. This initial list of variants is then annotated with information on sequencing depth, mapping quality, the ratio of the reference and alternate alleles at heterozygous sites, information on the evidence for alternate alleles by strand and read position, excess heterozygosity and other metrics. This information was used to build a support vector machine (SVM)-based classifier to distinguish between true variants (such as those seen in HapMap or validated by the 1000 Genomes Project using Omni arrays) and likely false positive variants. The list of likely false positives was seeded with variants that had extreme sequencing depth and unbalanced representation of reference and alternate alleles, both by strand and position. Finally, using the list of likely high-quality sites, genotypes were estimated using the haplotype-aware calling algorithms implemented in BEAGLE (see URLs), to generate initial haplotype estimates, and TrioCaller17, to refine this initial haplotype set. The entire computational process required approximately 20 CPU years of computing time (6 CPU years for quality control and alignment and 14 CPU years for variant discovery and genotyping). The likely functional impact of variants was annotated using CADD scores26 and the Ensembl Variant Effect Predictor20.
Variant discovery power.
To evaluate our power to discover rare variants through low-pass sequencing, we examined 1,068 samples that were both sequenced and genotyped with the 4 genotyping arrays previously described. The four arrays provided us with an incomplete but high-quality catalog of low-frequency variants in these samples. We organized these variants by frequency and tabulated the fraction of variants that were rediscovered in our sequencing-based analysis for each frequency bin. Overall, we estimate that our sequencing effort discovered ∼70% of the variants with frequency <0.5%, 98.8% of variants with frequency 0.5–5%, and >99% of variants with frequency >5% (Supplementary Table 2).
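The tabulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (pairs of variant identifier and array-based frequency), the function name and the handling of variants falling exactly on a bin boundary are hypothetical and are not taken from the analysis itself.

```python
# Frequency bins matching those reported in the text: (label, lower, upper),
# with a variant assigned to a bin when lower <= frequency < upper.
FREQUENCY_BINS = [
    ("<0.5%", 0.0, 0.005),
    ("0.5-5%", 0.005, 0.05),
    (">5%", 0.05, float("inf")),
]

def rediscovery_by_bin(array_variants, sequenced_ids, bins=FREQUENCY_BINS):
    """Fraction of array variants rediscovered by sequencing, per bin.

    array_variants: iterable of (variant_id, frequency) from the arrays.
    sequenced_ids:  set of variant identifiers found by sequencing.
    """
    counts = {label: [0, 0] for label, _, _ in bins}  # [rediscovered, total]
    for variant_id, freq in array_variants:
        for label, lower, upper in bins:
            if lower <= freq < upper:
                counts[label][1] += 1
                if variant_id in sequenced_ids:
                    counts[label][0] += 1
                break
    return {
        label: (found / total if total else None)
        for label, (found, total) in counts.items()
    }
```

For example, with four array variants of which three are present in the sequencing call set, the function returns one rediscovery fraction per frequency bin, and None for bins without any array variant.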
Haplotyping and imputation.
Genotypes were phased with MACH software61, using 30 iterations of the haplotyping Markov chain and 400 states per iteration. Imputation used Minimac software62 and a reference panel including haplotypes estimated by sequencing. To reduce the number of duplicated haplotypes, whenever a parent-offspring trio was sequenced, only the parental haplotypes were included in the imputation reference panel (resulting in 1,488 individuals for imputation). To reduce computational effort, we did not attempt to impute singleton variants. After imputation, we retained for association analysis only markers with an imputation quality (RSQR) >0.3 or >0.6 if the estimated MAF was ≥1% or <1%, respectively34. For comparison, we repeated imputation using the 1000 Genomes Project Phase 3 haplotype set (using all 2,504 available samples; November 2014 release) and used RSQR >0.3 for all variants as a filter for imputation accuracy, as suggested in ref. 34. This strategy led to the identification of 13.6 million and 12.7 million markers useful for analyses on the Sardinian-based and 1000 Genomes Project–based data sets, respectively.
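The frequency-dependent quality filter described above can be written as a short Python sketch. The function names are hypothetical; the thresholds are those stated in the text (RSQR >0.3 for MAF ≥1% and RSQR >0.6 for MAF <1% with the Sardinian panel, and RSQR >0.3 for all variants with the 1000 Genomes Project panel).

```python
def keep_sardinian_panel(rsqr, maf):
    """Filter for markers imputed with the Sardinian reference panel."""
    threshold = 0.3 if maf >= 0.01 else 0.6  # stricter for rarer variants
    return rsqr > threshold

def keep_1000g_panel(rsqr):
    """Filter for markers imputed with the 1000 Genomes Project panel."""
    return rsqr > 0.3
```

Under this rule, a variant with MAF of 0.5% and RSQR of 0.5 would be excluded from the Sardinian-based data set but retained in the 1000 Genomes Project–based data set.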
Estimates of imputation accuracy.
To further evaluate imputation accuracy, we carried out imputation using the Cardio-Metabochip, Immunochip and OmniExpress arrays as a scaffold and compared imputed genotypes with ExomeChip genotypes. This comparison excluded any markers that overlapped among the three scaffold arrays and the ExomeChip (Supplementary Fig. 4). To track the origin of the haplotypes used as templates during imputation, we interspersed dummy markers in the haplotypes, arbitrarily labeled with allele 1 for individuals recruited from the Lanusei valley and labeled with allele 0 for individuals recruited elsewhere in Sardinia.
Population structure analyses.
To calculate FST we used a random sampling of 200 unrelated individuals from the Lanusei valley and 200 individuals from the case-control cohort study together with all POPRES European populations with sample sizes greater than 15. To obtain unrelated Sardinian individuals, we removed a random individual from each putative related pair until no pairs of individuals had an estimated proportion of identity-by-descent (IBD) sharing ≥0.05 (as measured using PLINK on the basis of variants with MAF >5%). We calculated Weir and Cockerham FST values for all pairs of populations. Significance was assessed by 1,000 permutations of individual labels between a given pair of populations (Supplementary Fig. 2). Principal-component analysis was performed using EIGENSTRAT version 5.0 after removing one SNP from each pair of SNPs with r2 ≥0.8 (in windows of 50 SNPs and steps of 5 SNPs) as well as SNPs in regions known to exhibit extended long-range LD63. We first considered a subset of 400 unrelated Sardinians along with all POPRES European populations. We then considered the full set of sequenced genomes and projected samples into an existing principal-component analysis coordinate space, one at a time (Supplementary Fig. 3). This analysis requires a small adjustment to the placement of each sample, which otherwise would be shifted toward the origin64. To address this, we devised a regression-based empirical correction scheme (D.O.d.V. and J.N., unpublished data). The approach uses a leave-one-out procedure to learn how the shift effect depends on the principal-component values and then applies this correction to all projected values. This procedure is not sensitive to the inclusion of related samples, and we were thus able to project the full Sardinian sample. To display levels of allele sharing between populations at different allele frequencies we used a previously described metric29,30.
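The relatedness pruning described at the start of this section (random removal of one member of each pair with estimated IBD sharing ≥0.05) can be illustrated with the following Python sketch. The input layout, the function name and the fixed random seed are assumptions made for the example; the pairwise IBD estimates would come from PLINK, as stated in the text.

```python
import random

def prune_related(pairs, threshold=0.05, seed=0):
    """Return the set of individuals to remove so that no related pair remains.

    pairs: iterable of (id_a, id_b, ibd_proportion) for all sample pairs.
    A pair counts as related when ibd_proportion >= threshold.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    removed = set()
    for id_a, id_b, ibd in pairs:
        if ibd < threshold:
            continue  # unrelated pair: nothing to do
        if id_a in removed or id_b in removed:
            continue  # pair already broken by an earlier removal
        removed.add(rng.choice((id_a, id_b)))
    return removed
```

A single pass is sufficient here because every related pair ends with at least one member removed. This simple strategy does not attempt to maximize the number of retained individuals.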
We searched for evidence of association using EPACTS65, software that fits a linear mixed model adjusted with a genomic-based kinship matrix calculated using all quality-checked genotyped autosomal SNPs with MAF >1% (599,975 SNPs of the 890,542 total SNPs). The advantage of this model is that the kinship matrix encodes a wide range of sample structures, including both cryptic relatedness and population stratification. As evidence of appropriate adjustment for confounders, the genomic control parameter was 0.97, 0.99, 0.97, 1.01, 1.01, 1, 1.01, 1 and 1 for the analysis of LDL cholesterol, HDL cholesterol, total cholesterol, triglycerides, ADPN, hsCRP, IL-6, MCP-1 and ESR, respectively. Only additive effects for each allele were considered, and age, age2 and sex were included as covariates in all analyses. Trait measures were normalized by quantile transformation before analysis. For the inflammatory traits, we also included smoking status and body mass index (BMI) as covariates13.
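Two steps of this analysis lend themselves to a short illustration: the quantile normalization of traits and the genomic control parameter. The Python sketch below uses a rank-based inverse normal transformation with an offset of 0.5, which is one common convention and an assumption here (the text does not specify the exact transformation), and computes the genomic control parameter as the median association chi-square statistic divided by the median of a chi-square distribution with 1 degree of freedom. Function names are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 df (about 0.455)
CHISQ_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2

def inverse_normal_transform(values):
    """Map trait values to standard normal quantiles by rank.

    Ties are not averaged in this sketch; tied values receive
    adjacent ranks in their original order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, index in enumerate(order, start=1):
        transformed[index] = _STD_NORMAL.inv_cdf((rank - 0.5) / n)
    return transformed

def genomic_control_lambda(pvalues):
    """Median observed 1-df chi-square statistic over its expected median.

    P values must lie in the interval (0, 1].
    """
    chisq = [_STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2 for p in pvalues]
    return median(chisq) / CHISQ_1DF_MEDIAN
```

A value close to 1, as reported for all nine traits, indicates that the distribution of test statistics shows little overall inflation.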
To identify sex-specific effects, we first performed GWAS analysis separately for males and females using the same transformation and same covariates (excluding sex) as in the primary GWAS. We then assessed the significance of observed differences by testing the heterogeneity of effect sizes with a χ2 test implemented in METAL66.
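For two strata, a heterogeneity test of this kind reduces to a simple closed form. The Python sketch below computes Cochran's Q statistic for the male and female effect estimates and the corresponding P value from a chi-square distribution with 1 degree of freedom. It is an illustration of the general technique and is assumed, not confirmed, to correspond to the test as implemented in METAL; the function name is hypothetical.

```python
import math

def sex_heterogeneity_test(beta_male, se_male, beta_female, se_female):
    """Cochran's Q for two effect estimates and its 1-df chi-square P value.

    With inverse-variance weights and two groups, Q simplifies to the
    squared difference in effects over the sum of the squared standard errors.
    """
    q_stat = (beta_male - beta_female) ** 2 / (se_male ** 2 + se_female ** 2)
    p_value = math.erfc(math.sqrt(q_stat / 2.0))  # chi-square survival, 1 df
    return q_stat, p_value
```

For instance, effects of 0.3 in males and −0.1 in females, each with a standard error of 0.1, give Q = 8, whereas identical effects give Q = 0 and a P value of 1.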
Rare variant analysis.
We performed two region-based tests: CMC67 and VT68 tests. Both tests were implemented in EPACTS (see URLs) to account for family relationships in our GWAS. To perform these rare variant tests, we used all nonsynonymous SNPs and variants altering splicing, with MAF <5%. In each test, we assessed 10,000 regions and thus considered a Bonferroni threshold of P < 5 × 10−6 to declare significance.
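The variant selection and significance threshold described above can be sketched as follows. The annotation labels, data layout and function names are hypothetical, and the collapsing step shown is only the generic carrier indicator that underlies collapsing-based tests; it does not reproduce the CMC or VT statistics or the family-aware implementation in EPACTS.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold for n_tests tests."""
    return alpha / n_tests

def qualifying_variants(variants, max_maf=0.05,
                        classes=("nonsynonymous", "splice")):
    """Select variant IDs in the chosen annotation classes with MAF < max_maf.

    variants: iterable of (variant_id, annotation, maf); labels are assumed.
    """
    return [
        variant_id
        for variant_id, annotation, maf in variants
        if annotation in classes and 0.0 < maf < max_maf
    ]

def collapse_carriers(genotypes, variant_ids, n_samples):
    """Indicator (0 or 1) of carrying at least one qualifying rare allele.

    genotypes: dict mapping variant_id to per-sample alternate-allele counts.
    """
    return [
        1 if any(genotypes[v][i] > 0 for v in variant_ids) else 0
        for i in range(n_samples)
    ]
```

With 10,000 regions and a significance level of 0.05, bonferroni_threshold gives the threshold of 5 × 10−6 stated in the text.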
Calculation of variance explained.
The variance explained by the most strongly associated SNPs was calculated for each trait as the difference in adjusted R2 between the full and basic models, where the basic model included only phenotypic covariates (age, age2 and sex for lipid level traits; age, age2, sex, BMI and smoking status for the inflammatory markers) and the full model also included all the independent SNPs associated with a specific trait. Variance for all available SNPs was calculated using GCTA software48, taking account of both closely and distantly related pairs of individuals69. The set of all available SNPs included all quality-checked SNPs after removing those that were monomorphic in the subset of phenotyped individuals (this set is also called the 'accessible genome').
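The comparison of the two models can be illustrated with a small Python sketch. It assumes that the unadjusted R2 of each fitted model is already available and applies the standard adjustment for the number of predictors; the function names and argument layout are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Standard adjusted R2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def variance_explained_by_snps(r2_basic, k_basic, r2_full, k_full, n_samples):
    """Difference in adjusted R2 between the full and basic models.

    k_basic: number of covariates in the basic model (for example, 3 for
             age, age2 and sex).
    k_full:  number of covariates plus independent SNPs in the full model.
    """
    return (adjusted_r2(r2_full, n_samples, k_full)
            - adjusted_r2(r2_basic, n_samples, k_basic))
```

Because the adjustment penalizes each added predictor, SNPs that contribute no explanatory power do not increase the estimate.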
To identify independent signals, we performed GWAS analysis for each trait by adding the leading SNPs found in the primary GWAS as covariates to the basic model. A SNP reaching the classical genome-wide significance threshold (P < 5 × 10−8) was considered a significant independent signal, with the sole exception of rs72658864, which did not reach the threshold but whose association was supported by previous reports.
Estimate of the genome-wide significance threshold in Sardinians.
We defined a threshold for significance that applies to Sardinians when considering whole-genome sequencing data using empirical estimates (R package). We performed analyses in the SardiNIA cohort as well as in a cohort of 2,700 unrelated individuals from the Sardinian case-control study of multiple sclerosis and type 1 diabetes, who have been genotyped using OmniExpress and Immunochip arrays and imputed using the Sardinian reference panel. This additional cohort was used to ensure that no bias was introduced into the estimation of the threshold by the family structure of the SardiNIA study. The method consisted of simulating phenotypes under the null hypothesis and running single-marker association tests to calculate the threshold required to maintain a family-wise error rate of 5%. Association tests were performed for all the SNPs on chromosome 3, and the genome-wide significance threshold was then predicted assuming that the whole genome is approximately 15.6 times longer than chromosome 3 (ref. 70).
For the SardiNIA samples, we simulated 3 sets of 300 normally distributed phenotypes assuming 3 different heritabilities (20%, 40% and 70%) using Merlin (--simul option)71. We assumed no underlying quantitative trait loci (QTLs) among the genotyped and imputed variants. For the case-control study, we simulated 300 normally distributed phenotypes under the null hypothesis of no association. The results were highly comparable among all scenarios (Supplementary Table 6). To obtain a more accurate estimate, we increased the number of simulations up to 1,000 for all the phenotypes (except for the phenotype with 70% heritability because it is not a typical scenario in GWAS). We then calculated the genome-wide significance thresholds for analyses that aim to test all variants and for those that evaluate only variants with MAF >0.5%. Our estimates led to a significance threshold of P < 6.9 × 10−9 and of P < 1.4 × 10−8 for GWAS with all variants and with only variants with MAF >0.5%, respectively.
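One plausible reading of the empirical procedure is sketched below in Python. For each simulated null phenotype, the smallest P value across chromosome 3 is recorded; the chromosome-wide threshold is the empirical 5% quantile of these minima; and the genome-wide threshold is obtained by assuming that the effective number of independent tests scales with length, that is, by dividing by 15.6. The quantile convention, the scaling rule and the function names are assumptions of this sketch and may differ from the method of ref. 70 and from the R package that was used.

```python
GENOME_TO_CHR3_LENGTH_RATIO = 15.6  # ratio stated in the text

def chromosome_threshold(min_pvalues, alpha=0.05):
    """Empirical P-value threshold controlling the family-wise error rate.

    min_pvalues: smallest chromosome-wide P value from each null simulation.
    Declaring significance at P < threshold yields a false positive in at
    most a fraction alpha of the simulations.
    """
    if not min_pvalues:
        raise ValueError("at least one simulation is required")
    ordered = sorted(min_pvalues)
    k = int(alpha * len(ordered))  # number of simulations allowed below
    return ordered[k]

def genome_wide_threshold(min_pvalues_chr3, alpha=0.05,
                          length_ratio=GENOME_TO_CHR3_LENGTH_RATIO):
    """Extrapolate the chromosome 3 threshold to the whole genome."""
    return chromosome_threshold(min_pvalues_chr3, alpha) / length_ratio
```

Restricting the tested variants (for example, to those with MAF >0.5%) reduces the number of effective tests and therefore yields a less stringent threshold, consistent with the two values reported above.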
We searched for replication of the 2 new signals associated with hsCRP in 7,689 individuals from 8 European cohorts (TwinsUK, FVG, VBI, HA, HP, ALSPAC, INCIPE1 and INCIPE2)72,73,74,75; ESR, MCP-1 and ADPN values were not available in these samples. In TwinsUK and ALSPAC, we analyzed genotypes from whole-genome sequence data76, whereas for the FVG, VBI, HA, HP, INCIPE1 and INCIPE2 cohorts we used genotypes imputed from the 1000 Genomes Project Phase 1 sequencing panel. Specific details on each cohort are provided in the Supplementary Note. Association was evaluated by fitting a linear regression model that included age and sex as covariates, using GEMMA (TwinsUK, FVG, VBI, HA and HP) or SNPTEST (ALSPAC, INCIPE1 and INCIPE2) (see URLs). Normalization was not applied to the traits.
Exome Aggregation Consortium (ExAC) Browser, http://exac.broadinstitute.org/; GotCloud, http://genome.sph.umich.edu/wiki/GotCloud; EPACTS, http://genome.sph.umich.edu/wiki/EPACTS; GCTA, http://www.complextraitgenomics.com/software/gcta/; GEMMA, http://www.xzlab.org/software.html; SNPTEST, http://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html; UCSC Genome Browser, http://genome.ucsc.edu/; Genevar eQTL browser, http://www.sanger.ac.uk/resources/software/genevar/; NCBI eQTL browser, http://www.ncbi.nlm.nih.gov/projects/gap/eqtl/index.cgi; Gilad/Pritchard groups eQTL browser, http://eqtl.uchicago.edu/; Exome Picks, http://genome.sph.umich.edu/wiki/ExomePicks; ExomeChip, http://genome.sph.umich.edu/wiki/Exome_Chip_Design; 1000 Genomes Project data repository, ftp://ftp.1000genomes.ebi.ac.uk/; Beagle, http://faculty.washington.edu/browning/beagle/beagle.html.
We thank all the volunteers who generously participated in this study and made this research possible. This research was supported by National Human Genome Research Institute grants HG005581, HG005552, HG006513, HG007022 and HG007089; by National Heart, Lung, and Blood Institute grant HL117626; by the Intramural Research Program of the US National Institutes of Health, National Institute on Aging, contracts N01-AG-1-2109 and HHSN271201100005C; by Sardinian Autonomous Region (L.R. 7/2009) grant cRP3-154; by the PB05 InterOmics MIUR Flagship Project; by grant FaReBio2011 'Farmaci e Reti Biotecnologiche di Qualità'; by a US National Institutes of Health National Research Service Award (NRSA) postdoctoral fellowship (F32GM106656) to C.W.K.C.; and by the UC MEXUS/CONACYT fellowship to V.D.O.d.V. The replication cohorts acknowledge the use of data generated by the UK10K Consortium, supported by Wellcome Trust award WT091310. The UK10K research was specifically funded by a Wellcome Trust award, '10,000 UK Genome Sequences: Accessing the Role of Rare Genetic Variants in Health and Disease' (WT091310/C/10/Z). The research of N.S. is supported by the Wellcome Trust (grants WT098051 and WT091310), the European Union's Seventh Framework Programme (EPIGENESYS grant 257082 and BLUEPRINT grant HEALTH-F5-2011-282510) and the National Institute for Health Research (NIHR) Biomedical Research Centre (BRC). The ING-FVG cohort was supported by grant Ministero della Salute—Ricerca Finalizzata PE-2011-02347500 (to P.G.); the ING-VB study thanks the inhabitants of Val Borbera for participating in the study, M. Traglia, C. Sala and C. Masciullo for data management, and the funding sources Fondazione Cariplo (Italy), the Ministry of Health, Ricerca Finalizzata (Italy) 2008, 2011-2012, and the Public Health Genomics Project 2010.
The HELIC cohorts are thankful to the residents of the Pomak villages and the Mylopotamos villages for participating and to their funding sources, including the Wellcome Trust (098051) and the European Research Council (ERC-2011-StG 280559-SEPI).
Supplementary Note, Supplementary Figures 1–10 and Supplementary Tables 1–17.
Nature Genetics (2018)