A rare variant in APOC3 is associated with plasma triglyceride and VLDL levels in Europeans

The analysis of rich catalogues of genetic variation from population-based sequencing provides an opportunity to screen for functional effects. Here we report a rare variant in APOC3 (rs138326449-A, minor allele frequency ~0.25% (UK)) associated with plasma triglyceride (TG) levels (−1.43 s.d. (s.e.=0.27 per minor allele (P-value=8.0 × 10−8)) discovered in 3,202 individuals with low read-depth, whole-genome sequence. We replicate this in 12,831 participants from five additional samples of Northern and Southern European origin (−1.0 s.d. (s.e.=0.173), P-value=7.32 × 10−9). This is consistent with an effect between 0.5 and 1.5 mmol l−1 dependent on population. We show that a single predicted splice donor variant is responsible for association signals and is independent of known common variants. Analyses suggest an independent relationship between rs138326449 and high-density lipoprotein (HDL) levels. This represents one of the first examples of a rare, large effect variant identified from whole-genome sequencing at a population scale.

. Distribution of log transformed and native triglyceride levels with reference to both adult thresholds for hypertriglyceridaemia and HDL levels in children from the ALSPAC study and adults from the 1958 birth cohort.
(a) Histogram of natural log transformed circulating triglyceride levels in ALSPAC participants (aged 9 years). Orange markers highlight the TG values of carriers of rs138326449 minor alleles (all heterozygotes, n=17 out of 5,040 with lipid data). Black diamond marks the mean natural log transformed triglyceride value for carriers with filled diamonds showing 95%CI for this estimate. (b) Vertical line indicates the current accepted (adult) threshold for hypertriglyceridaemia (<150 mg/dL (<1.7 mmol/L)) over-layed with the native distribution of triglycerides in ALSPAC participants and a scatter of their circulating HDL levels. Plots (c) and (d) are the same as (a) and (b), but derived from adults within the 1958 birth cohort (biomedical clinic run at age 42 years). (c) contains the additional stratification of rs138326449 minor allele carrier triglyceride status in the presence and absence of lipid lowering drugs (those adhering to treatment are plotted as red crossed for values and blue markers for summary).

Supplementary Figure 7. Predicted impact of splice donor variant rs138326449 on mRNA maturation.
In the APOC3 rs138326449 wild type (G marked in red) allele the gene is transcribed into mRNA composed of four exons (a). The first codon is located in the exon 2 of the gene whereas the stop codon is in the exon 4. Exon 1 and a part of exon 2 code for 5'UTR. In the APOC3 rs138326449 minor (A marked in red) allele the exon 2 of the APOC3 gene is likely to be skipped because its splicing donor site will no longer be recognised (b). The next ATG codon available is located in the exon 3. Translation from the ATG in the exon 3 will cause a frame shift and a formation of a premature stop codon after 22 amino acids with unlikely functionality. Summary statistics are given for three variants independently associated with TG levels (based on GCTA conditional analysis). †Analysis of splice donor variant rs138326449 conditional on genotype status at the three SNPs suggest an independent contribution to TG in this region. ¥ Indicates loci previously reported in association with plasma lipid-subfractions in genomewide association analysis 3,4 .  Table S5 reports triglyceride specific SKAT results for genes contained within the APOC3 region along with counts of the contributing variants contained within them.

Low read-depth whole genome sequencing (cohorts dataset)
Low read-depth whole-genome sequencing (WGS) was performed at both the Wellcome Trust Sanger Institute (WTSI) and the Beijing Genomics Institute (BGI). DNA (1-3μg) was sheared to 100-1000 bp using a Covaris E210 or LE220 (Covaris, Woburn, MA, USA). Sheared DNA was subjected to Illumina paired-end DNA library preparation. Following size selection (300-500 bp insert size), DNA libraries were sequenced using the Illumina HiSeq platform as paired-end 100 base reads according to manufacturer's protocol.

Alignment and BAM processing
Data generated at the Sanger Institute and BGI were aligned to the human reference separately by the respective centres. The BAM files 1 produced from these alignments were submitted to the European Genome-phenome Archive (EGA). The Vertebrate Resequencing Group at the Sanger Institute then performed further processing.

Alignment
Sequencing reads that failed QC were removed using the Illumina GA Pipeline, and the rest were aligned to the GRCh37 human reference, specifically the reference used in Phase 1 of the 1000 Genomes Project (ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz). Reads were aligned using BWA (v0.5.9-r16) 2 . This involved the following steps: For calling on chromosome X and Y, the following settings were applied. The pseudo-autosomal region (PAR) was masked on chromosome Y in the reference fasta file. Male samples were called as diploid in the PAR on chromosome X, and haploid otherwise. Diploid/haploid calls were made using the -s option in bcftools view. The PAR regions were: X-PAR1 (60,001-2,699,520); X-PAR2 (154,931,044-155,260,560); Y-PAR1 (10,001-2,649,520); Y-PAR2 (59,034,050-59,363,566).

INDEL pre-filtering
The observation of spikes in the insertion/deletion ratio in sequencing cycles of a subset of the sequencing runs were linked to the appearance of bubbles in the flow cell during sequencing. To counteract this, the following post-calling filtering was applied. The bamcheck utility from the samtools package was used to create a distribution of INDELs per sequencing cycle. Lanes with INDELs predominantly clustered at certain read cycles were marked as problematic, specifically where the highest peak was 5x bigger than the median of the distribution. The list of problematic lanes included 159 samples. In the next step we checked mapped positions of the affected reads to see if they overlapped with called INDELs, which they did for 1,694,630 called sites. The genotypes and genotype likelihoods of affected samples were then set to the reference genotype unless there was a support for the INDEL also in a different, unaffected lane from the same sample. In total, 140,163 genotypes were set back to reference and 135,647 sites were excluded by this procedure. Note that this step was carried out on raw, unfiltered calls prior to VQSR filtering.

Site filtering
Variant Quality Score Recalibration (VQSR) 6 was used to filter sites. For SNVs, the GATK (version 1.3-21) UnifiedGenotyper was used to recall the sites/alleles discovered by samtools in order to generate annotations to be used for recalibration. Recalibration for the INDELs used annotations derived from the built-in samtools annotations. The GATK VariantRecalibrator was then used to model the variants, followed by GATK ApplyRecalibration, which assigns VQSLOD (variant quality score log odds ratio) values to the variants. SNVs and INDELs were modeled separately, with parameters given below: The truth set includes sites defined as truly showing variation from the reference (GRCh37). VQSLOD scores are calibrated by how many of the truth sites are retained when sites with a VQSLOD score below a given threshold are filtered out. For SNV sites a truth sensitivity of 99.5%, which corresponded to a minimum VQSLOD score of -0.6804 was selected, i.e. for this threshold 99.5% of truth sites were retained. For INDEL sites a truth sensitivity of 97%, which corresponded to a minimum VQSLOD score of 0.5939 was chosen. Finally, we also introduced the filter p<10 -6 to remove sites that failed the Hardy-Weinberg equilibrium (HWE, 302,388 sites removed) and removed sites with evidence for differential frequency (logistic regression p-value > 1e-2) between samples sequenced at BGI and WTSI (277,563 sites removed).

SNVs INDELs
The VQSLOD score and other annotations from GATK (BaseQRankSum, Dels, FS, HRun, HaplotypeScore, InbreedingCoeff, MQ0, MQRankSum, QD, ReadPosRankSum, culprit) were copied back to the original samtools calls, excluding annotations which already existed in or did no apply to the samtools VCFs (DP and MQ, AC, AN). Each VCF further contained the filters LowQual (a low quality variant according to GATK) and MinVQSLOD (Variant's VQSLOD score is less than the cutoff). All sites that did not fail these filters were marked as PASS and brought forward to the genotype refinement stage.
To investigate the presence of batch effects, we computed the pairwise identity by state (IBS) metrics for a joint dataset of 3621 individuals using an LD-pruned genotype set of 2,203,581 markers and performed a multidimensional scaling analysis (MDS) on 10 dimensions (PLINK,v1.07, options: -indep-pairwise , window size: 5000 SNPs, step: 1000 SNPs, R^2 : 0.2; --mds-plot 10). Given the presence of structure by genotyping batch, we ran a genomewide association analysis for the binary variable "sequencing center" ("BGI" / "SANGER") using a logistic regression model. 335,982 SNPs were associated with batch at an inclusive threshold of p-value <= 0.01 and formed a list that were subsequently filtered out from the genotype set, removing the batch effect due to sequencing centre.

Post-genotyping sample QC
Of the 4,030 samples (1,990 TwinsUK and 2,040 ALSPAC) that were submitted for sequencing, 3,910 samples (1,934 TwinsUK and 1,976 ALSPAC) were sequenced and went through the variant calling procedure. Low quality samples were identified before the genotype refinement by comparing the samples to their GWAS genotypes using approximately 20,000 sites on chromosome 20. Comparing the raw genotype calls to existing GWAS data, we removed a total of 112 samples (64 TwinsUK and 48 ALSPAC) because of one or more of the following causes: (i) high overall discordance to SNP array data (>3%) (55 TwinsUK and 36 ALSPAC), (ii) heterozygosity rate > 3SD from population mean (1 TwinsUK and 1 ALSPAC), suggesting contamination (iii) no SNP array data available for that sample (7 TwinsUK and 0 ALSPAC) and (iv) sample below 4x mean read-depth (1 TwinsUK and 11 ALSPAC). Overall, 3,798 samples (1,870 TwinsUK and 1,928 ALSPAC) were brought forward to the genotype refinement step (the number used in association analysis will of course vary according to phenotypic data available).

Genotype Refinement
The missing and low confidence genotypes in the filtered VCFs were refined out through an imputation procedure with BEAGLE 4, rev909 7 . The program was run with default parameters. VCFs were split into chunks each containing a maximum of 3,000 sites plus 1,000 sites in buffer regions, that is, 500 on either side. Multiallelic sites were included in the imputation. It took 882 CPU weeks to complete. After imputation, chunks were recombined using the vcf-phased-join script from the vcftools [vcftools] package.

Post-refinement sample QC
Additional sample-level QC steps were carried out on refined genotypes, leading to the exclusion of additional 17 samples (16 TwinsUK and 1 ALSPAC) because of one or more of the following causes: (i) non-reference discordance (NRD) with GWAS SNV data > 5% (12 TwinsUK and 1 ALSPAC), (ii) multiple relations to other samples (13 TwinsUK and 1 ALSPAC), (iii) failed sex check (3 TwinsUK and 0 ALSPAC). To identify these samples we pruned the WGS data to a set of independent SNVs and calculated genome-wide average identity by state between each pair of samples across the two cohorts. Samples were removed if they had more than 25 relations with IBS>0.125 (a high number of relationships may indicate contamination). The resulting set of contaminated samples corresponded almost completely to the set of samples with NRD>5%. This left a final set of 3,781 samples (1,854 TwinsUK and 1,927 ALSPAC). These VCF files were submitted to the EGA.
To exclude the presence of participants of non-European ancestry in our dataset, we merged a pruned dataset to the 11 HapMap3 populations 8 and performed a principal components analysis (PCA) using EIGENSTRAT 9 . A total of 44 participants (12 TwinsUK and 32 ALSPAC) did not cluster to the European (CEU) cluster of samples and were removed from association analyses. We further sought to flag related individuals for exclusion in association tests. Although association methods that account for family relatedness are available both for common and rare variant tests 10 , we opted to remove these individuals for ease of analysis for ease and speed of analysis. Overall, 69 samples (36 TwinsUK and 33 ALSPAC) were flagged because of relatedness greater than third degree relatedness (IBD>0.125). Finally 63 co-twin samples (42 dizygotic and 21 monozygotic) and three duplicate samples were removed from TwinsUK. The final sequence data set that was used for the association analyses comprises 3,621 samples (1,754 TwinsUK and 1,867 ALSPAC).

Re-phasing
SHAPEIT2 11 was then used to rephase the genotype data. The VCF files were converted to binary ped format. Multiallelic and MAF<0.02% (singleton and monomorphic) sites were removed. Files were then split into 3Mbp chunks with +/-250kbp flanking regions. SHAPEIT (v2.r727) was used to rephase the haplotypes with the following command line option in phase mode: --thread 4 --window 0. These are the final VCF files released for the project. An imputation reference panel in the IMPUTE2 format created from these VCF files are also made available.

GW SNP array data
For each of the cohorts, we had additional GWA data available. For ALSPAC, 6,557 samples were measured on Illumina HumanHap550 arrays and passed QC (population stratification, sex check, heterozygosity and relatedness (IBS>0.125)). For TwinsUK, 2,575 samples were genotyped on Illumina HumanHap300 or Illumina Human610 arrays. These samples passed QC on relatedness (IBS>0.125), population stratification, heterozygosity, zygosity and sex checks. Samples from the imputed datasets were unrelated to the sequence datasets (IBS>0.125). Variants discovered through WGS of the TwinsUK and ALSPAC cohorts along with those known from 1000 Genomes were imputed into the full genome-wide association study genotyped cohorts increasing the sample size for single point association analysis to 12,724 subjects.

Imputation of UK10K & UK10K+1000GP data into SNP arrays
We imputed into available GWAS data using a combined UK10K as a reference and a UK10K+1000GP panel. We developed new functionality in IMPUTE2 12 that uses each reference panel to impute the missing variants in its counterpart, and then combine the two reference panels at the union set of sites. We tested the three reference panels for imputing three SNP array data, a sub-sample of 1,000 individuals from the UK10K WGS dataset, four European samples (3 CEU, 1 TSI) sequenced by Complete Genomics (CG, depth: 80X) 13 , and an Italian isolate genotyped on core-exome SNP array. Using 3,781 UK10K genomes consistently improved imputation quality compared to using the 1000GP panel alone and whilst the combined UK10K+1000GP panel does not substantively increase imputation accuracy compared to UK10K panel alone, this does increase the total number of imputable variants 14 .
Replication samples ALSPAC GWAS. The ALSPAC GWAS dataset was imputed using the UK10K reference panel and drawn from the same cohort described earlier and includes all cohort participants that were not selected from whole-genome sequencing and for which genome-wide SNP arrays were available. Overall, 6,557 samples not contained in the WGS dataset were genotyped on Illumina HumanHap550 arrays and passed quality control metrics as described in 15 . Lipid measurements were as described earlier.
TwinsUK GWAS. The TwinsUK GWAS dataset was imputed using the UK10K reference panel and includes all participants from the TwinsUK with genome-wide SNP array data and not sharing family relatedness with the WGS dataset. A total of 2,575 study participants genotyped on Illumina HumanHap300 or Illumina Human610 arrays passed QC metrics and were available for analysis 16 . Lipid measurements were as described earlier. The majority of replication samples were fasted before measurement (87%).  Isolates) collection focuses on the Mylopotamos villages on Crete, Greece. Recruitment of this population-based sample was primarily carried out at the village medical centres. All individuals were older than 17 years and had to have at least one parent from the Mylopotamos area. The study includes biological sample collection for DNA extraction and lab-based blood measurements, and interview-based questionnaire filling. The phenotypes collected include anthropometric and biometric measurements, clinical evaluation data, biochemical and haematological profiles, self-reported medical history, demographic, socioeconomic and lifestyle information. The Harokopio University Bioethics Committee approved the study and informed consent was obtained from every participant. Lipid measurements. Lipid traits were assessed using enzymatic colorimetric assays and included; total cholesterol (cholesterol oxidase -phenol aminophenazone method), high-density lipoprotein (HDL)-cholesterol and TGs (glycerol-3-phosphate oxidase -phenol aminophenazone). Low Density Lipoprotein (LDL)-cholesterol levels were calculated according to Friedewald equation 19 . Of the 1282 Helic Manolis individuals with lipid data, 849 were recorded as fasting when the biochemical measurements were taken.

Validation genotyping
For ALSPAC, the entire cohort (10,145 participants, including 38 carriers of the rare A allele) was genotyped using KASP at KBioscience (www.lgcgenomics.com/). Genotyping was undertaken at KBioscience where KASP genotyping was used. Assays are based on competitive allele-specific PCR and enable bi-allelic scoring of single nucleotide polymorphisms (SNPs) and insertions and deletions (Indels) at specific loci. The SNP-specific KASP Assay mix and the universal KASP Master mix are added to DNA samples, a thermal cycling reaction is then performed, followed by an end-point fluorescent read. Bi-allelic discrimination is achieved through the competitive binding of two allelespecific forward primers, each with a unique tail sequence that corresponds with two universal FRET (fluorescence resonant energy transfer) cassettes; one labelled with FAMTM dye and the other with HEXTM dye.
For TwinsUK, genotyping accuracy was evaluated against a dataset comprising of ~250 high-coverage exomes sequenced in overlapping samples 20 . Of the 6 carriers detected in our study, four were overlapping and correctly called also in the exome dataset, yielding a genotyping accuracy of 100%.
There was 100% concordance with the genotypes called from the whole-genome dataset.