Whole-exome sequencing in an isolated population from the Dalmatian island of Vis

We have whole-exome sequenced 176 individuals from the isolated population of the island of Vis in Croatia in order to describe exonic variation architecture. We found 290 577 single nucleotide variants (SNVs), 65% of which are singletons, low frequency or rare variants. A total of 25 430 (9%) SNVs are novel, previously not catalogued in NHLBI GO Exome Sequencing Project, UK10K-Generation Scotland, 1000Genomes Project, ExAC or NCBI Reference Assembly dbSNP. The majority of these variants (76%) are singletons. Comparable to data obtained from UK10K-Generation Scotland that were sequenced and analysed using the same protocols, we detected an enrichment of potentially damaging variants (non-synonymous and loss-of-function) in the low frequency and common variant categories. On average 115 (range 93–140) genotypes with loss-of-function variants, 23 (15–34) of which were homozygous, were identified per person. The landscape of loss-of-function variants across an exome revealed that variants mainly accumulated in genes on the xenobiotic-related pathways, of which majority coded for enzymes. The frequency of loss-of-function variants was additionally increased in Vis runs of homozygosity regions where variants mainly affected signalling pathways. This work confirms the isolate status of Vis population by means of whole-exome sequence and reveals the pattern of loss-of-function mutations, which resembles the trails of adaptive evolution that were found in other species. By cataloguing the exomic variants and describing the allelic structure of the Vis population, this study will serve as a valuable resource for future genetic studies of human diseases, population genetics and evolution in this population.


INTRODUCTION
Recent advances in genotyping and sequencing technologies have opened a route to a new dimension of population studies, enabling the development of clear insight into the past of any population and corroboration of existing evidence from the domains of palaeontology, archaeology and historical evidence with unprecedented validation and precision. This type of analysis is interesting not only on the global scale, 1 but also on a local scale, particularly in the case of special and isolated populations. Such populations may retain their genetic isolation and uniqueness due to a number of possible factors, including geographical, ethnic or linguistic barriers, and have been estimated to encompass over 11.5 million individuals in Europe alone. 2 The Croatian Adriatic islands are geographically isolated, habitatunique localities characterized by distinctive population histories, which include varying founding times and consequent population age, severe plague bottlenecks and massive waves of emigration due to deteriorating economical conditions. 3 The genetic structure was also affected by historic events, such as near annihilation of the island populations in the conflicts with the ancient Roman Empire, a shift caused by the massive Slavs arrival in~700 AD and clashes with the Ottoman Empire. [3][4][5][6] Genetic studies revealed a reduction of haplotype diversity in Vis compared to an outbread population from Scotland and presented a founder effect in Vis mtDNA sequences. 7 Besides drift and increased homogeneity, isolated populations also tend to harbour unique rare variants, 8 making them very useful tools in genetic association studies. 9 These properties have contributed to the development of the 10 001 Dalmatians resource (http://www.mefst.hr/ default.aspx?id = 826), the largest research-oriented biobank in Croatia, now commonly used in genetic studies worldwide. 10 The aim of this study is to perform an exhaustive exploratory analysis of the exomic structure of the modern-day population of the island of Vis. By cataloguing exomic variants, investigating their frequencies and functional effects, and examining autozygosity, we describe the allelic architecture of this isolated population, providing the basis for a valuable resource for future genetic studies focusing on disease susceptibility, population genetics and evolution.

Participant recruitment and sample collection
We conducted exome-wide sequencing of 193 individuals from the isolated population of the island of Vis, Croatia. The sample for this study was based on the initial cohort of 1026 participants, which were initially recruited in the CROATIA-Vis study (the 10 001 Dalmatians project) between 2003 and 2004. 11 The participants were recruited on the basis of vital registries, postal invitations and other means of invitations. In total, 193 participants (38% men) were selected for the purpose of this study on the basis of three criteria. The first criterion was that a participant originated from the island of Vis, which was verified by the Parish records that were reconstructed for the period of 1850 onwards, and later corroborated by the genealogical records provided by the subjects. Second, we used the participants whose DNA was successfully extracted and who were not identified as genetic outliers in the initial CROATIA-Vis study by principal components analysis of genome-wide, Illumina HumanHap300 (San Diego, CA, USA) genotypes. Last, we used ANCHAP, a method for detecting identity by descent in isolated populations, to select a sample of participants which maximised the whole sample haplotypes representation. 7 Mean genomic kinship of the sample was 0.002, as estimated by the KING algorithm.
All subjects were asked to provide written consent, after being informed on the study goals and main approaches, in accordance with the Declaration of Helsinki. The study was approved by the ethics committees of the University of Zagreb (No. 018057) and the University of Split School of Medicine (No. 2181-198-A3-04110-11-0008), Croatia and the Multi-Centre Research Ethics Committee for Scotland (No. 01/0/71).

Exome capture and sequencing
Sequencing was performed at the Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK. The exomes were captured from blood genomic DNA, using SureSelect Human All Exon SeqCap pulldown technology and were sequenced with 75 bp paired-end reads on Illumina HiSeq platform according to manufacturer's protocol. The capture kit used for exome enrichment was Agilent's SureSelect Human All Exon 50 Mb (Agilent, Santa Clara, CA, USA), targets 51.8 Mb of human genome by design and encompasses unified set of coding exons annotated on August, 2010 by GENCODE or CCDS databases (including 10 bp of flanking sequence for each consensus coding DNA sequence). The probe design also includes small non-coding RNA regions annotated by miRBase v.13 (Manchester, UK) and Rfam (Hinxton, UK) databases, and overall, reports a 100% breadth of coverage.
Compared to the CCDS database, the total of 653 828 probes tile to 97% of targeted bases. If regions captured upstream and downstream of targets are considered, the kit targets 90.5 Mb of genome, providing additionally a broad coverage of non-coding DNA in exon-flanking regions (promoters and untranslated regions (UTRs)).

Variant calling and annotation
Details of the workflow used for single nucleotide variant (SNV) calling on the exome data are given in Supplementary material. Called SNVs were annotated with dbSNP137 rsIDs and 1000Genomes super population allele frequencies that were extracted from the final 1000Genomes Phase 1 integrated (v3) callset. Functional annotations were called with the Ensembl Variant Effect Predictor v2.8 against Ensembl 70, which provided coding consequence predictions and SIFT, PolyPhen and Condel annotations as well as GERP conservation scores (http://www.ensembl.org/info/docs/tools/vep/index.html).
The same calling and annotation protocol was also applied to the Generation Scotland UK10K_OBESITY_GS whole-exome data, that was used for comparison with Vis dataset (UK10K-GS-release 2012-11-27 variant dataset; after the initial quality control dataset included n = 377 samples from Scotland ascertained on the basis of body mass index (BMI)440 or pedigrees discordant for BMI). Variants in the UK10K-GS dataset were generated by whole-exome sequencing (WES) as part of the UK10K project that used the same targeted regions, sequencing protocols and downstream bioinformatics pipeline as our study.
The generated WES data have been submitted to the European Genomephenome Archive (https://www.ebi.ac.uk/ega/home) with the accession numbers EGAS00001000336 and EGAD00001001387. The median value of aligned-read depth on the target regions per subject was 110 × (78-193) with 51.5 Mb (99.5% of target regions) covered by at least 30 × , and 39.9 Mb (77.2%) covered by at least 50 × . For sequences falling outside the target but within the calling regions, 29.2 Mb (75.3% of out-target calling region) were covered by at least 30 × read and 15.6 Mb (40.2%) were covered by at least 50 × reads. The analysis of aggregate on target T i /T v ratio implied that the calling pipeline produced a good quality variant callset (Supplementary material, T i /T v ratio).

Performance of whole-exome sequencing
Standard sample and SNV quality controls were performed on the Vis and the UK10K-GS multi-sample called exome-sequence datasets, and are described in detail in Supplementary material. Close relatives were removed from both samples based on pairwise identity by descent sharing. Only bi-allelic SNVs were included in the study.

General description of Vis exome sequence
We examined the distribution of all SNVs according to internal, sample specific allele frequency categories. We also examined the proportion of functional effects in each allele frequency category. For description purposes we used the most deleterious effect of the variant according to severity estimated by Ensembl (http://www.ensembl.org/info/docs/variation/predicted_data.html). For the data representation in the main text we have summarised Ensembl consequence annotations into fewer categories (Supplementary Table 1).
To calculate dN/dS ratios we determined the number of non-synonymous (missense, stop_lost, stop_gained or initiator_codon variants) and synonymous (synonymous and stop_retained variants) substitutions and adjusted the values for multiple substitutions. Only substitutions within known protein-coding DNA sequence were analysed.
Autozygosity was detected through runs of homozygosity (ROH) on a set of common, independent SNVs. Owing to reduced SNV density in such set (~50 k), only long ROHs (45 Mb) were analysed. Calls were made in PLINK v1.07 (http://pngu.mgh.harvard.edu/Bpurcell/plink/) using parameters optimised for detection of ROHs in WES sequences. 12 ROH hotspots and coldspots were identified if the SNP-wise ROH frequency was above or below the 95%th percentile of the level of sharing, respectively.

Loss of function variants
SNVs predicted by the Ensembl Variant Effect Predictor as stop-gained (nonsense) or splice site-disrupting (splice donor or acceptor) SNVs were defined as loss of function (LoF) variants. In total 1775 putative LoF variants were obtained in Vis. Potential LoF variants that were likely due to reference errors or annotation artefacts were filtered. SNVs were marked as false positives if the inferred LoF allele was also the ancestral state (73 SNVs in Vis and 115 in UK10K-GS) indicating that a gain-of-function allele was the recent mutation on a site. We also excluded LoF SNVs (24 in Vis and 17 in UK10K-GS) that were identified as major allelic variant (MAF40.5) in our samples, and in all super populations from the 1000Genomes project: European (EUR), Asian (ASN), African (AFR), Ad Mixed American (AMR). At these sites the reference genome differed from majority of analysed humans indicating a probable reference genome error. In example, at these sites we observed extremely high frequency of homozygous LoF variants (0.26-0.89) (Supplementary Table 2). After filtering, there were 1678 remaining LoF variants in Vis, and 3149 in the UK10K dataset.
The analysis of over-representation of Gene Ontology (GO) annotations and the pathway analysis were performed on a set of genes containing LoF variants using ConsensusPathDB-human database (http://consensuspathdb.org/). 13 False discovery rate q-value of 0.1 and a P-value of 0.05 were used to identify over-represented functional groups.

General description of Vis exome sequence
Overall, the performance of whole-exome sequencing with high percentage of reads uniquely mapped to target regions, with alignedread depth on target regions of 110 × , and 51.5 Mb (99.5%) of target regions covered by at least 30 × ; indicated efficient targeted sequencing.
We found 290 577 SNVs of which 65% were singletons, low frequency or rare (Table 1, more detailed category definitions are provided in Supplementary Table 3). Rare variants (MAF ⩽ 0.01) were the most prevalent in both LoF and nonsynonymous variants, with the percentage of rare SNVs among LoF variants significantly outnumbering the corresponding percentage among nonsynoymous SNVs. On contrary, the percentages of both low (0.01oMAF ⩽ 0.05), and common (MAF40.05) frequency variants among nonsynonymous SNVs was significantly higher than among LoF SNVs. Relative to UK10K-GS, the Vis sample shows a depletion of rare variants and excess of low frequency and common variants (Supplementary Figure 2b). The comparison of potentially deleterious variants (non-synonymous and LoF) showed the same pattern, that is, a decrease of rare, and an excess of low and common frequency variants in Vis (Supplementary Figure 2b). This difference was not the consequence of different sample sizes. When we randomly re-sampled (n = 100) the same number of individuals from each population and compared allele frequencies of variants shared by both populations, again a decrease in proportion of shared-rare variants (p 7.1 × 10 − 27 ), and increase in proportion of shared low-frequency variants (p 7.0 × 10 − 20 ) were observed in Vis (Supplementary Figure 2c). Moreover, frequency of shared variants declared as rare or low frequency in Vis was significantly increased compared to the same variants in UK10K-GS. Median difference in allele frequency per shared variant was 0.0008 for rare and 0.003 for low-frequency variants, which corresponds in both cases to 8% of the total frequency range of particular frequency category (one sample U-test for median of differences against zero, P ⩽ 3.2 × 10 − 20 ).
The average genome-based kinship in Vis sample was low (0.002) indicating that, as expected, unrelated individuals were selected in the study. On the other hand, the average individual autozygosity in Vis that was derived from long ROHs (45 Mb) was consistent with second-cousin relationship, F ROH45Mb 0.017, standard error of mean (SEM) 0.004. 14 Moreover, in comparison to UK10K-GS (F ROH45Mb 0.001, SEM 0.0007), a significantly higher proportions of individuals with long ROHs were found in Vis (1 versus 11% in Vis, χ 2 , P = 4.34 × 10 − 7 ). Median lengths per affected individual in Vis of total and per segment ROH length were 10 Mb (range, 7-35) and 9 Mb (7-22), respectively, whereas the corresponding values in UK10K were 9 Mb (5-16) for both parameters.

Novel variants
We identified a total of 108 615 (37%) novel variants in the Vis exome sequence data that were not present in UK10K-GS, 136 617 (47%) novel variants not present in NHLBI, 114 176 (39%) novel variants not present in ExAC, 60 345 (21%) novel variants not present in the 1000Genomes, 34 497 (12%) novel variants not present in dbSNP and a total of 25 430 (9%) novel variants not present in any of the above mentioned datasets (Supplementary Figure 5). The count of novel variants by MAF for comparisons of Vis exome data with UK10K-GS, NHLBI, ExAC, 1000Genomes and dbSNP (Supplementary Table 4) clearly shows that the majority of novel variants belong to the rare allele frequency spectrum (up to MAF 0.02). 1 As expected, the greatest number of singletons and doubletons are found in dbSNP (78%), followed by comparison with ExAC (58%), then 1000Genomes (54%), NHLBI (43%) and UK10K-GS (31%). A larger proportion of common allele novel variants (~40% of all common variants) was identified through comparisons with NHLBI or ExAC datasets as these variants are mostly intronic (⩾69%) and are simply not present in datasets, which are limited to exome sequence only. We can also see that there are very few common novel variants identified through comparison with 1000Genomes (0.01%) and dbSNP (0.001%), which are databases with a very good capture of common variation (Supplementary Table 4). Owing to the intronic regions within the exome calling for the Vis sample, the largest effect category group of novel variants in all five comparisons is intronic, followed by either non-synonymous (UK10K-GS, 1000Genomes, dbSNP) or non-coding RNA (NHLBI, ExAC) groups (Supplementary Figure 5). A detailed presentation of the number of both summarised and the full set of functional effects of novel variants by MAF can be found in Supplementary material (Supplementary Tables 5 and 6; Supplementary Figures 6 and 7).
A detailed view of the allele frequency/functional effect distributions of completely novel variants (Table 1 and Figure 2) show that 79% of these were singletons and an additional 17% were rare or low frequency, highlighting the ability of WES to identify rare and lowfrequency variants. On the contrary, only 59 completely novel variants

Exome sequencing in Vis
A Jeroncic et al (0.23%) were common suggesting that the discovery of a large number of common novel variation through exome sequencing is highly unlikely. Of completely novel variants, 52% were intronic, 18% were non-synonymous, whereas percentages of non-coding RNA, synonymous and UTR variants were comparable (from 7 to 9%) indicating that in comparison to distribution of all variants, number of completely novel synonymous variants is substantially decreased (Figures 1 and 2).
As expected, the largest proportion of LoF variants in Vis belonged to the singleton category, whereas the proportions of doubleton, low frequency and common variants were comparable (Supplementary Figure 2a). Compared to LoF variants in UK10K_GS sample, there was a depletion of LoF singletons (two-proportion z-test, P = 0.002), overall rare LoF variants (P = 1 × 10 − 6 ) in Vis, and an excess of low frequency (P = 1 × 10 − 5 ) and common variants (P = 0.016) (Supplementary Figure 2b). A similar finding is observed when we compared allele frequency distributions of all variants shared by the two datasets or when we re-sampled the same number of individuals from each population and compared distributions of shared LoF variants.
Of the mapped genes containing LoF variants (n = 1451), 1355 and 786 genes were found in at least one GO-term category or one pathway, respectively. Pathways that were significantly overrepresented included 52 identified genes, of which the vast majority were related to xenobiotic-(metabolization or activation; 27 genes or 52%) and/or to lipid metabolism (30 genes or 58%) ( Table 2). The Table 2 Genes with LoF variantssummary of predicted gene product function and location using gene ontology terms and pathway analysis most significant over-represented GO terms were: ion binding (GO:0043167), hydrolase (GO:0016787) or oxidoreductase activity (GO:0016491), with most genes assigned to these three categories being associated with enzyme activity (61%). We analysed in more detail a subclass of LoF variants exhibiting high-MAF variability across Vis and the 1000Genomes Project super populations. In total, we identified nine LoF variants belonging to this subclass (Table 3). Unlike the majority of LoF variants that are expected to be rare and to have limited geographic distribution, the allele frequency of these variants ranged from rare to common across different populations.

LoF variants in ROH regions
The distribution of LoF variants across an exome with respect to ROH hotspots and coldspots regions has shown an increase in frequency of LoF variants in hotspots, which was significant at α = 0.1 (OR 1.18, P = 0.083). The pathway and GO-term analyses of genes containing the hotspot LoF variants have shown intriguing results. Of the 92 mapped genes carrying a LoF variant in a ROH hotspot region, 59% were assigned to over-represented GO terms, all of which were exclusively membrane-related, either by the location, or the membrane-related function or process (Table 4). Moreover, with regard to processes or functions assigned to these terms, as a rule they were included in some sort of cell's response to either internal or external signal. Similar was found with the pathway analysis, which showed that of 48 mapped genes 53% belonged to the over-represented pathways that were mainly related to cell's response to environmental signals, such as xenobiotics, odorant molecules or allografts. Majority of identified genes in over-represented families (76%) belonged to cytochrome P450 (CYP) or solute-carrier (SLC) gene superfamilies, olfactory receptors (OR) or to HLA gene family.

DISCUSSION
The purpose of this work was to provide a comprehensive insight into the exomic structure of the isolated population of the Adriatic island of Vis and to compare it with the UK10K-GS exome dataset, and with reference databases (NHLBI, 1000Genomes, dbSNP and ExAC). Our data support the findings that the population of Vis is a true genetic isolate. In comparison with the UK10K-GS exomes, we see a depletion of rare and an excess of low frequency and common variants, which is suggesting that the population of Vis had undergone a bottleneck in which the majority of rare variants have vanished but those that did stay in the population have risen in frequency. We see there is a burden of potentially deleterious variants in a low and common allele frequency groups in Vis compared to UK10K-GS (Supplementary Figure 2b). This is in line with several recent studies that identified an excess of deleterious variants in the low-frequency range. [16][17][18][19][20] Also, it was already shown that there are more deleterious variants in European than in African populations due to a long bottleneck effect during out of Africa migrations. 21 Finally, although only unrelated individuals were included in our study, when Vis sample was compared to the UK10K-GS sample, it had higher prevalence of individuals affected with long ROH regions and had higher F ROH 45Mb index. In fact, the value of F ROH index in Vis was similar to the value observed for unrelated sample of Orkney inbread population. 14 Recent population genetics studies indicate that rapid growth increases the load of rare variants 22,23 and likely plays a role in the individual genetic burden. 23 Furthermore, it was demonstrated in a simulation study that population growth dramatically increases the number of deleterious sites in the population and increases the deleterious burden carried by each individual by~6%. 24 Following these findings, one might expect that a population with slower population growth may have a relatively smaller load of rare variants and deleterious sites. Thus, the slightly lower average number of LoF variant genotypes per Vis genome compared to UK10K-GS genome could, at least in part, be due to limited population growth of the old isolate on Vis, 3 compared to the UK10K-GS population.
Estimates on average abundance of LoF genotypes per genome in Vis (115) or UK10K-GS (122) were somewhat larger than the value of 110 estimated from the pilot phase of the 1000Genomes Project. 25 The pilot-phase value was based on a highly curated LoF variant list and assessed around 100 LoF variants per individual with~20 variants in homozygous sites, 15 suggesting slight over estimation of individual number of LoF genotypes in our sample, possibly due to a less strict filtering procedure.
We detected a~6% difference in the average per genome number of LoF variant genotypes between Vis and UK10K-GS. This small difference persisted (~6%) after singleton and doubleton variants (those most affected by differences in sample size 22 and also the most abundant LoF variants) were removed in order to investigate the effect of sample size differences on estimates of individual genetic burden in Vis (n = 176) and UK10K-GS (n = 377).
Genes containing LoF variants in Vis were over-represented in the xenobiotic-and lipid-metabolism pathways. Having in mind that the identified lipid pathways are actually involved in the biosynthesis of known endogenious xenobiotics (steroids, eicosanoids, fatty acids), over-represented genes in Vis were almost entirely xenobiotic-related suggesting that LoF mutations may be accumulating in genes controlling the cell's reaction to foreign molecules from the environment. The finding is in line with a most recent view that gene losses are not necessarily evolutionary disadvantages, but can also contribute to selective advantage in humans. 26 In bacteria, in constant nutrientlimited environment, LoF mutations have been shown to enhance fitness by disproportionately affecting enzymatic and regulatory pathways, 27 similar to what was found in our study. In addition, over-represented genes found within ROH hotspot regions in Vis were predominantly involved in some sort of external or internal signalling.