Korean Variant Archive (KOVA): a reference database of genetic variations in the Korean population

Despite efforts to interrogate human genome variation through large-scale databases, systematic preference toward populations of Caucasian descendants has resulted in unintended reduction of power in studying non-Caucasians. Here we report a compilation of coding variants from 1,055 healthy Korean individuals (KOVA; Korean Variant Archive). The samples were sequenced to a mean depth of 75x, yielding 101 singleton variants per individual. Population genetics analysis demonstrates that the Korean population is a distinct ethnic group comparable to other discrete ethnic groups in Africa and Europe, providing a rationale for such independent genomic datasets. Indeed, KOVA conferred 22.8% increased variant filtering power in addition to Exome Aggregation Consortium (ExAC) when used on Korean exomes. Functional assessment of nonsynonymous variant supported the presence of purifying selection in Koreans. Analysis of copy number variants detected 5.2 deletions and 10.3 amplifications per individual with an increased fraction of novel variants among smaller and rarer copy number variable segments. We also report a list of germline variants that are associated with increased tumor susceptibility. This catalog can function as a critical addition to the pre-existing variant databases in pursuing genetic studies of Korean individuals.


Results
Quality control. To establish a genetic database of healthy Korean individuals, WES data from normal tissues from cancer patients (675 samples; 472 blood and 203 adjacent normal tissues) and blood samples from healthy individuals with no apparent clinical history (380 samples) were collected. Clinical characteristics of the participants are summarized in Supplementary Table S1. After removing duplicated samples and cryptic relatives, 1,055 individuals remained for further analysis. The mean coverage depth of the runs was 75x (Supplementary  Table S1). In total, we have identified 293,049 variants after vigorous filtering, which constituted the Korean variant archive (KOVA).
The variants were covered 41x on average and mean genotype quality was above 66 (Supplementary Fig. S1a and b), assuring good variant calling quality. As further quality control steps, we checked transition-to-transversion and hetero-to-homozygosity ratios of the variants (Supplementary Fig. S1c and d and Fig. S2), both of which were consistent with the previous report 16 . Since our data set is a collection from five independent groups with different capture and sequencing platforms, careful exclusion of any systematic bias was necessary. Therefore, we checked for the presence of inter-group biases using several criteria. First, the number of called variants per each individual was approximately ~42 K on average and the distribution was almost similar except for individuals from one group with a slightly lower mean value of ~39 K ( Supplementary Fig. S3a), which might be due to using different exome capture kits (Supplementary Table S1). Next, we examined the profile of singleton variants that were seen only once in the set. The number of singletons per each individual was 101 on average and its distribution was stable across different groups ( Supplementary Fig. S3b).
Basic KOVA features and population genetics. Identification of Korean-specific variations and cataloging their frequencies are among the main purposes of the study. Comparison with the dbSNP database (build 147) showed that 205,002 (70.0%) variants were known and 88,047 (30.0%) variants were novel. Distribution of minor allele frequency (MAF) showed enrichment of novel variants in rare variants (45.6% of MAF ≦ 0.01) and the highest in the singletons (61.2%), consistent with previous reports (Fig. 1a, Supplementary Table S2) 5,6 . Common variants of MAF > 0.05 were mostly known variants reported in dbSNP (99.8%). Simulation analysis demonstrates that the common variants reported in the 1000GP are rapidly saturated as the number of KOVA samples increases, suggesting that we are capturing almost all Korean common coding variants using this set, while rare or singleton variants accumulate with minimal overlaps as individual number increases (Fig. 1b). Therefore, KOVA can confer additional contributions to variant filtering in addition to the pre-existing databases when performing human genetics studies as confirmed by a simulation study using an independent Korean WES data set (22.8% additional reduction conferred to the ExAC-filtered variants; Supplementary Fig. S4).
To better understand the relationship of variant profiles between Korean and other populations, we compared the KOVA variants with those from other whole exome or genome sequencing based population studies. Comparison with the UK10K data would delineate the difference between Caucasian and Asian populations, and comparison with the Japanese whole genome sequencing (1KJPN) data is expected to contrast the subtle difference between Korean and Japanese populations 6,8 . Only 36.7% of KOVA variants (23,653/64,428) were commonly found in all three populations, indicating substantial differences by ethnic background (Fig. 1c and Supplementary Fig. S5). From this comparison, the number of UK10K-specific variants were the largest with 397,316 variants, which is expected given its large cohort size. As we applied strict variant filtering strategy, we noted that the number of KOVA-specific variants is markedly lower than that of the Japanese population. Principal component analysis (PCA) revealed that KOVA co-clustered with East Asians (Supplementary Fig. S6a) and located between Northern Chinese and Japanese (Fig. 1d). The PCA plot was in good agreement with the geographical locations of corresponding ethnic groups (Fig. 1d). As expected, fixation index (F ST ) analysis, a parameter of population differentiation, revealed the closest relationship as Korean to Japanese and Chinese over African, European and Southeast Asian ( Fig. 1e and f and Supplementary Fig. S6b and c). Having large genome-wide variant information, we asked how distinct is the Korean population compared to East Asian neighbors and to among the ethnic groups in African or European continents. Calculating F ST among multiple population groups from 1000GP and KOVA reveals a close genetic relationship within each super-population and that KOVA is closely related to EAS in the 1000GP as expected (Fig. 1g). One of the critical questions that can be addressed by this study is whether Korean population is genetically distinct from its neighboring populations, which will provide the rationale for constructing its own variant database. In this respect, we noted that the degree of closeness between KOVA and other East Asian populations, as scored by weighted F ST , is comparable or larger than those between African or European populations ( Fig. 1g-j), demonstrating that the Korean population is a distinct ethnic group among EAS as those from other continents.
Functional analysis of coding variants. Next, we analyzed the functional impact of 64,428 coding variants in KOVA ( Table 1). The portion of novel coding variants was 14.2% and most of the exonic variants were SNVs (95.2%). Short insertions and deletions (indels) are predominantly smaller than seven bases (93.8%), and coding indels are enriched in multiples of three bases, consistent with previous findings (Fig. 2a and Table 1). The novel-to-known ratio is relatively high in functionally significant variants such as frame-shift indels, stop gains and stop losses presumably due to purifying selection. The nonsynonymous-to-synonymous SNVs (N/S) ratio was relatively high in rare variants (Fig. 2b). The nonsynonymous variants tend to display more damaging or pathogenic scores as variants become rarer in the population (i.e., reduced SIFT score, increased PolyPhen-2, scaled C-score (CADD score) and PhyloP scores) ( Fig. 2c and d and Supplementary Fig. S7) [17][18][19][20] , again implying increased variant functionality as they become relatively rare. One important parameter of nonsynonymous variant functionality is how an amino acid residue is conserved throughout evolution. Therefore, we counted the ratio of vertebrate species with different amino acids on orthologous proteins from human as a measurement of evolutionary conservation. The ratio increased (i.e., weaker conservation) for nonsynonymous variant residues as allele frequency increases, whereas the ratio remained consistently low (i.e., strong conservation) for synonymous ones (Fig. 2e). This result suggests that nonsynonymous variants, especially common ones, tend to occur in less well conserved residues, mostly escaped from functional restrictions conferred by amino acid changes. The mean of 117 heterozygous loss-of-function (LoF) variants were detected per individual. A majority of LoF variants are rare (85.3% of all LoF variants are MAF < 1%), and more than half of LoF are singletons (55.8%). Consistent with the observation from a nonsynonymous variant conservation pattern, the relative positions of stop gain variants on proteins were biased to the C-terminal end (Fig. 2f) and most of the stop gain variants were found in genes with tolerable pLI (probability of being LoF intolerant) values (<0.2) 5 . Although the participants of the study are healthy, we compared the KOVA variants to the ClinVar list to test whether there are any carriers of reported pathogenic variants ( Supplementary  Fig. S8). A number of "pathogenic" variants that reach high allele frequencies were detected, supporting the rising concern that a certain portion of previously-tagged pathogenic variants may not be truly pathogenic (26 variants are MAF > 10% in both KOVA and ExAC, Supplementary Fig. S8).

Copy number variations in KOVA.
Although WES is not designed for copy number variation (CNV) detection, the large sample size and recent improvements of bioinformatics tools allow stable CNV detection and analysis 5 . After the initial QC filtering steps, 944 samples remained and were used for CNV analysis. Among the 944 Korean individuals, we found a total of 14,600 putative CNV segments. The average number of CNVs per individual was 10.3 and 5.2 for the amplifications and deletions, respectively (Supplementary Table S3). Most of the called CNV segments were shorter than 10 kb (i.e. segments covering two or three exons), as larger CNVs tend to pose higher odds of conferring pathogenicity (Fig. 3a). We compared our CNV segments with control list of CNVs from the database of genomic variants (DGV) (see Materials and Methods). About 10% of all KOVA CNV segments were not found in DGV and were called as novel. These novel CNV calls tend to be smaller in size (78.1% smaller than 10 kb) and found in rare frequencies ( Fig. 3a and b). Chromosome 19 was known to display the highest gene density among the human chromosomes 21 , and contained the highest numbers of CNV calls after adjusting for size, consistent with other WES-based CNV studies ( Supplementary Fig. S9) 22 . To find highly copy number polymorphic genes in the Korean population, we counted the frequency of copy number changes by sorting genes by sum of amplification and deletion frequencies (Fig. 3c). The list includes well-known polymorphic loci such as amylase and HLA genes, and SIGLEC14 was the most highly copy number polymorphic locus while SIGLEC5, which is located just upstream of SIGLEC14 is mostly copy-neutral ( Fig. 3d and Supplementary Figs S10 and S11). A previous study reported that deletion polymorphism of SIGLEC14 is higher in Asians compared with Africans and Europeans 23 . We also observed this Asian-specific high frequency of SIGLEC14 deletion in DGV, although its functional implication remains elusive (Fig. 3e). Finally, a subset of KOVA subjects (n = 208) were also analyzed by SNP arrays and their CNV profiles are well-correlated with WES-based KOVA profiles (Correlation = 0.43, Supplementary Fig. S12).
Potential role of rare germline variants on tumor susceptibility. Having tumor-paired normal samples in our cohort provided an opportunity to explore the potential role of rare germline variants in cancer development. After selecting 54 and 72 rare variants for lung and stomach cancers (Supplementary Tables S4 and S5), that were enriched in "tumor-paired normals" compared to "healthy normals" (see Materials and Methods), we observed a similar difference of allele frequencies between the Cancer Genome Atlas (TCGA) and 1000GP normal datasets (P = 0.018 and P = 0.003 for lung and stomach adenocarcinomas, respectively; Fig. 4). Our results indicate that certain SNVs in our tumor-derived cohorts may potentially function as predisposing factors for tumorigenesis.

Discussion
To our knowledge, a WES-based genomic variant catalog from Korean individuals of this size has never been previously reported. We have demonstrated the rationale of constructing an independent Korean genomic database by showing that genetic distances between KOVA and other East Asian ethnic groups are comparable or even farther than those between the ethnic groups of Africa or Europe (Fig. 1g-j). In East Asia, one of the most genetically similar ethnic groups to Koreans is Japanese, which is well supported by historic evidence that people have colonized the Japanese archipelago through the Korean Peninsula about 40,000 years ago. The high-quality coding variants that are predicted to change protein sequences followed signatures of purifying selection. Although we need to further investigate functional implications and multi-ethnic comparative profiling of Korean copy number variations, we called 14,600 CNVs and demonstrate that they also followed the restrictions posed by purifying selection. We also propose a group of rare functional variants that may regulate cancer susceptibility and validated their consistent behaviors using European-based TCGA and 1000GP databases. Further validation of these variants is required with larger independent cohorts and performing functional analyses.
There are limitations in this study, one of which is our strategy of applying stringent variant filtering criteria, which resulted in calling fewer variants compared to others. As the number of rare variants continued to increase as we added more samples (Fig. 1b), we are still limited in covering rare variants using this cohort size. Nevertheless, this study cataloged the largest healthy Korean cohorts and we found that most of the common coding variants were well covered by this set.
WES-or WGS-based sequencing efforts have become more commonplace over recent years and will continue to do so in the near future, thus we anticipate that expanding KOVA with new participants will ensure that the archive remains a valuable database for pursuing disease-based, population or evolutionary genetic studies of Korean individuals.

Methods
Cohorts and sample preparation. We collected WES data of Korean individuals from five independent research groups. All sequencing data were obtained from normal tissues or blood samples following standard protocols (Supplementary Table S1). This project was performed with approval of the Institutional Review Board of each group (Seoul National University, Ewha Womans University, Asan Medical Center, and Samsung Medical Center), in which all donors provided written informed consent. All the experiments were performed in de-identified status and in accordance with relevant guidelines and regulations. Variant calling and filtering. The raw sequencing data were analyzed with in-house pipeline ( Supplementary   Fig. S13) to combine data from different exome capture platforms. Briefly, BWA (version 0.7.5a) was used to map short reads to hg19. GATK (version 2.4-7) was used for local realignment and recalibration after duplicate marking with Picard (version 1.93) 24 . GATK UnifiedGenotyper was used to call variants across all samples simultaneously in the multi-sample calling mode, which allowed us to distinguish whether no variant calls indicated homozygous reference or missing calls due to low coverage. The quality score was further recalibrated using GATK's VQSR model. To obtain a reliable list of variants suitable for population genetics studies, we applied extensive filtering steps as per the following: The minimum genotype quality and depth of coverage were set to 30 and 10, respectively. Then variants with missing genotypes in more than 30% of all individuals were excluded from further analyses. We also removed variants that violate Hardy-Weinberg equilibrium on allelic frequency (P < 10 −6 ).   26 . F ST was estimated with vcftools using-weir-fst-pop option 25,27 . Variant level analysis was performed with a 0.5 million bp window size (option:-fst-window-size 500000) and 0.5 million bp step size (option:-fst-window-step 500000). Windows containing less than three variants were excluded from the subsequent analyses. For gene level analysis, we estimated F ST for individual variants and then assigned them to each gene according to intervals annotated in the GTF file of GENCODE 19 28 . Genes with over ten variants were included in the analysis.  Copy number variation analysis. We called CNVs with CODEX software using default settings 31 . To adjust for possible variations derived from different exome capture kits, we applied the algorithm to each group separately and then combined the results. Known CNVs were downloaded from the DGV (Database of genomic variants, http://dgv.tcag.ca) and calls from 2009 and onwards that were generated using WGS or SNP array platforms were selected for subsequent analyses. The overlap between the DGV variants and KOVA CNV segments were calculated with bedtools using −r 0.5 option (i.e., 50% overlap) 32 . Notable CNV segments were manually checked using the Integrative Genomics Viewer (IGV). ExAC copy number data were downloaded from the ftp site (release 0.3.1, ftp://ftp.broadinstitute.org) 5,22,33 . DGV dataset ID:gssvL59302 was used for global SIGLEC14 copy number profile.
In order to validate the KOVA CNV from an independent platform, we performed CytoScan HD array scanning on 208 samples. Briefly, raw CEL files were processed by apt-copynumber-cyto (1.18.2) from Affymetrix Power Tools (APT). Details are described in APT manual (http://media.affymetrix.com/support/developer/powertools/changelog/apt-copynumber-cyto.html). Then the KOVA absolute copy number and array signal intensity were compared after exclusion of non-overlapping probes.
Assessing roles of rare germline variants on tumor susceptibility. The SNVs from the tumor-paired normal individuals ("tumor-paired normal", 364 and 76 samples for lung and stomach adenocarcinoma, respectively) and healthy normal individuals ("healthy normal", 134 samples) were filtered from a single collective VCF file using the VCFtools-keep option 25 . MAF of the SNVs from tumor-paired normals was compared to that of the SNVs from healthy normals. Protein-altering SNVs that exhibited MAF < 0.01 and 1.5 fold greater MAF in the tumor-paired normals than the heathy normals were selected for further analysis. These SNVs were annotated using ANNOVAR for four functional prediction parameters (SIFT, PolyPhen-2, MutationAssessor and GERP++) 30,34 . The SNVs that were predicted to be deleterious in two out of the four prediction tools were selected as potentially interesting candidates of tumor susceptibility.
For a validation test, the tumor-paired normal data of the lung (LUAD, n = 229) and stomach (STAD, n = 137) adenocarcinomas from TCGA was obtained after excluding non-whites using the analysis pipeline outlined in Supplementary Fig. S13. The SNV data from the 504 East Asian (EAS) and 503 European (EUR) individuals in 1000GP were obtained as independent healthy normal cohorts using VCFtools to compare with tumor-paired normals in KOVA and TCGA, respectively 3, 4, 25 . The selected SNVs from the discovery phase was compared with those of EAS individuals in 1000GP and between those of TCGA and EUR individuals in 1000GP, and the significance was assessed using Wilcoxon's rank sum test.