The Korea Biobank Array: Design and Identification of Coding Variants Associated with Blood Biochemical Traits

We introduce the design and implementation of a new array, the Korea Biobank Array (referred to as KoreanChip), optimized for the Korean population, and demonstrate findings from genome-wide association studies (GWAS) of blood biochemical traits. KoreanChip comprises >833,000 markers, including >247,000 rare or functional variants derived from sequencing data of >2,500 Koreans. Of the 833 K markers, 208 K functional markers are directly genotyped. In particular, >89 K markers are present in East Asians. KoreanChip achieved high imputation performance owing to its genomic coverage of 95.38% for common and 73.65% for low-frequency variants. In a GWAS of 6,949 individuals, 28 associations were successfully recapitulated. Moreover, 9 missense variants were newly identified, including new associations between a common population-specific missense variant, rs671 (p.Glu457Lys) of ALDH2, and two traits: aspartate aminotransferase (P = 5.20 × 10^−13) and alanine aminotransferase (P = 4.98 × 10^−8). Furthermore, two novel missense variants of GPT that are rare in East Asians and extremely rare in other populations were associated with alanine aminotransferase (rs200088103; p.Arg133Trp, P = 2.02 × 10^−9 and rs748547625; p.Arg143Cys, P = 1.41 × 10^−6). These variants were successfully replicated in 6,000 individuals (P = 5.30 × 10^−8 and P = 1.24 × 10^−6). The GWAS results suggest that KoreanChip, with its substantial number of damaging variants, is a promising tool for identifying new population-specific disease-associated rare or functional variants.

Four hundred healthy volunteers were recruited to participate in the Korean Reference Genome project, and peripheral blood samples were collected. Written informed consent was obtained from all participants. Genomic DNA from the participants was whole-genome sequenced using the Illumina HiSeq 2000 platform. Raw reads were aligned to the hg19 reference genome using the Burrows-Wheeler Aligner (BWA) with default parameters [1]. PCR duplicates were removed using Picard [2]. The resulting BAM files were pre-processed using the Genome Analysis Toolkit (GATK) [3]: IndelRealigner was used to realign reads near short indels, and base quality scores were recalibrated. The pre-processed BAM files were analyzed to identify variants using the SNP calling pipeline and LD-aware calling of the Genomes on the Cloud (GotCloud) pipeline [4]. Based on pairwise identity-by-state analysis of the 400 samples, 397 unrelated samples were retained for further analysis. Average read depth per sample ranged from approximately 10x to 30x. As a result, about 20,700,000 variants were discovered.
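The relatedness filter described above rests on pairwise identity-by-state (IBS) sharing. The following is a minimal sketch of that calculation, not the pipeline actually used in the study. Genotypes are assumed to be coded as 0, 1, or 2 copies of the alternate allele; the function name `ibs_similarity`, the toy genotypes, and the use of `None` for missing calls are illustrative assumptions.

```python
from itertools import combinations


def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state by two samples.

    g1, g2: sequences of genotypes coded 0/1/2 (alternate-allele count).
    None marks a missing call; sites missing in either sample are skipped.
    """
    shared = 0
    sites = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this site
        sites += 1
    if sites == 0:
        raise ValueError("no overlapping genotype calls")
    return shared / (2 * sites)


# Toy genotypes (hypothetical): sample_b is nearly identical to sample_a.
genotypes = {
    "sample_a": [0, 1, 2, 1, 0, 2],
    "sample_b": [0, 1, 2, 1, 0, 1],
    "sample_c": [2, 0, 0, 1, 2, 0],
}

# IBS similarity for every pair of samples.
pairwise = {
    (s1, s2): ibs_similarity(genotypes[s1], genotypes[s2])
    for s1, s2 in combinations(genotypes, 2)
}
```

Pairs whose similarity exceeds a chosen cutoff would be treated as related and one member removed. The cutoff used for the 400 samples is not stated in the text, so none is assumed here.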

-Korean samples from T2D-GENES consortium (n=1,087)
The Type 2 Diabetes Genetic Exploration by Next-generation Sequencing in Ethnic Samples (T2D-GENES) Consortium [5] sequenced approximately 10,000 exomes from five ethnic groups using the Agilent SureSelect Human Exon v2 44M (Agilent Technologies, Santa Clara, CA) at the Broad Sequencing Center. A portion of the samples came from the KARE project [6], comprising 538 unrelated type 2 diabetes samples and 579 control samples; 1,087 samples were retained for further analysis after sample quality control. Sequence data were analyzed using the Picard, BWA, and GATK pipelines [1][2][3]. As a result, 500,821 autosomal variants were obtained from the 1,087 Korean exome-sequenced samples.
-Ansung and Ansan study (n=200) and Cardiovascular disease sequencing study (n=200)
One hundred healthy individuals with normal blood pressure (systolic blood pressure (SBP) 90~119 mmHg, diastolic blood pressure (DBP) 60~79 mmHg) and 100 individuals with high blood pressure (SBP >= 140 mmHg, DBP >= 90 mmHg) were randomly selected from the KARE project [6,7], and 200 cardiovascular disease patients were selected from the Genomics Research in Cardiovascular Disease (GenRIC) [8,9]. Peripheral blood samples were collected from the individuals with written informed consent. Genomic DNA was then extracted from the blood samples, and each DNA sample was used for exome enrichment using the Agilent SureSelect Human Exon v2 44M (Agilent Technologies, Santa Clara, CA). All samples were sequenced using the Illumina HiSeq 2000 platform. All samples were analyzed together using BWA (alignment to the hg19 reference genome), Picard (removal of PCR duplicates), and GATK (realignment, recalibration, and genotype calling) [1-3]. Average read depth was approximately 60x, and approximately 367K variants were identified.
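The blood pressure selection criteria above can be written as a simple rule. This is an illustrative sketch only: the function name `blood_pressure_group` is hypothetical, values are assumed to be in mmHg, and the paired SBP and DBP thresholds are read as a conjunction (both must hold), which the text does not state explicitly.

```python
def blood_pressure_group(sbp, dbp):
    """Assign a sample to the 'normal' or 'high' group, or None if neither.

    Assumption: both the SBP and the DBP condition must be met for a group.
    """
    if 90 <= sbp <= 119 and 60 <= dbp <= 79:
        return "normal"
    if sbp >= 140 and dbp >= 90:
        return "high"
    return None  # intermediate or mixed readings fall outside both groups


# Hypothetical readings as (SBP, DBP) pairs.
readings = [(110, 70), (150, 95), (130, 85), (145, 80)]
groups = [blood_pressure_group(sbp, dbp) for sbp, dbp in readings]
```

Under this reading, individuals with intermediate or mixed values are excluded from both groups, which is consistent with sampling from the two ends of the blood pressure distribution.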

-Korean Children and Adolescents Obesity Cohort Study (n=692)
Study subjects, aged 12 to 15 years, were recruited from the Korean Children and Adolescents Obesity Cohort study [10]. Informed parental consent was obtained for all enrolled children. Genomic DNA was used for exome enrichment using the Agilent SureSelect Human Exon v4 71M (Agilent Technologies, Santa Clara, CA). All samples were sequenced using the Illumina HiSeq 2500 platform. Raw data were analyzed using BWA (alignment to the hg19 reference genome), Picard (removal of PCR duplicates), and GATK (realignment, recalibration, and genotype calling) [1][2][3]. As a result, approximately 726K variants were identified. Average read depth was 67.2x. (Table 1)

The Korea Biobank Array was designed using imputation-aware SNP selection and provides content modules for GWAS, human diseases, and biologically functional variants. There are 833,535 SNP and indel markers on the Korea Biobank Array. As with the UK Biobank Axiom Array, available modules of specific interest in AxiomGD were adopted for the Korea Biobank Array.

Tag SNPs for genome-wide coverage (600,294 markers)
The Korea Biobank Array's core imputation grid consists of approximately 600K genome-wide SNPs shared with the conventional Affymetrix Biobank Array. Markers were selected using Affymetrix's imputation-aware marker selection algorithms, considering the MAF of 7.7M common variants (MAF > 1%) in 2,576 Korean sequencing samples, comprising 397 WGS and 2,179 WES samples.
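The MAF > 1% criterion used to define common variants can be illustrated with a small calculation. This sketch assumes biallelic sites summarized as genotype counts (homozygous reference, heterozygous, homozygous alternate). The function names and example counts are hypothetical and are not taken from the Korean sequencing data, and the sketch does not reproduce Affymetrix's marker selection algorithm.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Minor allele frequency at a biallelic site from genotype counts."""
    n_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if n_alleles == 0:
        raise ValueError("no genotyped samples")
    alt_freq = (n_het + 2 * n_hom_alt) / n_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def is_common(maf, threshold=0.01):
    """Apply the MAF > 1% definition of a common variant."""
    return maf > threshold


# Hypothetical genotype counts per site: (hom_ref, het, hom_alt).
sites = {
    "site_1": (380, 16, 1),   # alt allele count 18 of 794
    "site_2": (395, 2, 0),    # alt allele count 2 of 794
    "site_3": (10, 87, 300),  # alt allele is the major allele here
}

common_sites = [
    name
    for name, counts in sites.items()
    if is_common(minor_allele_frequency(*counts))
]
```

Taking the minimum of the two allele frequencies matters for sites such as `site_3`, where the alternate allele is the more frequent one in the sample.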

Functional contents (35,824 markers)
The Korea Biobank Array contains the following Affymetrix Axiom platform modules of variants, selected on the basis of reported GWAS signals and pharmacogenomic and metabolic phenotypes.

Future plan of Korea Biobank Array project
The Korea Biobank Array (KBA) project will provide the largest East Asian genomic data set containing both directly genotyped common and rare variants. Previously, UK Biobank produced genome data for half a million samples, using the UK BiLEVE array, a prototype, for about 50,000 samples and the UK Biobank array for about 450,000 samples [11]. Those two arrays share about 95% of their contents. Following a strategy similar to that used for UK Biobank genome data production, the initial KBA was regarded as a prototype (v1.0), and an updated KBA (v1.1) was designed by excluding variants with poor genotype clusters and including additional tagging variants for less common variants (MAF 1~5%) and variants on the X chromosome. These two arrays share about 93% of their contents. In the KBA project based on KOGES cohorts, about 50,000 samples were genotyped using KBA v1.0, and genotyping of the remaining approximately 150,000 samples using KBA v1.1 will be completed at the end of 2019.
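The statement that two array versions share a given percentage of their contents can be made concrete as a set comparison. The sketch below is illustrative only: the marker identifiers are invented, and measuring overlap relative to the first array's marker set is an assumption, since the text does not state which denominator was used.

```python
def shared_fraction(markers_a, markers_b):
    """Fraction of the markers on array A that are also present on array B."""
    set_a, set_b = set(markers_a), set(markers_b)
    if not set_a:
        raise ValueError("array A has no markers")
    return len(set_a & set_b) / len(set_a)


# Hypothetical marker lists for a prototype array and its update.
prototype = ["m1", "m2", "m3", "m4", "m5"]
updated = ["m1", "m2", "m3", "m4", "m6", "m7"]

# One prototype marker (m5) was dropped and two new markers were added.
fraction = shared_fraction(prototype, updated)
```

With real manifests, the same comparison would be run on marker identifiers or on chromosome, position, and allele keys, depending on how the two arrays name their content.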
To facilitate genomic research using the Korean chip, the Korean chip consortium was established in June 2016. Since then, about 150 domestic researchers have joined the consortium and have produced genotype data for about 50,000 disease patients using the Korean chip. The consortium is expected to discover numerous genetic variants associated with various diseases in Koreans, such as diabetes, cardiac diseases, and cancers. The discovered variants will provide valuable scientific evidence for precision medicine in Koreans.