iJGVD: an integrative Japanese genome variation database based on whole-genome sequencing

The integrative Japanese Genome Variation Database (iJGVD; http://ijgvd.megabank.tohoku.ac.jp/) provides genomic variation data detected by whole-genome sequencing (WGS) of Japanese individuals. Specifically, the database contains variants detected by WGS of 1,070 individuals who participated in a genome cohort study of the Tohoku Medical Megabank Project. In the first release, iJGVD includes >4,300,000 autosomal single nucleotide variants (SNVs) whose minor allele frequencies are >5.0%.

The set of variants in iJGVD was released from 1KJPN, which was constructed with data from the WGS of 1,070 healthy Japanese individuals in the Tohoku Medical Megabank Project. 12 The 1KJPN subjects were adults (age ≥ 20 years) whose Japanese ancestry was confirmed; close relatives were excluded (see Supplementary Figure 1 for statistics regarding age and sex). All participants gave written informed consent.
In this project, genomic DNA obtained from peripheral blood samples of the 1,070 subjects was subjected to paired-end sequencing on the Illumina HiSeq 2500 platform. All sequencing libraries were constructed using PCR-free methods. 13 The sequence reads were mapped onto the human reference genome, assembly GRCh37/hg19, with decoy sequences (hs37d5); the average sequencing coverage was 32.4× for full-length autosomal chromosomes. Variant calling and subsequent filtering were performed by an in-house bioinformatics pipeline. 14,15 The details of the methods and quality controls are described in Nagasaki et al. 12 Among the total variants in 1KJPN, autosomal SNVs whose minor allele frequencies were >5% were selected. These SNVs were annotated with their corresponding database SNP (dbSNP) IDs, and their effects on gene products were predicted using SnpEff. 16 SNVs were retained only if they were reported in dbSNP138, 3 and the iJGVD release (Version 1.0) included a final set of 4,301,546 SNVs.
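The frequency-based selection step can be illustrated with a minimal sketch. It assumes per-site allele counts are already available; the `Snv` record, its field names and the toy counts are hypothetical and are not taken from the 1KJPN pipeline, and only the >5% threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Snv:
    """Hypothetical per-site record; field names are illustrative only."""
    chrom: str
    pos: int
    ref_count: int  # number of reference alleles observed
    alt_count: int  # number of alternate alleles observed


def minor_allele_frequency(snv: Snv) -> float:
    """Return the frequency of the rarer of the two alleles."""
    total = snv.ref_count + snv.alt_count
    if total == 0:
        raise ValueError("no called alleles at this site")
    return min(snv.ref_count, snv.alt_count) / total


def select_common(snvs, threshold=0.05):
    """Keep SNVs whose minor allele frequency is strictly above the threshold."""
    return [s for s in snvs if minor_allele_frequency(s) > threshold]


# Toy counts (invented): 1,070 diploid individuals give 2,140 alleles per site.
toy = [
    Snv("1", 1000, 2000, 140),  # MAF = 140/2140, retained
    Snv("1", 2000, 2100, 40),   # MAF = 40/2140, dropped
    Snv("2", 500, 300, 1840),   # reference allele is the minor allele, retained
]
common = select_common(toy)
```

Taking the minimum of the two counts means the filter does not depend on which allele happens to match the reference genome.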
The iJGVD system consists of (i) the relational database and (ii) the web server (Figure 1a). The relational database (using MySQL 5.1.73) for iJGVD stores SNV alleles, genomic positions based on the GRCh37/hg19 coordinates, allele frequencies, the corresponding dbSNP IDs, P values for the Hardy-Weinberg equilibrium test, gene annotations and related information. The web server provides functions to search SNVs and to explore the region surrounding an SNV based on chromosome coordinates. The web server and exploration functions were implemented in PHP 5.3.3 and JBrowse 1.11.5, respectively.
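The text does not state which form of the Hardy-Weinberg equilibrium test produced the stored P values. As an illustration only, the sketch below uses the textbook chi-square goodness-of-fit test with one degree of freedom, computed from genotype counts; the function name and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    This is an assumed, textbook formulation; it is not confirmed to be the
    test used for the P values stored in iJGVD.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: nothing to test
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


p_in_equilibrium = hwe_chi_square_p(250, 500, 250)  # counts match expectation
p_no_heterozygotes = hwe_chi_square_p(500, 0, 500)  # strong deviation
```

For sites with small expected genotype counts an exact test is usually preferred over the chi-square approximation.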
Among the 4,301,546 SNVs, 1.72% were located in exonic regions (i.e., untranslated regions or coding regions). The minor allele frequency distribution for the SNVs in iJGVD was examined (Table 1). The SNV counts for each frequency class were not uniform, and the sample was enriched for low-frequency SNVs. We compared the allele frequencies of SNVs in iJGVD with those of SNVs in HapMap3 10 JPT (Japanese from Tokyo) individuals (Figure 2a). The allele frequencies in the two populations were very similar (the correlation coefficient was 0.99). We also tested for statistical differences in allele counts between ToMMo 1KJPN and HapMap3 JPT, and found that only a small fraction (0.022%, 226 out of 1,020,909) of SNVs showed P values of <10^−8 (see Supplementary Figure 2 for QQ-plots). This fraction of SNVs with small P values was very similar to that for the comparison between NGS data and SNP array data in the JPT population (Figure 2b).
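The two comparisons described above can be sketched as follows. The text does not name the test applied to the allele counts, so the 2 × 2 chi-square test here is an assumption, and the function names and toy values are hypothetical; only the figures 226 and 1,020,909 are taken from the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def allele_count_p(a1: int, b1: int, a2: int, b2: int) -> float:
    """P value of a 1-df chi-square test on a 2 x 2 table of allele counts.

    a1, b1: counts of alleles A and B in cohort 1; a2, b2: the same in cohort 2.
    The choice of test is an assumption made for illustration.
    """
    n = a1 + b1 + a2 + b2
    row1, row2 = a1 + b1, a2 + b2
    col_a, col_b = a1 + a2, b1 + b2
    if 0 in (row1, row2, col_a, col_b):
        return 1.0  # degenerate table: no difference can be tested
    chi2 = n * (a1 * b2 - a2 * b1) ** 2 / (row1 * row2 * col_a * col_b)
    return math.erfc(math.sqrt(chi2 / 2.0))


# Toy allele frequencies for three sites in two cohorts (invented values).
r = pearson_r([0.10, 0.20, 0.30], [0.10, 0.20, 0.30])

p_same = allele_count_p(100, 100, 50, 50)       # identical frequencies
p_different = allele_count_p(2000, 140, 20, 150)  # clearly different frequencies

# Fraction of tested SNVs below the threshold, using the counts in the text.
percent_small_p = 100 * 226 / 1_020_909
```

The last line reproduces the arithmetic behind the reported 0.022%.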
SNVs in iJGVD can be searched by specifying the gene symbol, rsSNP ID, or genomic position (Figures 1b and c). Hits are displayed in a table of SNVs with allele frequencies in sequential order based on their genomic coordinates. The table can be downloaded as a text file by clicking 'Download Table.' SNVs can also be queried using the genome browser by specifying the chromosome and genomic position. The genome browser (Figure 1d) provides graphical views of the genomic location of SNVs with locations of known genes and other SNVs in dbSNP.
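The three search modes can be sketched against a small in-memory table. This is an illustration under stated assumptions: SQLite stands in for the MySQL database, and the table name, column names, identifiers and rows are all hypothetical placeholders, not the iJGVD schema or real variants.

```python
import sqlite3

# In-memory stand-in for the relational database; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE snv (
           rs_id TEXT, chrom TEXT, pos INTEGER,
           ref TEXT, alt TEXT, alt_freq REAL, gene TEXT)"""
)
# Placeholder rows (fictitious identifiers, genes and frequencies).
conn.executemany(
    "INSERT INTO snv VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("rs_demo2", "1", 2000, "C", "T", 0.40, "GENE_A"),
        ("rs_demo1", "1", 1000, "A", "G", 0.12, "GENE_A"),
        ("rs_demo3", "2", 500, "G", "A", 0.07, "GENE_B"),
    ],
)

COLUMNS = "rs_id, chrom, pos, ref, alt, alt_freq, gene"


def search_by_gene(db, symbol):
    """Return SNVs annotated to a gene, ordered by genomic coordinate."""
    sql = f"SELECT {COLUMNS} FROM snv WHERE gene = ? ORDER BY chrom, pos"
    return db.execute(sql, (symbol,)).fetchall()


def search_by_rsid(db, rs_id):
    """Return the SNV rows that carry the given identifier."""
    sql = f"SELECT {COLUMNS} FROM snv WHERE rs_id = ?"
    return db.execute(sql, (rs_id,)).fetchall()


def search_by_region(db, chrom, start, end):
    """Return SNVs within an inclusive coordinate range on one chromosome."""
    sql = (
        f"SELECT {COLUMNS} FROM snv "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos"
    )
    return db.execute(sql, (chrom, start, end)).fetchall()


gene_hits = search_by_gene(conn, "GENE_A")
region_hits = search_by_region(conn, "1", 900, 1500)
```

All values are passed as bound parameters rather than interpolated into the SQL string, which is the usual precaution for a query built from web form input.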
We constructed a public database of genomic variants with allele frequencies for the Japanese population. Variant databases for the Japanese population to date have been based on targeted SNP typing 6 or whole-exome sequencing. 17 iJGVD is the first database of genomic variants for Japanese individuals based on high-coverage WGS. A set of variants and the corresponding frequency information from WGS would provide a comprehensive platform for finding disease-causing variants, because such variants can also lie in non-coding regions. The allele frequencies of SNVs in iJGVD and in the HapMap3 JPT population are highly correlated (Figure 2b). Furthermore, our database contains allele frequencies for more than three million additional high-quality SNVs that were not genotyped in the HapMap3 project. We recently designed a genotyping chip, 'Japonica Array', which was optimized for the Japanese population, 18 and probes for autosomal SNPs on Japonica Array can be seen in iJGVD.
We plan to improve the usefulness of iJGVD by adding biological annotations for SNVs and expanding search options using these annotations. Furthermore, linkage disequilibrium information will be considered as additional data. Although iJGVD contains only SNV information at present, insertions, deletions and other structural variants will be included after quality control processes are implemented. We believe that our open variant data will be useful in medical genomics, especially for comparing allele frequencies in iJGVD with those of a patient group for a target disease to identify disease-causing variants.
All SNV frequency data in iJGVD are available from the National Bioscience Database Center Human Database (http://humandbs.biosciencedbc.jp/) under accession hum0015.