Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals

The Tohoku Medical Megabank Organization reports the whole-genome sequences of 1,070 healthy Japanese individuals and construction of a Japanese population reference panel (1KJPN). Here we identify through this high-coverage sequencing (32.4 × on average), 21.2 million, including 12 million novel, single-nucleotide variants (SNVs) at an estimated false discovery rate of <1.0%. This detailed analysis detected signatures for purifying selection on regulatory elements as well as coding regions. We also catalogue structural variants, including 3.4 million insertions and deletions, and 25,923 genic copy-number variants. The 1KJPN was effective for imputing genotypes of the Japanese population genome wide. These data demonstrate the value of high-coverage sequencing for constructing population-specific variant panels, which covers 99.0% SNVs of minor allele frequency ≥0.1%, and its value for identifying causal rare variants of complex human disease phenotypes in genetic association studies.


Introduction
This section gives technical information that is not listed in the Method section of the main manuscript.

DNA preparation
Genomic DNA was extracted from buffy coats using the Gentra Puregene Blood Kit with the AutoPure LS automated DNA extraction robot (Qiagen). We followed the manufacturer's instructions, except that the RNase treatment step was omitted. The concentrations of doublestranded DNA were quantified using the PicoGreen dsDNA quantitation assay (Life Technologies), and adjusted to a concentration of 200 ng/μL with the Elution buffer (Qiagen). The DNA samples were stored at 4°C until they were used.

Sample management
The laboratory information management system (LIMS) was developed in-house and used for managing DNA samples, participant information (sex, age, and the type of blood group antigens), and genotype data (the results of sex checks and SNPs in the ABO gene). The ABO blood typing for the erythrocytes from the corresponding cohort participants was performed with anti-A and anti-B monoclonal antibodies (Sysmex, Japan). The genotyping of the ABO blood type loci was done with SNPs (rs8176719 and rs8176720), which were called using whole genome sequencing (WGS) (described below) and confirmed using real-time PCR 2 . Primer sequences (5'

Quality control of the SNP data
One hundred and sixty nanograms of genomic DNA were analyzed with the HumanOmni2.5-8 v1.1 DNA Analysis Kit (Illumina), following the manufacturer's instructions. Briefly, genomic DNA was subjected to isothermal amplification followed by fragmentation with nucleases. The DNA was precipitated with 2-propanol, then hybridized with oligonucleotide probes immobilized on HumanOmni2.5-8 BeadChips (8 samples per BeadChip slide). After washing, probes underwent single-base extension using the captured genomic DNA as a template, and incorporating 2, 4dinitrophenyl-or biotin-labeled nucleotides to identify the genotypes. Then, immunohistochemical staining was performed to amplify the incorporated signal. Two Robotic Universal modules (Freedom evo, TECAN, Maennedorf, Switzerland) and the Illumina Infinium LIMS system (Illumina) were used for a series of experiments. An iScan scanner system, with an AutoLoader 2.X controlled by the iScan Control Software (ver. 3.3.28: Illumina), was adopted for the data acquisition. Each SNP call was obtained using the Genotyping Module in the GenomeStudio software (ver. 2011.1: Illumina). The default set cluster file was HumanOmni2-5M-8b1-1_B.egt (Illumina). The genotyping calls were based on a clustering algorithm, and classification was performed using the Bayesian model 3 . When the genotyping score for a SNP call was below 0.15, the corresponding locus was excluded from further analysis. After having calculated the SNP call rates, the individuals with overall call rates above 99%, and with a standard deviation of the log R ratios for the autosomal SNPs below 0.2, were used for further analysis.

Identification of cryptic relatedness in the cohort sample
The SNP-QC and the identification of cryptic relatedness were carried out with the PLINK software (ver. 1.07) 4 . We extracted SNPs using the following criteria: a Hardy-Weinberg Equilibrium test (HWET) with a p-value > 0.05, a minor allele frequency (MAF) > 0.05, and a missing data rate per locus (lmiss) < 0.01.
Closely related individuals within our cohort were identified based on an identity-by-descent (IBD) estimation. Before calculating the estimated IBD value (referred as PI-HAT in PLINK) between each pair of individuals, we carried out an LD-based pruning to extract a set of SNPs that were in nearly linkage equilibrium, using the PLINK option --indep-pairwise 200 4 0.1, as follows: SNPs within a 200-SNP window were removed one by one, until no squared correlation between the SNPs in the window was >0.1, and the procedure was iterated by sliding the window by 4 SNPs at a time. The IBDs for all pairs of individuals were estimated for the pruned SNP set using the PLINK option --genome, and individuals were removed one by one, until the PI-HAT value for no individual pair was > 0.125.

Population structure analysis
The population structure of the group comprising the selected individuals in the previous section was determined using a principal component analysis (PCA) with the smartpca program in EIGENSOFT 5 . The program was used without the auto-removal function of outlier(s) for the pruned SNP set in the previous section. Outlier individuals in the PCA were removed in the following manner: we repeatedly calculated the PCA and removed the individual with the largest absolute value of the first principal component (PC1) score, until the p-value of the Tracy-Widom statistic for the PC1 value was more than 0.05.
In addition to the above analysis, we examined the population structure of our cohort of selected individuals and HapMap reference panels 6 using a PCA to detect additional outliers. We selected SNPs that are in autosomes and satisfy the following four criteria for further analysis: polymorphic (MAF ≥ 0.05) in the ToMMo individuals, with a per-SNP call rate ≥0.95, with a HWET p-value > 0.001, and genotyped as part of the HapMap project. A PCA was separately applied to the two genotype references: the international HapMap samples and the East Asian samples (JPT and CHB). The SNP sets were pruned by LD (r 2 <0.1) using PLINK, and 57,547 SNPs and 46,809 SNPs remained for these two reference panels, respectively.
We inspected the distribution of the ToMMo individuals in the two-dimensional PCA plots of the first and second PCs, respectively, and five outliers were removed (Supplementary Fig. 1). Furthermore, the population structure of the ToMMo individuals in the Japanese population was examined using a PCAj analysis (a population structure prediction system for the Japanese which is based on the largest analysis of the Japanese population structure) 7 . By using the PCAj analysis, we obtained the distribution of the ToMMo individuals by predicting the top eigenvalues for the ToMMo individuals. Using this analysis, we excluded one additional individual who was outside of the major (Hondo) cluster of the Japanese population 8 .

HLA types of Japanese
The HLA loci on chromosome 6p21.3 together make up one of the most diverse and polymorphic regions in the human genome. Conventionally, HLA types have been determined at the 2-digit resolution (e.g., A*01), which approximates the serological antigen groupings. More recently, the sequence-specific oligonucleotide probes (SSOP) method has been used for HLA typing at the 4digit resolution (e.g., A*01:01), which can distinguish amino acid differences 9 . Here, we employed a sequencing-based approach to HLA typing. Since the sequence-based approach can directly determine both coding and non-coding regions, it can achieve HLA typing at the 8-digit (e.g., A*01:01:01:01) resolution. We predicted the HLA types of the 1KJPN samples with a computational tool, HLA-VBSeq 10 . HLA-VBSeq estimates the most probable HLA alleles at full (8-digit) resolution from whole-genome sequence data. HLA-VBSeq simultaneously optimizes read alignments to the HLA allele sequences and the abundance of reads on the HLA alleles by variational Bayesian inference. HLA typing with HLA-VBSeq was carried out as follows.
First, reads obtained by whole-genome sequencing were aligned to the reference genome (GRCh37/hg19), using decoy sequences (hs37d5) with an alignment tool, BWA-MEM 11 . Second, reads aligned to HLA loci (HLA-A, -B, -C, -DM, -DO, -DP, -DQ, -DR, -E, -F, -G, -H, -J, -K, -L, -P, -V, -MIC, and -TAP) and unmapped reads were extracted from the BAM file using SAM tools 12 . If one of the paired-end mates was aligned to an HLA locus and the other was not, then both reads in the pair were extracted and used in downstream analyses. Then, the extracted reads were re-aligned to the collection of all the genomic HLA allele sequences in the IMGT/HLA database 13 (release 3.17.0), in which multiple alignments to the reference sequences for each read are allowed, with using the "-a" option in BWA-MEM. The expected read counts on the HLA alleles were estimated by variational Bayesian inference under a statistical framework. After the inference algorithm converges, the HLA types of HLA-A, -B, and -C loci were predicted, based on the expected number of reads assigned to each allele. A threshold for the depth of coverage of the HLA alleles was set. In our analysis, we set the threshold at a 5x depth of coverage. The details of the algorithm are described in the previous literature 10 . We compared the frequencies of HLA alleles predicted in the 1KJPN with those determined in another Japanese population with 1,018 samples 1 (Fig. 3d, Supplementary Fig. 6c, 6d). The HLA alleles that exist in at least one individual in both populations were considered for comparison.

Haplotyping of 1070 Japanese individuals
We first constructed a phased reference panel of the 1KJPN, using the SHAPEIT2 14 software (ver. 2.r644) without singletons. The high-sensitive SNVs plus the short insertions and deletions were included in the variant set. We then determined the phase of the singletons in the following manner: (i) Locally haplotyped regions were obtained using HapMonster 15 software, which uses sequence reads spanning multiple heterozygous positions to perform local haplotyping.
(ii) If the phasing result of a locally haplotyped region from HapMonster was completely concordant with that estimated by SHAPEIT2, we added singletons in the region to the phased reference panel and phased them according to the local haplotype in the region from HapMonster. If phasing results from SHAPEIT2 and HapMonster were discordant due to some variants in the region, singletons in the region were ignored in the phased reference panel. (iii) Singletons not included in locally haplotyped regions were ignored in the phased reference panel. 43% of all the singletons were corrected with above procedures.

Imputation performance
We imputed the genotypes of 131 Japanese individuals not included in the 1KJPN, based on the genotypes at designed sites in the HumanOmni2.5-8 BeadChip, by using the IMPUTE2 16 software (ver. 2.2.2). The genotypes of these individuals were obtained using the same sequencing protocol and the same variant calling pipeline as were used in determining the high-sensitive SNVs set, and used as the gold standard. For the IMPUTE2 options, Ne and k_hap values were set to 20,000 and 1,000, respectively. In addition to the 1KJPN, we considered the following three reference panels for imputation, to evaluate their performance: the reference panel from the 1KGP released in Dec. 2013 and containing 1,092 cosmopolitans, a reference panel of 89 JPT individuals from the 1KGP, and a reference panel comprising both the 1KGP and the 1KJPN. To assess the agreement between the imputed genotypes and the genotype calls from NGS, we calculated the squared Pearson correlation coefficient between the gold standard genotypes, taking respective integer values of 0, 1, and 2, and values of allele dosages of imputed genotypes from 0 to 2, as in the literature 16 . Fig. 4a gives r 2 values averaged for each MAF bin size, for SNVs that exist in both the 1KJPN and the 1KGP. The MAF for each SNP was calculated independently for each reference panel. The mean r 2 value of the 1KJPN was higher than those of the other populations across the range, and the use of the ToMMo-1KGP for imputing genotypes in the Japanese population was effective, especially for low-frequency variants.

GWAS of the Moyamoya disease with imputation
We performed imputation based on a genotype dataset from a case-control study on the Japanese moyamoya disease (MMD) 17 , using the reference panel of the 1KJPN. The dataset contains the genotypes of 72 Japanese MMD patients and of 45 healthy Japanese controls, from the International HapMap Project, at 1,140,419 sites designed on the Illumina HumanOmni1-Quad BeadChip. The genotypes of the case samples were obtained from the Illumina HumanOmni1-Quad BeadChip, and those of the control samples were obtained from the database of the International HapMap Project. After converting the genotype coordinate from hg18 to hg19 using liftOver, sites with a missing genotype rate > 0.01 were removed, and the remaining 975,719 sites were used in the analysis. The strand information was corrected based on the concordance of the alleles or on MAF in the 1KJPN. We performed a chi-squared test implemented in PLINK 1.07 --assoc option on these SNPs for the original and imputed datasets. For the association study on the imputed dataset, the best-guess genotypes on imputed variant positions with info metric from IMPUTE2 more than or equal to 0.5 were considered. With a significance threshold of p-value < 5.25x10 -8 , only a synonymous SNP, rs11870849, located at the coding region of ENDOV (chr17:78411073) and with a p-value of 6.95x10 -9 was identified in the original (non-imputed) dataset (Fig. 4b). In the imputed dataset, a nonsynonymous SNP, rs112735431, located in RNA213 (chr17:78358945), was identified as having the highest SNP association with a p-value of 8.07x10 -10 , which is beyond the significance threshold, at a p-value < 5.06x10 -9 (Fig. 4d). We also applied the reference panel from the 1KGP to the MMD dataset for the imputation. A Manhattan plot from the imputed genotypes appeared to contain many spurious associations with MMD ( Supplementary Fig. 7a). Since the control samples of the MMD dataset are also included in 1KGP as HapMap JPT samples, we performed association studies for genotypes imputed with the panel without HapMap JPT samples. Supplementary Fig. 7b shows a Manhattan plot from the imputed genotypes, where rs112735431 was not detected as a significant variant.

Individual mutation load
We identified overlaps between the high-confidence SNVs and known pathological SNVs from the Human Gene Mutation Database (HGMD) 18 . Possible pathological SNVs were identified based on the genomic coordinates and on the consistency of the allele bases. After this filtering step, 4,368 pathological SNVs in the HGMD, including 1,002 disease-causing mutations (DMs), were extracted. Stop-gained alleles were annotated using the SnpEff software (ver. 3.3c). For each individual, the number of these functional variants (HGMD-DM and stop-gained alleles) was counted for each variant class. We used sites of functional variants that satisfy the following conditions; (i) the ancestral states were inferred with high-confidence, (ii) the functional variants were non-reference (alternative) alleles, and (iii) the functional variants were derived alleles. We discarded stop-gained SNVs if the proportion of truncated ORFs was less than 5%.

Variants discovery rate
Although the variants discovery rate can be predicted from preliminary data 19,20 , we estimated the discovery rate using variants data obtained in this study. Let us consider a sampling probability of variants ( ) from a population with the allele frequency distribution F(q) where q is an allele frequency of a variants in very large population. In a sample of n individuals comprising 2n chromosomes from the population, the sampling probability of variants with minor allele frequency equal or greater than q min can be written as follows:

Demographic inference
In a constant population size, the allele frequency distribution under the equilibrium state is ( ) ( ) . However, such setting is not realistic for human population where the complex demographic events such as recent size expansion and bottleneck have been experienced. Thus, we inferred the demographic history of the 1KJPN was inferred using a forward-time simulation under the Wright-Fisher diffusion model 21 (Supplementary Fig. 3). The time-course of the allele frequency (q) distribution in a population, f(q, t), is approximated by solving the following diffusion equation: where M(q) and V(q) represent the mean and variance of the change in the allele frequency at frequency q, respectively. Consider a variable population size with additive selection in the Wright-Fisher model, and let ρ(t) and N 0 be the relative population size and the effective population size at time 0, respectively, V(q) and M(q) are respectively written as follows: where s is an advantageous selection coefficient. The time unit of Equation (2) can be converted into a unit of 2N 0 generations: where S=2N 0 s and T=2N 0 t. Equation (3) was solved numerically with uneven grid spacing, as previously described 22 . The changes in the population size are represented by a series of linear equations, whose tips are connected each other.
where m i is the time in units of 2N 0 , and λ i is the relative population size at time m i . Time intervals between m i and m i+1 was set to be proportional to 1/i(i+1) so that the time interval become shorter as time is closer to the present time. The expected number of sites with an i derived allele in a sample size n, i.e., the site frequency spectrum (SFS), is calculated using the following equation: where θ=2N 0 U and U is the mutation rate for a set of sites of interest (e.g. intergenic). Let X=(x 1 , x 2 , … , x i , …, x n/2 ) be the observed number of sites with a minor allele count i. Assuming that each x follows an independent Poisson distribution, the likelihood function with folded SFS X is written as the following equation. The m i s and time of simulation τ were set a priori, and the λ i and θ values were estimated by maximizing likelihood function L(λ:X), with the L-BFGS-B algorithm 23 . Applying the maximum likelihood estimates of demographic parameter ̂, the allele frequency distribution F(q) in Equation (1) can be obtained as follows: The demographic parameter ̂ was estimated from the SFS of the high-confidence SNVs of intergenic regions (Supplementary Fig. 3a) and the number of time intervals was set to be 12. The inferred demographic model was shown in Supplementary Fig. 3b. The variants discovery rate in Fig. 1d was estimated according to Equation (1) under this model.