Characterizing rare and low-frequency height-associated variants in the Japanese population

Human height is a representative phenotype to elucidate genetic architecture. However, the majority of large studies have been performed in European population. To investigate the rare and low-frequency variants associated with height, we construct a reference panel (N = 3,541) for genotype imputation by integrating the whole-genome sequence data from 1,037 Japanese with that of the 1000 Genomes Project, and perform a genome-wide association study in 191,787 Japanese. We report 573 height-associated variants, including 22 rare and 42 low-frequency variants. These 64 variants explain 1.7% of the phenotypic variance. Furthermore, a gene-based analysis identifies two genes with multiple height-increasing rare and low-frequency nonsynonymous variants (SLC27A3 and CYP26B1; PSKAT-O < 2.5 × 10−6). Our analysis shows a general tendency of the effect sizes of rare variants towards increasing height, which is contrary to findings among Europeans, suggesting that height-associated rare variants are under different selection pressure in Japanese and European populations.

H uman height is a highly heritable trait under polygenic inheritance 1 . Over the past decade, genome-wide association studies (GWAS) have identified more than 3500 variants associated with human height [2][3][4][5][6][7][8] . Furthermore, recent studies using high-density genotype imputation based on the whole-genome sequencing data [4][5][6] and commercial genotyping array focused on coding variants 7 and uncovered dozens of rare and low-frequency variants with greater impacts on human height. Meanwhile, investigations in non-European populations have been performed with relatively smaller sample sizes 3 . Therefore, there were limited variants reportedly associated with height in non-European population, such as East Asians. Considering that diverse populations harbor ancestral-specific rare variants 9 , searching for such genetic variations in East Asian population with large samples would expand our knowledge on genetic components associated with complex traits.
Here we report a large-scale association study (N > 190,000) of human height in a Japanese population using a newly constructed reference panel for genotype imputation, including deep wholegenome sequence (WGS) data of 1037 Japanese 10 . Our primary aim of this study is to identify rare (minor allele frequency (MAF) <1%) and low-frequency (1% ≤ MAF < 5%) variants associated with height in the Japanese population. To characterize the roles of the identified variants, we further evaluate the gene-level associations and pleiotropic effects of the identified genes and variants, and investigate the natural selection on height-associated rare variants. Through the study, we identify 573 variants associated with height in the Japanese population. Of these, 22 are rare and 42 are low-frequency variants. We also reveal two height-associated genes that have not been reported. Moreover, our analysis of selection on height-associated variants suggests different selection pressure between Japanese and European populations.

Results
Reference panel for genotype imputation. The study design is illustrated in Supplementary Fig. 1. To achieve a better imputation accuracy for rare and low-frequency variants, we constructed a reference panel (N = 3541) composed of deep WGS data obtained from 1037 Japanese individuals 10 (BBJ1K; Supplementary Note 1) and the 1000 Genomes Project 9 (1KG; phase3v5, N = 2504). We compared the imputation accuracy with that of other reference panels and found that the integrated panel exhibited the best imputation accuracy, especially for rare and low-frequency variants ( Fig. 1 and Supplementary Fig. 2).
Genome-wide association study. After the application of stringent quality control (QC) and whole-genome imputation to the new reference panel, we conducted a GWAS in 159,195 participants of the Biobank Japan (BBJ) project 11,12 using 27,896,057 imputed variants (Supplementary Fig. 3a and Supplementary Table 1). As expected, we observed high genomic inflation (λ GC = 1.20 for all variants and 1.69 for variants with MAF ≥1%; Supplementary Fig. 3b). Since we used a linear mixed model 13 to test association, we interpreted this inflation as being due to polygenic effects. We quantified the bias resulting from population stratification and cryptic relatedness using linkage disquilibrium (LD) score regression 14 . The mean χ 2 value was 2.63, intercept was 1.21 ± 0.02 (mean ± standard error [s.e.]) and ratio was 0.13. This proportion was comparable with those in our previous GWASs for quantitative traits in the BBJ 15,16 .
We defined genetic loci by considering the positional relationships of the variants reached genome-wide significance (P GWAS <5.0 × 10 −8 ; Supplementary Fig. 4a). If the significantly associated variants at two different loci were more than 1 Mb apart from each other, we regarded these loci as independent. As a result, we observed 363 independent loci, including nine novel loci with genome-wide significance (Supplementary Data 1, 2 and Supplementary Table 2).
Next, to investigate independently associated signals underlying the identified 363 loci, we performed several rounds of conditional analyses by repeating the association analysis using BOLT-LMM until the association of the top variants fell below the significance threshold. This analysis revealed an additional 246 independently associated variants satisfying the genome-wide significance within 130 loci (Supplementary Data 2, 3 and Supplementary Fig. 5). In total, 609 lead variants reached a genome-wide significant level in the GWAS. We identified a high number of independently associated signals near ACAN (N signal = 7), ADAMTS17 (N signal = 11), and IGFBP5-IHH-RESP18 (N signal = 10), which were also observed in a study conducted in a European population 2 . When we examined the frequencies of the identified variants in the 1KG (Supplementary Data 4 and Supplementary Fig. 6), 40 were found to be monomorphic in all non-East Asian populations, suggesting that these variants are East Asian specific.
We evaluated 609 identified variants in 32,692 independent individuals as a replication study (Supplementary Note 2, Supplementary Table 1, and Supplementary Data 5). Among these variants, 415 (68.1%) were nominally associated (P rep < 0.05) in the same direction of effects, and 598 variants (98.2%) showed directional consistency (P for sign test = 9.38 × 10 −161 ), suggesting that the majority of the identified variants were not false positives. We also observed a strong correlation of the effect sizes between the GWAS and replication (Pearson's r = 0.88, P = 3.47 × 10 −194 , Supplementary Fig. 7). After a fixed-effect meta-analysis, 573 variants (95.5%) remained genome-wide significant (Supplementary Data 5) and we determined that these were significantly associated throughout the present study. Of these variants, 124 (21.6%) were replicated with Bonferroni-corrected level of associations (α = 8.21 × 10 −5 [=0.05/609]) and 289 (50.4%) were replicated with nominal significance (8.21 × 10 −5 < P rep < 0.05). We did not observe Bonferroni-corrected level of heterogeneity of effect sizes between GWAS and the replication study (P het > 8.21 × 10 −5 ). Among the 573 variants, 22 were rare and 42 were lowfrequency variants. The effect sizes of the rare and low-frequency variants showed stronger effects than those of common variants (Fig. 2). The rare and low-frequency variants that remained genome-wide significant after the meta-analysis (N = 64) together explained 1.7% of the phenotypic variance in the replication set (Tables 1 and 2), which was comparable to findings in the European population (1.7% by 83 variants) 7 .
We searched for 83 coding variants with low MAFs identified in the largest GWAS conducted to date (GIANT consortium 7 ) and identified 30 variants evaluated in our GWAS (25 were rare or low-frequency variants in the Japanese population). Among them variants, eight were nominally associated (P GWAS <0.05; Supplementary Table 3). We also looked up other height-associated variants in the European population 4-6 (Supplementary Table 4). Notably, we replicated the association of rs143840904 in the KCNQ1 region (P GWAS = 6.4 × 10 −19 , MAF = 1.6%; Supplementary Fig. 8) that was originally reported by Sardinians 4 . In that previous study, either rs143840904 or rs150199504 was implicated in height based on considering the LD pattern and their allele frequencies in the South Asian population. However, rs150199504 is monomorphic in the BBJ1K and EAS samples of 1KG 9 , and we therefore suspected that rs143840904 was the causal variant of this locus. We also looked up at the associations of variants reported by the meta-analysis of GIANT consortium and UK Biobank (UKB) 8 . By considering that the reported lead variants were selected by GCTA-COJO 17 , we considered 2481 out of 3290 variants showing genome-wide significant associations in the nonconditioned analysis. Of these variants, 1417 variants were evaluated in our GWAS (Supplementary Data 6), and 772 variants were nominally associated (P GWAS < 0.05) with the consistent direction of effects. These 772 variants represented a significant positive correlation in the effect sizes between our GWAS and the meta-analysis of GIANT and UKB (Pearson's r = 0.82, P = 7.00 × 10 −189 ), suggesting that these are shared height-associated variants across populations.
Functional annotations and pleiotropic effects. We annotated the variants that were in LD (r 2 > 0.8) with the 573 identified variants and found 117 nonsynonymous variants in 98 genes (Supplementary Data 7). Of these genes, 13 have been reported as genes responsible for height-related anomalies in humans (Supplementary Table 5). Among the nonsynonymous variants, 113 were missense variants, 1 was a non-frameshift deletion, and 3 were putative loss-of-function variants, including a low-frequency stop-gain variant (SCIN R56X) that is extremely rare in other populations. SCIN has been reported to mediate osteoclastogenesis and is involved in developmental dysplasia of the hip 18 .
Gene-based analysis. To pinpoint the genes influencing height, we evaluated gene-level associations with SKAT-O 19 in samples analyzed through GWAS and identified 52 genes that were significantly associated with height (P SKAT-O < 2.5 × 10 −6 ; Fig. 3a and Supplementary Data 8). Although most of the observed gene-level associations seemed to represent the single best-associated nonsynonymous variants in each gene, we found stronger associations by grouping other nonsynonymous variants in the two genes SLC27A3 (P SKAT-O = 6.94 × 10 −38 ) and CYP26B1 (P SKAT-O = 1.33 × 10 −45 ) ( Table 3 and Fig. 3b), indicating multiple associated low-frequency variants (Fig. 3c, d). All of the minor alleles of the variants showing a nominal association (P GWAS < 0.05) exerted height-increasing effects in both genes (Supplementary Data 9, 10). The replication sets confirmed the associations (Table 3). To further investigate whether the identified gene-level associations of two genes resulted from the associations of multiple associated variants, we performed a conditional analysis. We observed significant gene-level associations remained after conditioned on the best-associated single nonsynonymous variant (P SKAT-O = 3.38 ×  Table 7). The missense variant in CYP26B1 (p. L264S) was found to show a subgenome-wide significant association in an African population in the recent analysis by the GIANT consortium (P = 1.13 × 10 −7 ) 7 . We confirmed that this variant showed an association in both our GWAS (P GWAS = 2.5 × 10 −7 , MAF = 9.2%) and the replication set (P rep = 4.8 × 10 −3 , MAF = 9.4%) (Supplementary Data 10). Although we did not include CYP26B1 p.L264S in the gene-based analysis, due to its higher MAF, this observation also supported the conclusion that multiple nonsynonymous variants of CYP26B1 are associated with human height. Disruption of CYP26B1 gene has been reported to cause fusion or overgrowth of cartilaginous structures in mice and zebrafish and leads to craniosynostosis and multiple skeletal anomalies in humans 20 ; however, the impact of SLC27A3 on human phenotypes has not been well characterized. Therefore, we evaluated their pleiotropic effects on 17 human traits and found that SLC27A3 showed significant associations with lipid-related traits (P SKAT-O = 4.30 × 10 −4 and 8.20 × 10 −5 , for total cholesterol and triglyceride, respectively), and CYP26B1 influenced the body-mass index (P SKAT-O = 2.03 × 10 −4 ) ( Supplementary Fig. 10). Considering that SLC27A3 is the fatty acid transporter (FATP) gene, it is likely that further functional investigation of SLC27A3 will reveal the biological link of FATP genes with the regulation of height.
Pathway analysis. To gain further biological insight, we performed a gene set enrichment analysis using PASCAL 21 with reconstituted gene sets, which was implemented in DEPICT 22 . We observed significant enrichment in 221 gene sets (false discovery rate (FDR) q < 0.05; Supplementary Data 11). Of these 221 significant gene sets, 37 gene sets were not reported in the previous study of GIANT consortium 7 . Since these gene sets were correlated with each other, we clustered the identified gene sets into 16 categories using the affinity propagation (AP) clustering method 23 and determined the exemplars in each category. Although gene sets represented by regulatory region DNA binding (GO: 00010607) have not been previously reported, the other 15 categories contained one or more gene sets previously reported by the GIANT consortium 7 , and the overall observations should support the notion that the biological mechanisms regulating human height are shared across different populations.
Variance explained by the analyzed variants. Next, we estimated the phenotypic variance explained by the variants analyzed in the replication sets (N = 25,419) using the MAF-stratified genomicrelatedness-based restricted maximum likelihood (GREML-MS) method 24 with GCTA 25 , which enabled us to evaluate the explained variance of the sets of variants stratified by MAF. When we investigated all imputed autosomal variants, it was estimated that 52.4 ± 4.5% of the total phenotypic variance could be explained. Variants with a MAF <10% explained 20.3 ± 4.6% of the variance. We further split the low-frequency range at a MAF of 5% and found that variants with a MAF <5% explained the largest proportion of the phenotypic variance in human height (14.1 ± 4.7%; Supplementary Fig. 11). Considering that the We plotted the allele frequency (x-axis) and effect size (y-axis) of the minor alleles of the 609 variants that reached genome-wide significance in the discovery GWAS. Variants that remained genome-wide significant after the meta-analysis are indicated in blue. The orange line denotes 80% power to achieve genome-wide significance. To convert the standardized effect size to a centimeter scale, we regarded 1 standard error as 5.63 cm, which was calculated in the population-based cohorts (replication set) identified rare and low-frequency variants explained only 1.7% of the phenotypic variance in our samples, further investigations of rare and low-frequency variants are warranted to elucidate the genetic architecture of human height.
Analysis of natural selection. Recent studies have indicated that the variants associated with human height are under natural selection 4,24,26 ; however, such evidence has been limited to the European population. By examining the relationships between allele frequencies and effect sizes through GWAS, we investigated the effect of selection on height-associated variants in the Japanese population (Fig. 4). We observed that the rarer variants tend to have height raising effects (Spearman's ρ = −0.63, P < 1.0 × 10 −6 ), implicating negative selection on height-increasing alleles. This relationship showed a clear contrast to findings in Europeans (in whom rarer variants entirely showed height-decreasing tendency after grouped by their MAF) 24 . The independent replication set confirmed this finding (Spearman's ρ = −0.69, P < 1.0 × 10 −6 , Supplementary Fig. 12). This dissimilarity implies differences in selection pressure on height-associated variants in the Japanese and European populations. Finally, we assessed the possible artifacts in the analysis of relationship between minor allelic effects (MAEs) and MAFs. First, we considered the bias resulting from genotype imputation. We evaluated this tendency using the directly genotyped variants, and observed the consistent relationship between MAEs and MAFs (Supplementary Fig. 13; Spearman's ρ = −0.39, P = 7.38 × 10 −6 ), suggesting that the observed height-increasing tendency in rarer variants did not result from the bias emerged from imputation. We also considered the impact of convergence in the BOLT-LMM on this analysis. We therefore restricted the variants with standard errors of estimates ≤0.2 and 0.5, and reevaluated the relationship between MAFs and MAEs. We observed similar relationship (Supplementary Fig. 14; Spearman's ρ = 0.61 and 0.69 for variants with SE < 0.2 and 0.5, respectively). Based on these results, we interpreted that the impact of convergence on our result is minimal. Next, in order to assess the possible contribution of population stratification, we selected the samples belonging to the Japanese main cluster and performed GWAS (Supplementary Fig. 15a; N = 148,937). When we evaluated the correlation between MAFs and MAEs, we still observed significant correlation (Supplementary Fig. 15b; Spearman's ρ = −0.35, P = 3.35 × 10 −4 ).

Discussion
Through a large-scale genome-wide analysis in a single non-European population, we identified 64 rare and low-frequency variants and two genes associated with human height. The discovery of the rare and low-frequency height-associated variants was led by the integrated reference panel for genotype imputation, which was constructed with population-specific wholegenome sequencing data 10 . Considering that dozens of the identified genetic variants appeared to be specific to East Asians, investigations of rare and low-frequency variants in non-European populations will contribute to expand the catalog of genes and genetic variations associated with complex traits. Our findings should encourage further investigations of diverse populations in complex trait genetics.
Our results suggested similarities of genetic architecture of human height between Europeans and East Asians. First, we showed the correlation of effect sizes of height-associated variants between East Asians and Europeans, implying that a certain proportion of height-associated variants are shared across populations. Second, the pathway analysis implicated that the large proportion of the identified gene sets were shared between Table 1 Height-associated rare variants in the Japanese population    ours and those found in Europeans. This observation supports the shared biological mechanisms and genetic contributions on human height. Third, the estimated heritability stratified by MAF was comparable between ours and the previous study of Europeans. These observations will help to deepen our knowledge on genetic architecture of human height. On the other hand, we showed some advantages of investigating non-European population, such as identification of dozens of the East Asian-specific height-associated variants, the likely causal variant at KCNQ1 region, and two novel gene-level associations, which were not reported in the previous study. Moreover, by utilizing the phenotypic resources of the Biobank in Japanese population, we evaluated the pleiotropy of the identified rare and low-frequency variants and genes, and found three pleiotropic effects of two rare variants and three gene-level pleiotropic associations in two newly identified genes with multiple rare and low-frequency height-associated variants. These findings contribute to characterize the rare and low-frequency variants associated with height in the Japanese population.
We investigated the negative selection on height-associated variants, and found the relationship that rarer variants showed heightincreasing tendency. We note that this tendency appeared for the grouped variants with mean MAF of at most 0.1%. Among the lead variants significantly associated in the present study, only one variant (rs202237966) was MAF <0.1%, suggesting that investigation of the negative selection using significantly associated variants requires much larger sample size. Considering that investigation of rare variants in a Japanese population suggested difference in negative selection on human height, further multi-ethnic comparisons of rare genetic components may expand our knowledge on the selection pressures underlying complex traits. On the other hand, the correlation in the Japanese main cluster was weakened, suggesting the possible bias from population stratification. Future study using novel statistical method, which is not influenced by population stratification, will be needed to confirm our finding.
In conclusion, we performed a large-scale genetic association study in the Japanese population, and identified 573 variants including 22 rare and 42 low-frequency variants associated with height in the Japanese population. Our results provide insights into the similarities and differences underlying genetic architecture of human height, and underscore the utility of investigating diverse populations to deepen our understanding of the genetic architecture of human complex traits.

Methods
Subjects. Clinical information, including height, and the genome-wide variant data of the samples included in the GWAS were obtained from the BBJ project 11,12 , which enrolled 200,000 participants. We set the eligibility criteria for the GWAS samples as follows: (1) age ≥ 18, (2) available height information, and (3) height within three times the interquartile range of the upper/lower quartile. We calculated Z-scores  from the residuals of linear regression against height using age, age 2 , and sex, and excluded individuals who fell outside ±4 standard deviations. We used genotyping array results to further exclude samples with lower call rates (<98%), closely related individuals using PLINK 27 (PI_HAT > 0.175). We determined outliers from the Japanese cluster by visual inspection based on the principal component analysis (PCA) using GCTA software 25 (ver1.25) with the samples of HapMap project 28 ( Supplementary Fig. 16). Finally, 159,095 individuals were included in the GWAS. There was no overlap between samples included in the GWAS and BBJ1K. The samples used in the replication study were collected from four Japanese population-based studies (Iwate Tohoku Megabank Organization (IMM), Japan Multi-Institutional Collaborative Cohort Study (JMICC), Japan Public Health Center-based Prospective Study (JPHC), and Tohoku University Tohoku Medical Megabank Organization (ToMMo)). We applied the same inclusion and exclusion criteria to the clinical information of the participants. To increase statistical power, we excluded overlapping samples (estimated by PI_HAT) and PCA outliers from the East Asian population and finally included 32,692 individuals in the replication study.
In all participating studies, informed consent was obtained from all participants by following the protocols approved by the corresponding institutional ethical committees before enrollment. The clinical characteristics of each cohort are provided in the Supplementary Table 1. This study was approved by the ethics committee of RIKEN and the Institute of Medical Science, the University of Tokyo.
To combine the 1KG phase3v5 (n = 2504) and BBJ1K data using IMPUTE2 29,30 , we excluded variants located at multi-allelic sites and estimated haplotypes using SHAPEIT. After combining the two datasets, we excluded singletons and variants at multi-allelic sites from the reference panel using bcftools (ver. 1.3.1) and vcftools (ver. 0.1.14). Since the current version of IMPUTE2 is not applicable to haploid chromosomes, we used BEAGLE to impute WGS data for X chromosomes in males and excluded singletons and variants located at multi-allelic sites before merging with a female dataset.
To evaluate the imputation accuracy of different reference panels, we performed whole-chromosome 1 imputation in GWAS samples after excluding variants that were included on the Illumina HumanExome BeadChip. For the imputation reference panels, we used the genotype data from the following samples: (1) 1KG phase1v3 EAS samples, (2) 1KG phase3v5 ALL samples, (3) BBJ1K samples, and (4) the combined 1KG phase3v5 ALL and BBJ1K samples. We carried out prephasing and imputation using Eagle (v2) 31 and minimac3 32 , respectively. The accuracy of the imputation was evaluated by Pearson's r 2 between the GWAS array genotypes and the imputed allelic dosages.
Genotyping and imputation. Samples included in the GWAS were genotyped using either of the following genotyping arrays: (1) the Illumina Huma-nOmniExpressExome BeadChip or (2) a combination of the Illumina Huma-nOmniExpress and HumanExome BeadChips. For QC of genotypes, we excluded variants meeting any of the following criteria: (1) call rate <99%, (2) P value for Hardy-Weinberg equilibrium <1.0 × 10 −6 , and (3) number of heterozygotes <5. After we proceeded through these QC steps using all genotyped samples in the BBJ set, 939 samples from BBJ1K were included. We further compared the genotypes of the overlapping variants between the GWAS array and WGS and excluded variants with a concordance rate <99.5% or a non-reference discordance rate ≥0.5%. In the GWAS, we used Eagle 31 for haplotype phasing without an external reference panel and Minimac3 32 for imputation, respectively. For the purpose of QC, we used variants with an Rsq ≥0.3 in the association analysis.
In the replication set, all participants were genotyped with the Illumina HumanOmniExpressExome BeadChip. We excluded variants meeting any of the following criteria: (1) call rate <98%, (2) P value for Hardy-Weinberg equilibrium <1.0 × 10 −6 , and (3) a minor allele count of <5. Thereafter, we compared the allele frequency between the genotyped samples and BBJ1K and excluded variants showing a >5% difference in allele frequency. Pre-phasing was carried out using SHAPEITv2 33 . We imputed variants with Minimac3 using the same reference panel as in the GWAS. We considered variants with an Rsq ≥0.3 in the association analysis.
Pre-phasing and imputation of the X chromosome were performed using the same software applied for autosomes. In each dataset, we phased the haplotypes for males and females together and then imputed them separately in each sex.
Phenotype and association analysis. For the single-variant association analysis, we obtained the residuals from a linear regression model of height adjusted for age, age 2 , and sex. The residuals were then converted into Z-scores. Association analysis of autosomes was performed using BOLT-LMM (ver.2.2), which implements linear mixed model analysis under the Gaussian mixture model 13 . Since this approach is not applicable to the X chromosome, we recalculated the residuals of the linear regression analysis by adding the top 10 principal components (PCs) of the genotypes as covariates and converted them into Z-scores. Association analysis of the X chromosome was carried out with mach2qtl 32 in the GWAS and with EPACTS (ver. 3.2.6) in the replication study. Since we did not remove related samples from the replication set, we additionally excluded samples based on a PI_HAT >0.175 and included 30,303 samples in the X-chromosomal analysis.
In the conditional analysis performed in the GWAS, we used BOLT-LMM for autosomes and mac2qtl for the X chromosome. We set the same significance threshold employed for the GWAS (α = 5.0 × 10 −8 ) in the conditional analysis and repeated the analysis until the association of the top variants fell below the significance threshold. To replicate the associations of variants identified through conditional analysis, we used the same set of conditioned variants employed in the GWAS as covariates in the replication set.
We integrated the results of the GWAS and the replication set using the inversevariance method and estimated heterogeneity with Cochran's Q test. The variants that satisfied genome-wide significance (P < 5.0 × 10 −8 ) after the meta-analysis were considered to be significantly associated throughout the study.
For the evaluation of gene-level associations, we conducted a group-wise association analysis of coding regions in autosomal genes. For phenotype preparation, we obtained the residuals of the linear regression of height with age, age 2 , sex, and the top 10 PCs as covariates and standardized them as Z-scores. We annotated the variants as described below and extracted nonsynonymous variants, including missense and nonsense variants, non-frameshift insertions and deletions, and frameshift variants with a MAF <5% and imputation quality of Rsq ≥0.3. A groupwise association test was performed using the imputed allelic dosage employing SKAT-O 19 implemented in EPACTS. As in the X-chromosome analysis, we excluded closely related samples from the replication set. We set a conservative significance threshold for the gene-level association analysis using Bonferroni correction based on the assumption that 20,000 genes could be evaluated (α = 0.05/20,000 genes). To evaluate whether the identified gene-level associations resulted from multiple nonsynonymous variants, we performed conditional analyses by adjusting for the best-associated variant of each gene included in the group-wise association analysis.
LD score regression. To assess bias resulted from population stratification and cryptic relatedness, we conducted a single variate LD score regression 14 analysis using LDSC v.1.0.0. We used the LD score file for East Asian population provided with the software.
Annotation. We annotated variants using ANNOVAR 34  Pleiotropy. To investigate the pleiotropic effects of the 64 identified variants with a MAF <5%, we evaluated the associations of 17 different phenotypes using the clinical information of the samples analyzed in the GWAS. We targeted the nine quantitative traits among hematological, kidney-related and liver-related traits and two diseases by considering the results of our genetic correlation analysis across 89 complex traits 16 . We additionally selected four lipid-related traits, hemoglobin A1c, and type 2 diabetes because height-associated rare and low-frequency variants showed significant associations with these traits in a previous study 7 . For quantitative traits, we used standardized phenotypes (detailed in Supplementary Table 6) and performed linear regression analysis. For the case-control analysis, we applied a logistic regression analysis adjusted for age, sex, and the top 10 PCs as covariates.
We also tested the pleiotropy of the identified genes (SLC27A3 and CYP26B1) for the same 17 traits. Gene-level associations were estimated with SKAT-O 19 using EPACTS (ver. 3.2.6), with the same settings applied for height. We set the significance threshold using Bonferroni correction at α = 0.05/17/64 for singlevariant analysis and α = 0.05/17/2 for the gene-based test.
Pathway analysis. We conducted a gene set enrichment analysis using PASCAL software 21 . To fit into the Japanese LD structure, we used a custom reference panel generated from the BBJ1K set. We used 14,462 gene sets reconstituted with DEPICT software 22 (Z-score >3; software downloaded from https://data. broadinstitute.org/mpg/depict/depict_download/reconstituted_genesets/).
The default settings were employed for the analysis. After determining significant gene sets by evaluating the FDR using the Benjamini-Hochberg method (threshold: FDR <0.05) for the P value estimated via the χ 2 pathway scoring method, we applied an AP clustering method 23 to merge the identified gene sets into functionally relevant categories and defined exemplars in each category. We used the Z-scores for all genes to perform this clustering.
GREML analysis. We employed Japanese population-based cohorts (replication set) to estimate the variance explained by the imputed variants. Following a previous report 24 , we used hard call genotypes (x dose ) estimated based on the imputed allelic dosage (x g ), without excluding variants with a lower imputation quality. Here, we converted x g = 0 for x dose < 0.5; x g = 2 for x dose > 1.5; and x g = 1 for 0.5 ≤ x dose ≤ 1.5; and we excluded variants with a minor allele count <3 and a P for Hardy-Weinberg <1.0 × 10 −6 . To estimate the variance explained by the variants contained in our reference panel, the GREML-MS 24 approach was applied using GCTA software 25 (ver. 1.25). We first generated a genetic relationship matrix (GRM) using all of the variants and estimated genetic relatedness and restricted samples with relatedness <0.05 using the "-grm-cutof'" option. As a result, 25,419 samples were included in the analysis. Trait values were re-standardized in these samples. Subsequently, we generated seven GRMs stratified by MAF as follows: (1) MAF <5%, (2) 5% ≤ MAF < 10%, (3) MAF <10%, (4) 10% ≤ MAF < 20%, (5) 20% ≤ MAF < 30%, (6) 30% ≤ MAF < 40%, and (7) 40% ≤ MAF. We estimated phenotypic variance according to each MAF strata using the "-reml" option with the top 10 PCs to control population stratification.
Mean allelic effect stratified by MAF. We evaluated the relationship between the MAF and the minor allele effect in our samples by utilizing the variants analyzed in the GWAS (Rsq ≥0.3). After excluding the variants that did not converge, we divided the autosomal variants into 100 bins according to their MAF. The number of the variants in each bin was determined as follows. First, we divided the number of all the analyzed variants by 100. The number of variants assigned in bins 1-99 was the integer quotient of the number of all the variants +1, and the remained variants were grouped into the bin 100. Then, we estimated Spearman's rank correlation coefficient (ρ) between the MAF and the minor allele effect in each bin. The statistical significance of a correlation coefficient value was determined via one million resamplings. In each resampling procedure, we randomly selected the same number of variants and estimated the empirical distribution of ρ values, as proposed previously 24 . For the purpose of replication, we analyzed samples in the replication set in the same way.