Height is a highly heritable, classic polygenic trait with approximately 700 common associated variants identified through genome-wide association studies so far. Here, we report 83 height-associated coding variants with lower minor-allele frequencies (in the range of 0.1–4.8%) and effects of up to 2 centimetres per allele (such as those in IHH, STC2, AR and CRISPLD2), greater than ten times the average effect of common variants. In functional follow-up studies, rare height-increasing alleles of STC2 (giving an increase of 1–2 centimetres per allele) compromised proteolytic inhibition of PAPP-A and increased cleavage of IGFBP-4 in vitro, resulting in higher bioavailability of insulin-like growth factors. These 83 height-associated variants overlap genes that are mutated in monogenic growth disorders and highlight new biological candidates (such as ADAMTS3, IL11RA and NOX4) and pathways (such as proteoglycan and glycosaminoglycan synthesis) involved in growth. Our results demonstrate that sufficiently large sample sizes can uncover rare and low-frequency variants of moderate-to-large effect associated with polygenic human phenotypes, and that these variants implicate relevant genes and pathways.
Human height is a highly heritable, polygenic trait1,2. The contribution of common DNA sequence variation to inter-individual differences in adult height has been systematically evaluated through genome-wide association studies (GWAS). This approach has thus far identified 697 independent variants located within 423 loci that together explain around 20% of the heritability of height3. As is typical of complex traits and diseases, most of the alleles that affect height that have been discovered so far are common (with a minor allele frequency (MAF) > 5%) and are mainly located outside coding regions, complicating the identification of the relevant genes or functional variants. Identifying coding variants associated with a complex trait in new or known loci has the potential to help pinpoint causal genes. Furthermore, the extent to which rare (MAF < 1%) and low-frequency (1% < MAF ≤ 5%) coding variants also influence complex traits and diseases remains an open question. Many recent DNA sequencing studies have identified only a few of these variants4,5,6,7,8, but this limited success could be due to their modest sample size9. Some studies have suggested that common sequence variants may explain the majority of the heritable variation in adult height10. It is therefore timely to assess whether and to what extent rare and low-frequency coding variations contribute to the genetic landscape of this model polygenic trait.
In this study, we used an ExomeChip11 to test the association between 241,453 variants (of which 83% are coding variants with a MAF ≤ 5%) and adult height variation in 711,428 individuals (discovery and validation sample sizes were 458,927 and 252,501, respectively). The ExomeChip is a genotyping array designed to query in very large sample sizes coding variants identified by whole-exome DNA sequencing of approximately 12,000 participants. The main goals of our project were to determine whether rare and low-frequency coding variants influence the architecture of a model complex human trait (in this case, adult height) and to discover and characterize new genes and biological pathways implicated in human growth.
Coding variants associated with height
We conducted single-variant meta-analyses in a discovery sample of 458,927 individuals, of whom 381,625 were of European ancestry. We validated our association results in an independent set of 252,501 participants. We first performed standard single-variant association analyses (Extended Data Figs 1, 2, 3 and Supplementary Tables 1–11; technical details of the discovery and validation steps are presented in the Methods). In total, we found 606 independent ExomeChip variants at array-wide significance (P < 2 × 10−7), including 252 non-synonymous or splice-site variants (Methods and Supplementary Table 11). Focusing on non-synonymous or splice-site variants with a MAF < 5%, our single-variant analyses identified 32 rare and 51 low-frequency height-associated variants (Extended Data Tables 1, 2). To our knowledge, these 83 height variants (MAF range of 0.1–4.8%) represent the largest set of validated rare and low-frequency coding variants associated with any complex human trait or disease to date. Among these 83 variants, there are 81 missense, one nonsense (in CCND3), and one essential acceptor splice site (in ARMC5) variants.
We observed a strong inverse relationship between MAF and effect size (Fig. 1). Although power limits our capacity to find rare variants with small effects, we know that common variants with effect sizes comparable to the largest seen in our study would have been easily discovered by prior GWAS, but were not detected. Our results agree with a model based on accumulating theoretical and empirical evidence that suggests that variants with strong phenotypic effects are more likely to be deleterious, and therefore rarer12,13. The largest effect sizes were observed for four rare missense variants, located in the androgen receptor gene AR (NCBI single nucleotide polymorphism (SNP) reference ID: rs137852591; MAF = 0.21%, Pcombined = 2.7 × 10−14), in CRISPLD2 (rs148934412; MAF = 0.08%, Pcombined = 2.4 × 10−20), in IHH (rs142036701, MAF = 0.08%, Pcombined = 1.9 × 10−23), and in STC2 (rs148833559, MAF = 0.1%, Pcombined = 1.2 × 10−30). Carriers of the rare STC2 missense variant are approximately 2.1 cm taller than non-carriers, whereas carriers of the remaining three variants (or hemizygous men that carry a rare X-linked AR allele at rs137852591) are approximately 2 cm shorter than non-carriers. By comparison, the mean effect size of common height alleles is ten times smaller in the same dataset. Across all 83 rare and low-frequency non-synonymous variants, the minor alleles were evenly distributed between height-increasing and height-decreasing effects (48% and 52%, respectively) (Fig. 1 and Extended Data Tables 1, 2).
Coding variants in new and known height loci
Many of the height-associated variants discovered in this study are located near common variants previously associated with height. Of the 83 rare and low-frequency non-synonymous variants, 2 low-frequency missense variants were previously identified (in CYTL1 and IL11)3,14 and 47 fell within 1 Mb of a known height signal; the remaining 34 define new loci. We used conditional analysis of the UK Biobank dataset and confirmed that 38 of these 47 variants were independent of the previously described height SNPs (Supplementary Table 12). We validated the UK Biobank conditional results using an orthogonal imputation-based methodology implemented in the full discovery set (Extended Data Fig. 4 and Supplementary Table 12). In addition, we found a further 85 common variants and one low-frequency synonymous variant (in ACHE) that define novel loci (Supplementary Table 12). Thus, our study identified a total of 120 new height-associated loci (Supplementary Table 11).
We used the UK Biobank dataset to estimate the contribution of the new height variants to heritability, which is h2 ≈ 80% for adult height2. In combination, the 83 rare and low-frequency variants explained 1.7% of the heritability of height. The newly identified common variants accounted for another 2.4%, and all independent variants, known and novel, together explained 27.4% of heritability. By comparison, the 697 known height-associated SNPs explain 23.3% of height heritability in the same dataset (versus the 4.1% explained by the new height-associated variants identified in this study). We observed a modest positive association between MAF and heritability for each variant (P = 0.012, Extended Data Fig. 5), with each common variant explaining slightly more heritability than rare or low-frequency variants (0.036% versus 0.026%, Extended Data Fig. 5).
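The following minimal sketch illustrates why a rare variant with a large per-allele effect can still explain less trait variance than a common variant with a small effect. It uses the standard additive-model expression 2p(1 − p)β² (β in phenotypic standard-deviation units), which is not stated in the text. The height standard deviation of 6.5 cm, the example common variant and all names are assumptions chosen only for illustration.

```python
def variance_explained(maf, beta_sd):
    """Fraction of phenotypic variance explained by one biallelic variant
    under an additive model with Hardy-Weinberg proportions: 2p(1-p)*beta^2.
    beta_sd is the per-allele effect in phenotypic standard-deviation units."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must lie in [0, 0.5]")
    return 2.0 * maf * (1.0 - maf) * beta_sd ** 2


# Assumption for illustration only; not a value reported by the study.
ASSUMED_HEIGHT_SD_CM = 6.5

# A rare variant: MAF 0.1%, about 2 cm per allele.
rare = variance_explained(0.001, 2.0 / ASSUMED_HEIGHT_SD_CM)

# A hypothetical common variant: MAF 30%, 0.2 cm per allele.
common = variance_explained(0.30, 0.2 / ASSUMED_HEIGHT_SD_CM)

print(f"rare: {rare:.6f}  common: {common:.6f}")
```

Under these assumed inputs the common variant explains more variance than the rare one, even though its per-allele effect is ten times smaller.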
Gene-based association results
To increase the power to find rare or low-frequency coding variants associated with height, we performed gene-based analyses (Methods and Supplementary Tables 13–15). After accounting for gene-based signals explained by a single variant driving the association statistics, we identified ten genes with P < 5 × 10−7 that harboured more than one coding variant independently associated with height variation (Supplementary Tables 16, 17). These gene-based results remained significant after conditioning on genotypes at nearby common height-associated variants present on the ExomeChip (Table 1). Using the same gene-based tests in an independent dataset of 59,804 individuals genotyped on the same exome array, we replicated three genes at P < 0.05 (Table 1). Further evidence for replication in these genes was seen at the level of single variants (Supplementary Table 18). From the gene-based results, three genes—CSAD, NOX4, and UGGT2—are outside of the loci found by single-variant analyses and are implicated in human height for the first time to our knowledge.
Coding variants implicate pathways in skeletal growth
Previous pathway analyses of height loci identified by GWAS have highlighted gene sets related to both general biological processes (such as chromatin modification and regulation of embryonic size) and skeletal-growth-specific pathways (such as chondrocyte biology, extracellular matrix and skeletal development)3. We used two different methods, DEPICT15 and PASCAL16 (see Methods), to perform pathway analyses using the ExomeChip results to test whether coding variants could independently confirm the relevance of these previously highlighted pathways (and further implicate specific genes in these pathways) or identify new pathways. To compare the pathways emerging from coding and non-coding variation, we applied DEPICT separately onto exome-array-wide associated coding variants independent of known GWAS signals and onto non-coding GWAS loci, excluding all novel height-associated genes implicated by coding variants. We identified a total of 496 and 1,623 enriched gene sets, respectively, at a false discovery rate < 1% (Supplementary Tables 19, 20); similar analyses with PASCAL yielded 362 and 278 enriched gene sets, respectively (Supplementary Tables 21, 22). Comparison of the results revealed a high degree of shared biology for coding and non-coding variants (for DEPICT, gene set P values compared between coding and non-coding results had a Pearson’s r = 0.583, P < 2.2 × 10−16; for PASCAL, Pearson’s r = 0.605, P < 2.2 × 10−16). However, some pathways were more strongly enriched for either coding or non-coding genetic variation. In general, coding variants more strongly implicated pathways specific to skeletal growth (such as extracellular matrix and bone growth), whereas GWAS signals highlighted more global biological processes (such as transcription factor binding and embryonic size or lethality) (Extended Data Fig. 6). 
The two significant gene sets identified by DEPICT and PASCAL that uniquely implicated coding variants were the BCAN protein–protein interaction sub-network and the proteoglycan-binding set. Both of these pathways relate to the biology of proteoglycans, which are proteins (such as aggrecan) that contain glycosaminoglycans (such as chondroitin sulfate) and that have well-established connections to skeletal growth17.
We also investigated which height-associated genes identified by ExomeChip analyses were driving enrichment of pathways such as proteoglycan binding. Using unsupervised clustering analysis, we observed that a cluster of 15 height-associated genes was strongly implicated in a group of correlated pathways that include biology related to proteoglycans and glycosaminoglycans (Fig. 2 and Extended Data Fig. 7). Seven of these 15 genes overlap a previously curated list of 277 genes annotated in OMIM (http://omim.org/) as causing skeletal growth disorders3; genes in this small cluster are enriched for OMIM annotations relative to genes outside the cluster (odds ratio = 27.6, Fisher's exact P = 1.1 × 10−5). As such, the remaining genes in this cluster may harbour variants that cause Mendelian growth disorders. Within this group are genes that are largely uncharacterized (SUSD5), have relevant biochemical functions (GLT8D2, a glycosyltransferase studied mostly in the context of the liver18; LOXL4, a lysyl oxidase expressed in cartilage19), modulate pathways known to affect skeletal growth (FIBIN, SFRP4)20,21 or lead to increased body length when knocked out in mice (SFRP4)22.
Functional characterization of rare STC2 variants
To investigate whether the identified rare coding variants affect protein function, we performed in vitro functional analyses of two rare coding variants in a particularly compelling and novel candidate gene, STC2. Overexpression of STC2 diminishes growth in mice by covalent binding to and inhibition of the proteinase PAPP-A, which specifically cleaves insulin-like growth factor binding protein 4 (IGFBP-4), leading to reduced levels of bioactive insulin-like growth factors23 (Fig. 3a). Although there was no prior genetic evidence implicating STC2 variation in human growth, the PAPPA and IGFBP4 genes have both been implicated in height GWAS3, and rare mutations in PAPPA2 cause severe short stature24, emphasizing the likely relevance of this pathway in humans. The two STC2 height-associated variants are rs148833559 (p.R44L, MAF = 0.096%, Pdiscovery = 5.7 × 10−15) and rs146441603 (p.M86I, MAF = 0.14%, Pdiscovery = 2.1 × 10−5). These rare alleles increase height by an average of 1.9 and 0.9 cm, respectively, suggesting that they both partially impair STC2 activity. In functional studies, STC2 variants with these amino acid substitutions were expressed at similar levels to wild-type STC2, but showed clear, partial defects in binding to PAPP-A and in inhibition of PAPP-A-mediated cleavage of IGFBP-4 (Fig. 3b–d). Thus, the genetic analysis successfully identified rare coding alleles that have demonstrable and predicted functional consequences, strongly confirming the role of these variants and the STC2 gene in human growth.
Previous GWAS studies have reported pleiotropic or secondary effects on other phenotypes for many common variants associated with adult height3,25. Using association results from 17 human complex phenotypes for which well-powered meta-analysis results are available, we investigated whether rare and low-frequency height variants are also pleiotropic. We found one rare and five low-frequency missense variants associated with at least one of the other investigated traits at array-wide significance (P < 2 × 10−7) (Extended Data Fig. 8 and Supplementary Table 23). The minor alleles at rs77542162 (ABCA6, MAF = 1.7%) and rs28929474 (SERPINA1, MAF = 1.8%) are associated with increased height and increased levels of low-density lipoprotein (LDL) cholesterol and total cholesterol, whereas the minor allele at rs3208856 in CBLC (MAF = 3.4%) is associated with increased height, high-density lipoprotein (HDL) cholesterol and triglyceride, but decreased LDL cholesterol and total cholesterol levels. The minor allele at rs141845046 (ZBTB7B, MAF = 2.8%) was associated with both increased height and body mass index (BMI). The minor alleles at the other two missense variants associated with shorter stature, rs201226914 in PIEZO1 (MAF = 0.2%) and rs35658696 in PAM (MAF = 4.8%), were associated with decreased glycated haemoglobin (HbA1c) and increased risk of type 2 diabetes (T2D), respectively.
We undertook an association study of nearly 200,000 coding variants in 711,428 individuals, and identified 32 rare and 51 low-frequency coding variants associated with adult height. Furthermore, gene-based testing discovered 10 genes that harbour several additional rare or low-frequency variants associated with height, including three genes (CSAD, NOX4 and UGGT2) in loci not previously implicated in height. Given the design of the ExomeChip, which did not consider variants with a MAF < 0.004% (corresponding to approximately one allele in 12,000 participants), our gene-based association results do not rule out the possibility that additional genes with such rarer coding variants also contribute to height variation; deep DNA sequencing in very large sample sizes will be required to address this question. In total, our results highlight 89 genes (10 from gene-based testing and 79 from single-variant analyses (4 genes have 2 independent coding variants)) that are likely to modulate human growth, and 24 alleles segregating in the general population that affect height by more than 1 cm (Table 1 and Extended Data Tables 1, 2). The rare and low-frequency coding variants explain 1.7% of the heritable variation in adult height. When considering all rare, low-frequency and common height-associated variants validated in this study, we can now explain 27.4% of the heritability of height.
Our analyses revealed many coding variants in genes mutated in monogenic skeletal growth disorders, confirming the presence of allelic series (from familial penetrant mutations to mild effect common variants) in the same genes for related growth phenotypes in humans. We used gene-set-enrichment-type analyses to demonstrate the functional connectivity between the genes that harbour coding height variants, highlighting both known and novel biological pathways that regulate height in humans (Fig. 2, Extended Data Fig. 7 and Supplementary Tables 19–22), and implicating genes such as SUSD5, GLT8D2, LOXL4, FIBIN and SFRP4 that have not been previously connected with skeletal growth. Additional noteworthy height candidate genes include NOX4, ADAMTS3, ADAMTS6, PTH1R and IL11RA (Extended Data Tables 1, 2 and Supplementary Tables 17, 24). NOX4, identified through gene-based testing, encodes NADPH oxidase 4, an enzyme that produces reactive oxygen species, a biological pathway not previously implicated in human growth. Nox4−/− mice display higher bone density and a reduced number of osteoclasts, a cell type that is essential for bone repair, maintenance and remodelling12. We also found rare coding variants in ADAMTS3 and ADAMTS6, genes that encode metalloproteinases that belong to the same family as several other human growth syndromic genes (such as ADAMTS2, ADAMTS10 and ADAMTSL2). Moreover, we discovered a rare missense variant in PTH1R, which encodes a receptor for parathyroid hormone; parathyroid hormone–PTH1R signalling is important for bone resorption, and mutations in PTH1R cause chondrodysplasia in humans26. Finally, we replicated the height association of a low-frequency missense variant in the cytokine gene IL11, and also found a low-frequency missense variant in the gene encoding its receptor, IL11RA. The IL11–IL11RA axis has been shown to play an important role in bone formation in the mouse27,28. Thus, our data confirm that this signalling cascade is also relevant in human growth.
Overall, our findings provide strong evidence that rare and low-frequency coding variants contribute to the genetic architecture of height, a model complex human trait. This conclusion has implications for the prediction of complex human phenotypes in the context of precision medicine initiatives. Although rare, large effect-size variants might not explain most of the heritable disease risk at the population level, they are important for predicting the risk of disease development for the individuals that carry them. Our findings also seem to contrast markedly with results from the recent large-scale T2D association study, which found only six variants with a MAF < 5% (ref. 29). This apparent difference could be explained simply by the large difference in sample sizes between the two studies (711,428 for height versus 127,145 for T2D). When we consider the fraction of associated variants with a MAF < 5% among all confirmed variants for height and T2D, we find that it is similar (9.7% for height versus 7.1% for T2D). This supports the strong probability that rarer T2D alleles and, more generally, rarer alleles for other polygenic diseases and traits will be uncovered as sample sizes continue to increase.
Study design and participants
The discovery cohort consisted of 147 studies comprising 458,927 adult individuals of the following ancestries: (1) European descent (n = 381,625); (2) African (n = 27,494); (3) South Asian (n = 29,591); (4) East Asian (n = 8,767); (5) Hispanic (n = 10,776) and (6) Saudi Arabian (n = 695). All participating institutions and coordinating centres approved this project, and informed consent was obtained from all subjects. Discovery meta-analysis was carried out in each ancestry group (except the Saudi Arabian) separately as well as in the All group. Validation was undertaken in individuals of European ancestry only (Supplementary Tables 1–3). Conditional analyses were undertaken only in the European descent group (106 studies, n = 381,625). The SNPs we identify are available from the NCBI dbSNP database of short genetic variations (https://www.ncbi.nlm.nih.gov/projects/SNP/). No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.
Height (in centimetres) was corrected for age and the genomic principal components (derived from GWAS data, the variants with a MAF > 1% on ExomeChip (http://genome.sph.umich.edu/wiki/Exome_Chip_Design), or ancestry-informative markers available on the ExomeChip), as well as any additional study-specific covariates (for example, recruiting centre), in a linear regression model. For studies with non-related individuals, residuals were calculated separately by sex, whereas for family-based studies sex was included as a covariate in the model. Additionally, residuals for case/control studies were calculated separately. Finally, residuals were subject to inverse normal transformation.
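The phenotype preparation described above can be sketched as follows, using only the Python standard library. This is an illustrative approximation, not the pipeline used in the study: it regresses on a single covariate (age) rather than age, principal components and study-specific covariates, and the Blom offset c = 3/8 in the rank transform is an assumption, as are all function names and the toy data.

```python
from statistics import NormalDist, mean


def residualize(y, x):
    """Residuals of y after simple least-squares regression on one covariate x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def inverse_normal_transform(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform; tied values get their average rank.
    The offset c = 3/8 (Blom) is an assumption, not taken from the study."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Hypothetical toy data for one sex stratum of an unrelated-individuals study.
heights_cm = [172.0, 165.5, 180.2, 158.9, 176.4]
ages = [34, 51, 29, 63, 45]
transformed = inverse_normal_transform(residualize(heights_cm, ages))
```

In a study of unrelated individuals, a function like this would be applied separately within each sex (and separately within cases and controls where relevant), mirroring the stratification described in the text.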
The majority of studies followed a standardized protocol and performed genotype calling using the designated manufacturer’s software, which was then followed by zCall30. For ten studies participating in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium, the raw intensity data for the samples from seven genotyping centres were assembled into a single project for joint calling11. Study-specific quality-control measures of the genotyped variants were implemented before association analysis (Supplementary Tables 1–2).
Study-level statistical analyses
Individual cohorts were analysed separately for each ancestry population, with either RAREMETALWORKER (http://genome.sph.umich.edu/wiki/RAREMETALWORKER) or RVTEST (http://zhanxw.github.io/rvtests/), to associate inverse normal transformed height data with genotype data taking potential cryptic relatedness (kinship matrix) into account in a linear mixed model. Both programs are designed to perform score-statistic-based rare-variant association analysis, can accommodate both unrelated and related individuals, and provide single-variant results and a variance–covariance matrix. The covariance matrix captures linkage disequilibrium relationships between markers within 1 Mb, which is used for gene-level meta-analyses and conditional analyses31. Single-variant analyses were performed for both additive and recessive models (for the alternate allele).
Centralized quality control
The individual study data were investigated for potential existence of ancestry population outliers based on the 1000 Genomes Project phase 1 ancestry reference populations. A centralized quality control procedure implemented in EasyQC32 was applied to individual study association summary statistics to identify outlying studies: (1) assessment of possible problems in height transformation; (2) comparison of allele frequency alignment against 1000 Genomes Project phase 1 reference data to pinpoint any potential strand issues; and (3) examination of quantile–quantile plots per study to identify any problems arising from population stratification, cryptic relatedness and genotype biases. We excluded variants if they had a call rate <95%, Hardy–Weinberg equilibrium P < 1 × 10−7, or large allele frequency deviations from reference populations (>0.6 for all ancestry analyses and >0.3 for ancestry-specific population analyses). We also excluded from downstream analyses markers not present on the Illumina ExomeChip array 1.0, variants on the Y chromosome or the mitochondrial genome, indels, multiallelic variants, and problematic variants based on the Blat-based sequence alignment analyses. Meta-analyses were carried out in parallel by two different analysts at two sites.
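The three numeric variant-exclusion criteria above can be expressed as a small filter. The sketch below is illustrative only: the record layout, field names and function name are hypothetical, and it does not reproduce EasyQC or the additional exclusions (array content, chromosome, indels, multiallelic or poorly aligned variants).

```python
from dataclasses import dataclass


@dataclass
class VariantQC:
    """Hypothetical per-variant summary used only for this illustration."""
    call_rate: float        # fraction of samples with a genotype call
    hwe_p: float            # Hardy-Weinberg equilibrium P value
    freq_study: float       # allele frequency observed in the study
    freq_reference: float   # allele frequency in the reference population


def passes_qc(v, ancestry_specific):
    """Apply the call-rate, HWE and allele-frequency-deviation thresholds.
    The allowed deviation is 0.3 for ancestry-specific analyses, 0.6 otherwise."""
    max_deviation = 0.3 if ancestry_specific else 0.6
    if v.call_rate < 0.95:
        return False
    if v.hwe_p < 1e-7:
        return False
    if abs(v.freq_study - v.freq_reference) > max_deviation:
        return False
    return True


example = VariantQC(call_rate=0.99, hwe_p=0.2, freq_study=0.55, freq_reference=0.10)
print(passes_qc(example, ancestry_specific=False))  # deviation 0.45 is within 0.6
print(passes_qc(example, ancestry_specific=True))   # deviation 0.45 exceeds 0.3
```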
We conducted single-variant meta-analyses in a discovery sample of 458,927 individuals of different ancestries using both additive and recessive genetic models (Extended Data Fig. 1 and Supplementary Tables 1–4). Significance for single-variant analyses was defined at an array-wide level (P < 2 × 10−7, Bonferroni correction for 250,000 variants). The combined additive analyses identified 1,455 unique variants that reached array-wide significance (P < 2 × 10−7), including 578 non-synonymous and splice-site variants (Supplementary Tables 5–7). Under the additive model, we observed a high genomic inflation of the test statistics (for example, a λGC of 2.7 in European ancestry studies for common markers, Extended Data Fig. 2 and Supplementary Table 8), although validation results (see below) and additional sensitivity analyses (see below) suggested that it is consistent with polygenic inheritance as opposed to population stratification, cryptic relatedness, or technical artefacts (Extended Data Fig. 2). The majority of these 1,455 association signals (1,241; 85.3%) were found in the European ancestry meta-analysis (85.5% of the discovery sample size) (Extended Data Fig. 2). Nevertheless, we discovered eight associations within five loci in our all-ancestry analyses that are driven by African studies (including one missense variant in the growth hormone gene GH1 (rs151263636), Extended Data Fig. 3), three height variants found only in African studies, and one rare missense marker associated with height in South Asians only (Supplementary Table 7).
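As a quick check of the significance thresholds used in this study, the Bonferroni correction can be written in a few lines. The family-wise error rate of 0.05 is an assumption implied by the reported thresholds rather than stated explicitly, and the function name is illustrative.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold for n_tests at family-wise level alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


array_wide = bonferroni_threshold(250_000)      # single-variant tests
gene_based = bonferroni_threshold(25_000 * 4)   # 25,000 genes and four tests
validation = bonferroni_threshold(561)          # 561 lead variants in validation

print(f"{array_wide:.1e} {gene_based:.1e} {validation:.1e}")
```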
Genomic inflation and confounding
We observed a marked genomic inflation of the test statistics even after adequate control for population stratification (linear mixed model) arising mainly from common markers; λGC in European ancestry was 1.2 and 2.7 for all and common markers, respectively (Extended Data Fig. 2 and Supplementary Table 8). Such inflation is expected for a highly polygenic trait like height, and is consistent with our very large sample size3,33. To confirm this, we applied the recently developed linkage disequilibrium score regression method to our height ExomeChip results34, with the caveat that the method was developed (and tested) with >200,000 common markers available. We restricted our analyses to 15,848 common variants (MAF ≥ 5%) from the European-ancestry meta-analysis, and matched them to pre-computed linkage disequilibrium scores for the European reference dataset34. The intercept of the regression of the χ2 statistics from the height meta-analysis on the linkage disequilibrium score estimates the inflation in the mean χ2 that is due to confounding bias, such as cryptic relatedness or population stratification. The intercept was 1.4 (s.e.m. = 0.07), which is small when compared to the λGC of 2.7. Furthermore, we also confirmed that the linkage disequilibrium score regression intercept is biased upward because of the small number of variants on the ExomeChip and the selection criteria for these variants (that is, known GWAS hits). The ratio statistic of (intercept − 1)/(mean χ2 − 1) is 0.067 (s.e.m. = 0.012), well within the normal range34, suggesting that most of the inflation (~93%) observed in the height association statistics is due to polygenic effects (Extended Data Fig. 2).
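The two inflation summaries discussed above can be sketched with the standard library. The first function computes λGC as the median 1-degree-of-freedom χ2 statistic divided by its expected median under the null (about 0.455); the second computes the ratio statistic from a regression intercept and a mean χ2. Function names are illustrative, and the sketch does not perform the linkage disequilibrium score regression itself.

```python
from statistics import NormalDist, median

MEDIAN_CHI2_1DF = 0.4549364  # median of a chi-square distribution with 1 d.f.


def chi2_from_p(p):
    """1-d.f. chi-square statistic corresponding to a two-sided P value."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower tail avoids rounding 1 - p/2 to 1
    return z * z


def lambda_gc(p_values):
    """Genomic inflation factor: median observed chi-square over null median."""
    return median(chi2_from_p(p) for p in p_values) / MEDIAN_CHI2_1DF


def ldsc_ratio(intercept, mean_chi2):
    """(intercept - 1) / (mean chi-square - 1): share of inflation attributed
    to confounding rather than polygenic signal."""
    if mean_chi2 <= 1.0:
        raise ValueError("mean chi-square must exceed 1")
    return (intercept - 1.0) / (mean_chi2 - 1.0)


# With the reported ratio of 0.067, the polygenic share is 1 - 0.067.
polygenic_share = 1.0 - 0.067
print(f"{polygenic_share:.3f}")
```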
Furthermore, to exclude the possibility that some of the observed associations between height and rare and low-frequency variants could be due to allele calling problems in the smaller studies, we performed a sensitivity meta-analysis with primarily European ancestry studies totalling >5,000 participants. We found very concordant effect sizes, suggesting that smaller studies do not bias our results (Extended Data Fig. 2).
The RAREMETAL R package35 and the GCTA v1.24 (ref. 36) software were used to identify independent height association signals across the European descent meta-analysis results. RAREMETAL performs conditional analyses by using covariance matrices in order to distinguish true signals from those driven by linkage disequilibrium at adjacent known variants. First, we identified the lead variants (P < 2 × 10−7) based on a 1-Mb window centred on the most significantly associated variant and performed linkage disequilibrium pruning (r2 < 0.3) to avoid downstream problems in the conditional analyses due to co-linearity. We then conditioned on the linkage disequilibrium-pruned set of lead variants in RAREMETAL and kept new lead signals at P < 2 × 10−7. The process was repeated until no additional signal emerged below the pre-specified P-value threshold. The use of a 1-Mb window in RAREMETAL can obscure dependence between conditional signals in adjacent intervals in regions of extended linkage disequilibrium. To detect such instances, we performed joint analyses using GCTA with the ARIC and UK ExomeChip reference panels, both of which comprise >10,000 individuals of European descent. With the exception of a handful of variants in a few genomic regions with extended linkage disequilibrium (for example, the HLA region on chromosome 6), the two pieces of software identified the same independent signals (at P < 2 × 10−7).
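The first step of the procedure above, choosing lead variants within 1-Mb windows centred on the most significant variant, can be sketched as a greedy, distance-based selection. This is a simplified illustration with a hypothetical tuple layout and toy data; it omits the linkage disequilibrium pruning and the iterative conditioning, which require genotype covariance matrices and are performed by RAREMETAL and GCTA in the study.

```python
def select_lead_variants(variants, p_threshold=2e-7, window_bp=1_000_000):
    """Greedy lead-variant selection.

    variants: iterable of (variant_id, chromosome, position, p_value).
    Significant variants are visited from most to least significant; a variant
    becomes a lead only if no previously chosen lead on the same chromosome
    lies within half a window of it."""
    half = window_bp // 2
    candidates = sorted(
        (v for v in variants if v[3] < p_threshold), key=lambda v: v[3]
    )
    leads = []
    for vid, chrom, pos, p in candidates:
        if all(chrom != lc or abs(pos - lp) > half for _, lc, lp, _ in leads):
            leads.append((vid, chrom, pos, p))
    return leads


# Hypothetical toy input.
toy = [
    ("a", "1", 1_000_000, 1e-20),
    ("b", "1", 1_200_000, 1e-10),  # within 500 kb of "a", so not a lead
    ("c", "1", 2_000_000, 1e-9),
    ("d", "2", 1_000_000, 1e-8),
    ("e", "1", 5_000_000, 1e-3),   # not array-wide significant
]
lead_ids = [v[0] for v in select_lead_variants(toy)]
```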
To discover new height variants, we conditioned the height variants found in our ExomeChip study on the previously published GWAS height variants3 using the first release of the UK Biobank imputed dataset and regression methodology implemented in BOLT-LMM37. Because of the difference between the sample size of our discovery set (n = 458,927) and the UK Biobank (first release, n = 120,084), we applied a threshold of Pconditional < 0.05 to declare a height variant as independent in this analysis. We also explored an alternative approach based on approximate conditional analysis36. This latter method (SSimp) relies on summary statistics available from the same cohort, thus we first imputed summary statistics38 for exome variants, using summary statistics from a previous study3. Conversely, we imputed the top variants from that study3 using the summary statistics from the ExomeChip. Subsequently, we calculated effect sizes for each exome variant conditioned on the top variants of that study3 in two ways. First, we conditioned the imputed summary statistics of the exome variant on the summary statistics of the top variants that fell within 5 Mb of the target ExomeChip variant. Second, we conditioned the summary statistics of the ExomeChip variant on the imputed summary statistics of the hits of that study3. We then selected the option that yielded a higher imputation quality. For poorly tagged variants (imputation quality < 0.8), we simply used up-sampled HapMap summary statistics for the approximate conditional analysis. Pairwise SNP-by-SNP correlations were estimated from the UK10K data (TwinsUK39 and ALSPAC40 studies, n = 3,781).
Validation of the single-variant discovery results
Several studies, totalling 252,501 independent individuals of European ancestry, became available after the completion of the discovery analyses, and were thus used for validation of our experiment. We validated the single-variant association results in eight studies, totalling 59,804 participants, genotyped on the ExomeChip using RAREMETAL31. We sought additional evidence for association for the top signals in two independent studies in the UK (UK Biobank) and Iceland (deCODE), comprising 120,084 and 72,613 individuals, respectively. We used the same quality control and analytical methodology as described above. Genotyping and study descriptions are provided in Supplementary Tables 1–3. For the combined analysis, we used the inverse-variance-weighted fixed effects meta-analysis method using METAL41. Significant associations were defined as those with a combined meta-analysis (discovery and validation) Pcombined < 2 × 10−7.
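The inverse-variance-weighted fixed-effects combination named above follows a standard formula: each study's effect estimate is weighted by the reciprocal of its squared standard error. The sketch below is a generic illustration of that formula with hypothetical inputs; it is not the METAL implementation.

```python
from math import sqrt
from statistics import NormalDist


def ivw_fixed_effects(betas, ses):
    """Inverse-variance-weighted fixed-effects meta-analysis.

    betas, ses: per-study effect estimates and their standard errors.
    Returns (pooled_beta, pooled_se, two_sided_p)."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and equal in length")
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = sqrt(1.0 / total)
    z = beta / se
    p = 2.0 * NormalDist().cdf(-abs(z))
    return beta, se, p


# Hypothetical discovery and validation estimates for one variant.
pooled_beta, pooled_se, pooled_p = ivw_fixed_effects([0.1, 0.3], [0.1, 0.1])
```

With equal standard errors, as in this toy example, the pooled estimate is simply the average of the two effects, and the pooled standard error shrinks by a factor of √2.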
We considered 81 variants with suggestive association in the discovery analyses (2 × 10⁻⁷ < Pdiscovery ≤ 2 × 10⁻⁶). Of those 81 variants, 55 reached significance after combining discovery and replication results based on a Pcombined < 2 × 10⁻⁷ (Supplementary Table 9). Furthermore, recessive modelling confirmed seven new independent markers with a Pcombined < 2 × 10⁻⁷ (Supplementary Table 10). One of these recessive signals is due to a rare X-linked variant in the AR gene (rs137852591, MAF = 0.21%). Because of its frequency, we only tested hemizygous men (we did not identify homozygous women for the minor allele), so we cannot distinguish between a true recessive mode of inheritance and a sex-specific effect for this variant. To test the independence and integrate all height markers from the discovery and validation phase, we used conditional analyses and GCTA ‘joint’ modelling36 in the combined discovery and validation set. This resulted in the identification of 606 independent height variants, including 252 non-synonymous or splice-site variants (Supplementary Table 11). If we consider only the initial set of lead SNPs with P < 2 × 10⁻⁷, we identified 561 independent variants. Of these 561 variants (selected without the validation studies), 560 have concordant direction of effect between the discovery and validation studies, and 548 variants have a Pvalidation < 0.05 (466 variants with Pvalidation < 8.9 × 10⁻⁵, Bonferroni correction for 561 tests), suggesting a very low false discovery rate (Supplementary Table 11).
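The thresholds in this paragraph can be restated as a short sketch. The function names are hypothetical, and the handling of a P value exactly equal to 2 × 10⁻⁷ (which falls in neither stated range) is an assumption made here for completeness.

```python
SIGNIFICANT = 2e-7  # significance threshold from the text
SUGGESTIVE = 2e-6   # upper bound of the suggestive range

def classify_discovery_p(p):
    """Bin a discovery P value using the ranges given in the text.
    Assumption: p == 2e-7 is treated as suggestive."""
    if p < SIGNIFICANT:
        return "significant"
    if p <= SUGGESTIVE:
        return "suggestive"
    return "not followed up"

def bonferroni(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# 0.05 / 561 is about 8.9e-5, matching the validation threshold quoted above.
validation_threshold = bonferroni(0.05, 561)
```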
Gene-based association meta-analyses
For the gene-based analyses, we applied two different sets of criteria to select variants, based on coding variant annotation from five prediction algorithms (PolyPhen2 HumDiv and HumVar, LRT, MutationTaster and SIFT)42. The mask labelled ‘broad’ included variants with a MAF < 0.05 that are nonsense, stop-loss or splice-site, as well as missense variants that are annotated as damaging by at least one of the five algorithms. The mask labelled ‘strict’ included only variants with a MAF < 0.05 that are nonsense, stop-loss or splice-site, as well as missense variants annotated as damaging by all five algorithms. We used two tests for gene-based testing, namely the SKAT43 and VT44 tests. Statistical significance for gene-based tests was set at a Bonferroni-corrected threshold of P < 5 × 10⁻⁷ (threshold for 25,000 genes and four tests). The gene-based discovery results were validated (same test and variants, when possible) in the same eight studies genotyped on the ExomeChip (n = 59,804 participants) that were used for the validation of the single-variant results (see above, and Supplementary Tables 1–3). Gene-based conditional analyses were performed in RAREMETAL.
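The two masks can be expressed as a small filter, shown here as a sketch. The dictionary layout, the annotation keys and the example variants are hypothetical and do not reflect the annotation files used in the study.

```python
PREDICTORS = ("PolyPhen2_HumDiv", "PolyPhen2_HumVar", "LRT",
              "MutationTaster", "SIFT")
ALWAYS_INCLUDED = {"nonsense", "stop-loss", "splice-site"}

def in_mask(variant, mask):
    """Return True if the variant belongs to the 'broad' or 'strict' mask."""
    if mask not in ("broad", "strict"):
        raise ValueError("mask must be 'broad' or 'strict'")
    if variant["maf"] >= 0.05:
        return False
    if variant["consequence"] in ALWAYS_INCLUDED:
        return True
    if variant["consequence"] != "missense":
        return False
    # Count how many of the five algorithms call the missense variant damaging.
    n_damaging = sum(1 for name in PREDICTORS
                     if variant["damaging"].get(name, False))
    if mask == "broad":
        return n_damaging >= 1
    return n_damaging == len(PREDICTORS)

# Hypothetical variants for illustration.
missense_one = {"maf": 0.01, "consequence": "missense",
                "damaging": {"SIFT": True}}
missense_all = {"maf": 0.01, "consequence": "missense",
                "damaging": {name: True for name in PREDICTORS}}
nonsense = {"maf": 0.002, "consequence": "nonsense", "damaging": {}}

# Bonferroni threshold for 25,000 genes and four tests (two masks x two tests).
GENE_BASED_ALPHA = 0.05 / (25_000 * 4)
```

Every variant in the strict mask is also in the broad mask, so the strict mask trades sensitivity for a set that is more likely to be functionally damaging.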
We accessed ExomeChip data from GIANT (BMI, waist:hip ratio), GLGC (total cholesterol, triglycerides, HDL-cholesterol, LDL-cholesterol), IBPC (systolic and diastolic blood pressure), MAGIC (glycaemic traits), REPROGEN (age at menarche and menopause), and DIAGRAM (type 2 diabetes) consortia. For coronary artery disease, we accessed 1000 Genomes Project-imputed GWAS data released by CARDIoGRAMplusC4D45.
DEPICT (http://www.broadinstitute.org/mpg/depict/) is a computational framework that uses probabilistically defined reconstituted gene sets to perform gene set enrichment and gene prioritization15. For a description of gene set reconstitution, refer to refs 15, 46. In brief, reconstitution was performed by extending pre-defined gene sets (such as Gene Ontology terms, canonical pathways, protein–protein interaction subnetworks and rodent phenotypes) with genes co-regulated with genes in these pre-defined gene sets using large-scale microarray-based transcriptomics data. To adapt the gene set enrichment part of DEPICT for ExomeChip data (https://github.com/RebeccaFine/height-ec-depict), we made two principal changes. First, because DEPICT for GWAS incorporates all genes within a given linkage disequilibrium block around each index SNP, we modified DEPICT to take as input only the gene directly impacted by the coding SNP. Second, we adapted the way DEPICT adjusts for confounders (such as gene length) by generating null ExomeChip association results using Swedish ExomeChip data (Malmö Diet and Cancer (MDC), All New Diabetics in Scania (ANDIS), and Scania Diabetes Registry (SDR) cohorts, n = 11,899) and randomly assigning phenotypes from a normal distribution before conducting association analysis (see Supplementary Information). For the gene set enrichment analysis of the ExomeChip data, we used significant non-synonymous variants statistically independent of known GWAS hits (and that were present in the null ExomeChip data; see Supplementary Information for details). For gene set enrichment analysis of the GWAS data, we used all loci with a non-coding index SNP and that did not contain any of the novel ExomeChip genes. In visualizing the analysis, we used affinity propagation clustering47 to group the most similar reconstituted gene sets based on their gene memberships (see Supplementary Information).
Within a ‘meta-gene set’, the best P value of any member gene set was used as representative for comparison. DEPICT for ExomeChip was written using the Python programming language and the code can be found at https://github.com/RebeccaFine/height-ec-depict.
We also applied the PASCAL (http://www2.unil.ch/cbg/index.php?title=Pascal) pathway analysis tool16 to association summary statistics for all coding variants. In brief, the method derives gene-based scores (both SUM and MAX statistics) and subsequently tests for the over-representation of high gene scores in predefined biological pathways. We used standard pathway libraries from KEGG, REACTOME and BIOCARTA, and also added dichotomized (Z score > 3) reconstituted gene sets from DEPICT15. To accurately estimate SNP-by-SNP correlations even for rare variants, we used the UK10K data (TwinsUK39 and ALSPAC40 studies, n = 3,781). To separate the contribution of regulatory variants from the coding variants, we also applied PASCAL to association summary statistics of only regulatory variants (20 kb upstream, gene body excluded) from a previous study3. In this way, we could classify pathways driven principally by coding, regulatory or mixed signals.
STC2 functional experiments
For the generation of STC2 mutants (R44L and M86I), wild-type STC2 cDNA contained in pcDNA3.1/Myc-His(−) (Invitrogen)23 was used as a template. Mutagenesis was carried out using Quickchange (Stratagene), and all constructs were verified by sequence analysis. Recombinant wild-type STC2 and variants were expressed in human embryonic kidney (HEK) 293T cells (293tsA1609neo, ATCC CRL-3216) maintained in high-glucose DMEM supplemented with 10% fetal bovine serum, 2 mM glutamine, nonessential amino acids, and gentamicin. The cells are routinely tested for mycoplasma contamination. Cells (6 × 10⁶) were plated onto 10-cm dishes and transfected 18 h later by calcium phosphate co-precipitation using 10 μg plasmid DNA. Medium was collected 48 h after transfection, cleared by centrifugation, and stored at −20 °C until use. Protein concentrations (58–66 nM) were determined by TRIFMA using antibodies described previously23. PAPP-A was expressed stably in HEK293T cells as previously reported48. Expressed levels of PAPP-A (27.5 nM) were determined by a commercial ELISA (AL-101, Ansh Labs).
Culture supernatants containing wild-type STC2 or variants were adjusted to 58 nM, mixed with an equal volume of culture supernatant containing PAPP-A (corresponding to a 2.1-fold molar excess of STC2), and incubated at 37 °C. Samples were taken at 1, 2, 4, 6, 8, 16, and 24 h and stored at −20 °C.
Specific proteolytic cleavage of ¹²⁵I-labeled IGFBP-4 is described in detail elsewhere49. In brief, the PAPP-A–STC2 complex mixtures were diluted (1:190) to a concentration of 72.5 pM PAPP-A and mixed with pre-incubated ¹²⁵I-IGFBP-4 (10 nM) and IGF-1 (100 nM) in 50 mM Tris-HCl, 100 mM NaCl, 1 mM CaCl₂. Following 1 h incubation at 37 °C, reactions were terminated by the addition of SDS–PAGE sample buffer supplemented with 25 mM EDTA. Substrate and co-migrating cleavage products were separated by 12% non-reducing SDS–PAGE and visualized by autoradiography using a storage phosphor screen (GE Healthcare) and a Typhoon imaging system (GE Healthcare). Band intensities were quantified using ImageQuant TL 8.1 software (GE Healthcare).
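The concentrations quoted for the complex formation and cleavage assays can be cross-checked with simple arithmetic. The sketch below assumes that mixing equal volumes halves each concentration and that '1:190' denotes a 190-fold dilution; both are interpretations made here, and the variable names are hypothetical.

```python
def mix_equal_volumes(conc_a, conc_b):
    """Concentrations after combining equal volumes of two solutions."""
    return conc_a / 2.0, conc_b / 2.0

# Stock concentrations stated in the text (nM).
STC2_STOCK_NM = 58.0
PAPPA_STOCK_NM = 27.5

stc2_nm, pappa_nm = mix_equal_volumes(STC2_STOCK_NM, PAPPA_STOCK_NM)

# Molar ratio of STC2 to PAPP-A in the mixture (about 2.1).
molar_excess = stc2_nm / pappa_nm

# PAPP-A concentration after the assumed 190-fold dilution, in pM.
DILUTION_FACTOR = 190.0
pappa_pm = pappa_nm * 1000.0 / DILUTION_FACTOR
```

Under these assumptions the ratio is about 2.1 and the diluted PAPP-A concentration is about 72 pM, in line with the 2.1-fold excess and the approximately 72.5 pM stated in the text.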
STC2 and covalent complexes between STC2 and PAPP-A were blotted onto PVDF membranes (Millipore) following separation by 3–8% SDS–PAGE. The membranes were blocked with 2% Tween-20, and equilibrated in 50 mM Tris-HCl, 500 mM NaCl, 0.1% Tween-20; pH 9 (TST). For STC2, the membranes were incubated with goat polyclonal anti-STC2 (R&D Systems, AF2830) at 0.5 μg ml⁻¹ in TST supplemented with 2% skimmed milk for 1 h at 20 °C. For PAPP-A–STC2 complexes, the membranes were incubated with rabbit polyclonal anti-PAPP-A50 at 0.63 μg ml⁻¹ in TST supplemented with 2% skimmed milk for 16 h at 20 °C. Membranes were washed with TST and subsequently incubated with polyclonal rabbit anti-goat IgG–horseradish peroxidase (DAKO, P0449) or polyclonal swine anti-rabbit IgG–horseradish peroxidase (DAKO, P0217), respectively, diluted 1:2,000 in TST supplemented with 2% skimmed milk for 1 h at 20 °C. Following washing with TST, membranes were developed using enhanced chemiluminescence (ECL Prime, GE Healthcare). Images were captured using an ImageQuant LAS 4000 instrument (GE Healthcare).
Summary genetic association results are available on the GIANT website (http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium).
A full list of acknowledgments appears in the Supplementary Information. Part of this work was conducted using the UK Biobank resource.
This file contains Supplementary Tables 1-24.