Sixteen new lung function signals identified through 1000 Genomes Project reference panel imputation

Lung function measures are used in the diagnosis of chronic obstructive pulmonary disease. In 38,199 European ancestry individuals, we studied genome-wide association of forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC) and FEV1/FVC with 1000 Genomes Project (phase 1)-imputed genotypes and followed up top associations in 54,550 Europeans. We identify 14 novel loci (P<5 × 10−8) in or near ENSA, RNU5F-1, KCNS3, AK097794, ASTN2, LHX3, CCDC91, TBX3, TRIP11, RIN3, TEKT5, LTBP4, MN1 and AP1S2, and two novel signals at known loci NPNT and GPR126, providing a basis for new understanding of the genetic determinants of these traits and pulmonary diseases in which they are altered.

L ung function, as measured by spirometry, predicts morbidity and mortality 1,2 . Altered lung function is a key criterion for the diagnosis of chronic obstructive pulmonary disease (COPD), a leading cause of death worldwide 3,4 . The ratio of forced expiratory volume in 1 s (FEV 1 ) over forced vital capacity (FVC) defines patients with airflow obstruction, while FEV 1 is used to assess the severity of the obstruction. Reduced FVC values are seen in restrictive lung diseases such as pulmonary fibrosis 5 . While environmental risk factors, such as tobacco smoking or air pollution, play a significant role in determining lung function 6,7 , genetic factors are also important contributors, with estimates of heritability ranging between 39 and 54% (refs 8,9).
Genome-wide association studies (GWAS) of around 2.5 million common (minor allele frequency (MAF)45%) singlenucleotide polymorphisms (SNPs) in Europeans have identified 32 loci associated with lung function at genome-wide significance level (Po5 Â 10 À 8 ) [10][11][12][13][14] . However, as for other complex traits 15,16 , these loci only explain a limited proportion of the heritability 11,13 . Among explanations for the 'missing heritability' are a large number of, as yet, undetected common variants with modest effect sizes, in addition to low-frequency (1%oMAFr5%) and rare (MAFr1%) variants with larger effect sizes 16,17 . Of particular relevance to low-frequency variants, phase 1 of the 1000 Genomes Project 18 sequenced 1,092 individuals from 14 populations, providing an imputation reference panel of B38 million SNPs and 1.4 million indels, including autosomal and X chromosome variants.
The aim of the current study, undertaken within the SpiroMeta consortium, was to improve coverage of low-frequency variants and detect novel loci associated with lung function by undertaking imputation of GWAS data to the 1000 Genomes Project 18 Phase-1 reference panel in 38,199 individuals of European ancestry. We meta-analysed GWAS results across 17 studies and followed up the most significant associations with in silico data in up to 54,550 Europeans. We identify 14 new loci associated with lung function at genome-wide significance level, and novel distinct signals at two previously reported loci. These include two low-frequency variant association signals, which seem to be explained by non-synonymous SNPs. The results of these analyses implicate both previously considered and novel mechanisms influencing lung function.

Results
We undertook a meta-analysis of 17 GWAS imputed using the 1000 Genomes Project 18 Phase-1 reference panel in a study of 38,199 individuals of European ancestry in stage 1 (Fig. 1), of which 19,532 were individuals not included in the discovery stage of previous meta-analyses of lung GWAS [10][11][12][13] . Characteristics of cohort participants, genotyping and imputation are shown in Supplementary Table 1. Each study adjusted FEV 1 , FEV 1 /FVC and FVC, for age, age 2 , sex, height and principal components for population structure, separately for never and ever smokers. Fourteen studies additionally undertook analyses for X chromosome variants (33,009 individuals, Supplementary Fig. 1 and Methods). Inverse normally transformed residuals were then used for association testing within each smoking stratum, assuming an additive genetic effect. Within each study, we combined smoking strata association summary statistics using inverse variance-weighted fixed-effects meta-analysis, and applied genomic control 19 to account for residual population structure not accounted for by principal components. We subsequently combined study-specific estimates across studies using inverse variance weighing, and applied genomic control 19 after fixedeffects meta-analysis. The genomic inflation factor across autosomal variants was 1.03 for each of the three traits, and across X chromosome variants was 1.04 for FEV 1 and 1.00 for FEV 1 /FVC and FVC. Quantile-quantile plots are presented in Supplementary Fig. 2a. Variants with effective sample sizes (N effective, product of sample size and imputation quality summed across studies) o70% were filtered out, and a total of 8,694,268 variants were included in this genome-wide study.
Forty-eight SNPs and seven indels in independent autosomal chromosome regions (±500 kb either side of sentinel variant) with stage 1 Po5 Â 10 À 6 were followed up in stage 2 using in silico data from four studies comprising 54,550 individuals ( Fig. 1; Supplementary Table 2). One SNP on the X chromosome also met these criteria and was followed up in a subset of three studies comprising 52,359 individuals (Supplementary Fig. 1; Supplementary Table 2). Characteristics of follow-up (stage 2) cohort participants, genotyping and imputation are shown in Supplementary Table 1. Stage-2 studies adjusted the traits for age, age 2 , sex, height and principal components to account for population structure and ever-smoking status, and also undertook association testing on the inverse normally transformed residuals assuming additive genetic effects. Stage-2 estimates were combined across studies, and then with stage-1 estimates, using inverse variance-weighted fixed-effects meta-analysis. Thirteen SNPs and three indels, each representing new signals of association, met a genome-wide significance threshold corrected for multiple testing (Po5 Â 10 À 8 ) after combining stage-1 and stage-2 results (Table 1; Fig. 2), of which 10 SNPs and three indels achieved independent replication meeting a Bonferroni-corrected threshold for 56 tests (Po8.93 Â 10 À 4 ) in stage 2 alone.
Sixteen novel association signals for FEV 1 , FEV 1 /FVC and FVC. Of the 16 novel signals reaching genome-wide significance, two represent distinct new signals for FEV 1 /FVC in previously reported loci 10,12 (stage-1 P value conditioned on previously reported variant o5 Â 10 À 6 ). Among the remaining 14, five new loci were identified for FEV 1 , six new loci for FEV 1 /FVC and three new loci for FVC (Table 1). The sentinel variants at the 16 loci were in or near the following genes: Supplementary Fig. 2b,c). To gain further insight into the associated variants, we assessed whether the novel sentinel variants, or their proxies, were associated with gene expression in lung tissues 20  The two novel signals in known loci were the strongest (Po5 Â 10 À 23 ) association signals after meta-analysing stage 1 and 2. The strongest signal was for a low-frequency SNP near GPR126 (rs148274477, MAF ¼ 2.4%, intergenic on chromosome 6) associated with FEV 1 /FVC (P ¼ 9.6 Â 10 À 26 , Table 1) and in high linkage disequilibrium (LD, r 2 ¼ 0.85) with a missense variant (rs17280293 (Ser123Gly), Supplementary Table 3f) in GPR126, but distinct from the previously reported signal for FEV 1 /FVC in this region 10,13 (stage-1 P value for rs148274477 conditioning on rs3817928 (ref. 10) and rs262129 (ref. 13) ¼ 1.86 Â 10 À 7 , unconditional stage-1 P ¼ 2.68 Â 10 À 9 ). GPR126 encodes a G-protein-coupled receptor and is expressed in adult and fetal lung tissue 24,25 (Supplementary Table 3d).
Other studies have shown that GPR126 is required for mice embryonic viability and cardiovascular development 26 , and that GPR126 is expressed in adult mice lung 27 . More recently, GPR126 has been shown to bind type-IV collagen, a major collagen in the lung, leading to cAMP signalling 28 .
The second strongest signal (P ¼ 1.5 Â 10 À 23 , Table 1) was an intronic SNP (rs6856422) in NPNT on chromosome 4 associated with FEV 1 /FVC, distinct from the previously discovered signal for Chr., chromosome; dist., distance; eQTL, expression quantitative trait loci; FDR, false discovery rate; FEV 1 , forced expiratory volume in 1 s; freq., frequency; FVC, forced vital capacity. Results are shown for each sentinel variant associated (Po5 Â 10 À 8 ) with FEV 1 , FEV 1 /FVC or FVC in a joint analysis of up to 92,744 individuals of European ancestry, ordered by chromosome and position. The allele frequency presented in the table corresponds to the coded allele (Coded allele freq.), which was chosen as the reference allele in the 1000 Genomes Project reference panel, and in most instances (but not always) is the major allele. Two-sided P values are given for stage 1, stage 2 and the meta-analysis of both stages. Variants reaching independent replication in stage 2 (P ¼ 0.05/ 56 ¼ 8.93 Â 10 À 4 ) are indicated with their stage-2 P value in bold. The sample sizes (N) shown are the effective sample sizes. The effective sample size within each study is the product of sample size and the imputation quality metric. Beta values reflect effect-size estimates on an inverse-normal transformed scale after adjustments for age, age 2 , sex, height and ancestry principal components, and stratified by ever-smoking status. 'eQTL gene' presents a gene expressed in the lung, whose expression is associated with the highest ranking proxy for the sentinel variant (ranking on r 2 with sentinel first, all with r 2 40.8, and on eQTL P-value second, all with Po5 Â 10 À 5 ) with FDR o10%.
FEV 1 in this region 10,12,13 . The stage-1 P value for this variant conditioned on the previously reported sentinel SNPs (rs17036341 (ref. 10) and rs10516526 (refs 12,13)) and on the sentinel SNP for FEV 1 in this analysis (rs12374256, INTS12 intron) was 4.7 Â 10 À 6 (unconditional stage-1 P ¼ 1.30 Â 10 À 7 ). Proxies of the sentinel SNP were associated with expression of INTS12 and GSTCD in blood (Supplementary Table 3a). INTS12, GSTCD and NPNT are contiguously positioned at 4q24, and are all expressed in adult and fetal lung tissues (Supplementary Table 3c,d). Our previous work characterizing GSTCD and INTS12 showed that they are oppositely transcribed genes that are to some extent co-ordinately regulated, although while GSTCD expression in human lung tissue is ubiquitous, INTS12 expression was predominantly in the nucleus of epithelial cells and pneumocytes 29 . Among the 14 novel loci, six novel loci were associated with FEV 1 /FVC. One of them was a low-frequency variant (rs113473882, intronic in LTBP4 on chromosome 19, MAF ¼ 1.5%, Table 1) in almost complete LD (r 2 ¼ 0.99) with a missense variant (rs34093919, Asp752Asn, Supplementary Table 3f) in LTBP4, which encodes a protein that binds transforming growth factor beta (TGFb) as it is secreted and targeted to the extracellular matrix. Mice deficient in ltbp4 displayed defects in lung septation and elastogenesis, which may be TGFb2 and fibulin-5 dependent 30 , and disruption of this gene in mice led to abnormal lung development, cardiomyopathy and colorectal cancer 31 . Variants near LTBP4, uncorrelated (r 2 o0.05) with the sentinel SNP we report here, have been associated with COPD 32 and smoking behaviour 33 . A further novel FEV 1 /FVC locus mapping near AP1S2 is the first to be reported for lung function on the X chromosome; sentinel SNP (rs7050036, intergenic) proxies were associated with the expression of AP1S2 and ZRSR2 in lung tissue (Supplementary Table 3a). Other new loci for FEV 1 /FVC were in or near KCNS3 (2p24.2), ASTN2 (9q33.1), RNU5F-1 (1q41) and TEKT5 (16p13.13).
The strongest signal for FEV 1 in a novel locus was upstream of TBX3 on chromosome 12 (Table 1); TBX3 is involved in the TGFb1 signalling pathway 34 . At a second novel locus for FEV 1 (rs7155279, TRIP11 intron on chromosome 14, Table 1), proxies of the sentinel variant were associated with lung and blood expression of TRIP11. TRIP11 encodes a protein associated with the Golgi apparatus 35 . In the lung, rs7155279 showed strongest association with expression of ATXN3 (Supplementary Table 3a), which encodes ataxin 3, a deubiquitinating enzyme. Expanded trinucleotide repeats in ATXN3 cause spinocerebellar ataxia-3 (ref. 36). In blood, a proxy (r 2 ¼ 0.94) for rs7155279 showed strong association (P ¼ 3 Â 10 À 34 , Supplementary   (Table 1), which was B632 kb from the TRIP11 sentinel SNP (rs7155279) and independent from it (r 2 ¼ 8.84 Â 10 À 5 ). Although this is the first report of association of a RIN3 variant with lung function, a correlated variant (rs754388, r 2 ¼ 0.99) was recently associated with moderate to severe COPD, although the association did not replicate in an independent study 38 . In a fourth novel region for FEV 1 , on chromosome 1, a sentinel SNP, rs6681426, B8 kb downstream of ENSA (Table 1) and a second signal B700 kb apart (rs4926386, Supplementary Table 2a) were both associated with ARNT expression in lung (Supplementary Table 3a). ARNT is differentially expressed during fetal lung development (Supplementary Table 3d) and acts as a cofactor for transcriptional regulation by hypoxia-inducible factor 1 during lung development 39 and may regulate cytokine responses 40 . SNP rs6681426 was also associated with the expression of LASS2 (also known as CERS2) in lung tissue (Supplementary Table 3a); lass2 knock-out mice develop lung inflammation and airway obstruction 41 . The other new locus for FEV 1 was near MN1 (22q12.1). All three novel loci for FVC had sentinel variants or close proxies associated with expression of a nearby gene in lung, implicating CCDC91, MLF1 and QSOX2, located on chromosomes 12, 3 and 9, respectively. The putative function of the key genes in each of the two known and 14 novel loci for FEV 1 , FEV 1 / FVC and FVC are summarized in Supplementary Table 4.
Functional characterization of novel signals. The protein products of genes nearest to the sentinel variant of novel signals for lung function were expressed in bronchial epithelial cells, pneumocytes or lung macrophages (Supplementary Table 3c). Among the 16 novel signals of association with lung function, sentinel variants or close proxies were cis expression quantitative trait loci (eQTLs) in lung for ARNT, MLF1, QSOX2, CCDC91 and ATXN3 (Table 1; Supplementary Table 3a), and in eight loci the sentinel variant or at least one strong proxy (r 2 40.8) was in a DNase hypersensitivity site in a cell type potentially relevant to lung function (in or near ENSA, RNU5F-1, ASTN2, CCDC91, TBX3, RIN3, TEKT5 and MN1, Supplementary Table 3b). The sentinel variant association was explained (conditional P40.01) by a missense variant in each of the two novel signals in which we detected a low-frequency sentinel variant (near GPR126 and in LTBP4), and was explained in four of the remaining novel signals by a putatively functional variant (in or near ENSA, AK097794, TEKT5 and MN1, Supplementary Table 3f and Methods). Genes in four of the novel loci showed differential expression across the pseudoglandular and canalicular stages of fetal lung development, particularly EMP2 (Supplementary Table 3d). MLF1 and ATXN3 showed differences in expression levels in bronchial brushings between COPD cases and controls (Supplementary Table 3e). We detected novel splice isoforms of 420% abundance for GFM1, TRIM32, LTBP4 and MN1 in human bronchial epithelial cells (Supplementary Fig. 3; Supplementary Methods).

Association in children.
To assess whether the 16 new sentinel variants associated with lung, function in adults may act through an effect on lung development, we assessed their association in the ALSPAC study 42 that includes 5,062 children (Supplementary  Table 5a). Eleven of the 16 sentinel variants showed consistent directions of effect in adults and children. The association with FVC of variant rs6441207 on chromosome 3 in the noncoding RNA AK097794 exceeded a Bonferroni-corrected threshold for 16 tests (Supplementary Table 5a).
Association with smoking and gene by smoking interaction. The 16 new variants had consistent effect sizes in never smokers and ever smokers, and no gene-smoking interaction (P40.05) in stage 1 (Supplementary Table 5b). We found no evidence that any of these signals were driven by smoking behaviour. Only the twobase-pair insertion on chromosome 1 (rs201204531) revealed an association (P ¼ 1.5 Â 10 À 3 ) with smoking behaviour (heavyversus never-smoking status) that met a Bonferroni-corrected threshold for 16 tests (Supplementary Table 5c). However, this variant also showed an association with FEV 1 /FVC in never smokers, and the allele associated with higher likelihood of being a smoker was associated with increased FEV 1 /FVC (Supplementary Table 5b,c).
Associations with other traits. We queried the GWAS catalog 43 for variants in 2-Mb regions centred on the sentinel variant for the 16 loci (Supplementary Table 5d). Five loci contained variants associated with height [44][45][46] (Supplementary Table 5d). In the GPR126 and LHX3 loci, the previously reported height variants were not correlated (r 2 o0.2) with the lung function variants reported here. In the AK097794, CCDC91 and TRIP11 loci, the variants associated with height were correlated (r 2 40.3) with the lung function sentinel variants, but the alleles associated with reduced height were associated with increased FEV 1 or FVC. Associations with other traits have been reported for variants in LD (r 2 40.3) with sentinel variants in regions of RIN3 (Paget's disease 47 and bone mineral density 48 ), ENSA (body fat mass 49 and melanoma 50 ) and LHX3 (thyroid hormone levels 51 ). None of the novel signals relate to known asthma loci, and the association findings were consistent after removing individuals with asthma ( Supplementary Fig. 4).
Genetic architecture of lung function traits. The proportion of the additive polygenic variance explained by the 49 signals discovered to date (Supplementary Table 6), including new and previously reported signals 10-14 is 4.0% for FEV 1 , 5.4% for FEV 1 / FVC and 3.20% for FVC (Supplementary Table 7). These estimates are likely upper bounds on the proportion of the variance explained due to the winner's curse bias. Across the 49 signals, we observed larger effect sizes for associations with lowerfrequency variants (Fig. 3), supporting the hypothesis that lowerfrequency variants will contribute to explaining the missing heritability 16 .
We examined the increase in coverage of low-frequency and common variants by the 1000 Genomes Project reference panel, compared with the HapMap imputation reference panel, at both the novel and previously reported loci (Supplementary Fig. 5a). The two association signals where the 1000 Genomes sentinel variants had low MAF (o5%), were not present when restricting the results only to variants that could be imputed using the HapMap imputation panel (rs113473882 and rs148274477 in Supplementary Fig. 5a).
For each of the 32 previously discovered regions 10-14 , we identified the most strongly associated variant present on the 1000 Genomes Project 18 reference panel and the most strongly associated variant present on the HapMap reference panel using stage-1 results, and compared the stage-1 MAFs between these two groups of variants. The 1000 Genomes sentinel variants in or near GPR126 (rs148274477), TGFB2 (rs147187942) and MMP15 (rs150232756) had MAFs that were more than twofold lower than the HapMap sentinel variant MAFs (Supplementary Fig. 5b) and were statistically independent (r 2 r0.06) from the previously discovered HapMap-imputed sentinel variants 13 . The GPR126 1000 Genomes-imputed sentinel was described above as one of the 16 new signals. We tested the association of the 1000 Genomes-imputed sentinel variants near TGFB2 and MMP15 in UK BiLEVE (Supplementary Table 8), and found supportive evidence of association for the signal near TGFB2 (rs147187942, MAF ¼ 9%, P ¼ 5.7 Â 10 À 3 ).

Discussion
In this study, we aimed to improve coverage of low-frequency variants and detect novel loci associated with lung function, by undertaking imputation of GWAS data in 17 studies and 38,199 individuals to the 1000 Genomes Project 18 reference panel, and by following up the most significant signals in an additional 54,550 individuals. Overall, 16 new association signals attained a genome-wide significance threshold corrected for multiple testing (Po5 Â 10 À 8 ) after meta-analysing stage 1 and stage 2, including 15 autosomal and one X chromosome signal. While two of the new findings relate to novel signals for FEV 1 /FVC in previously reported regions 10,12 , five new loci were identified for FEV 1 , six new loci for FEV 1 /FVC and three new loci for FVC. Including the 16 signals discovered in these analyses, the number of lung function signals discovered to date is 49 (refs 10-14), and they jointly explain a modest proportion of the additive polygenic variance (4.0% for FEV 1 , 5.4% for FEV 1 /FVC and 3.2% for FVC).
Some of the 49 distinct lung function signals 10-14 seem to cluster close to each other. If we define regions as 500 kb either side of the sentinel variants, there are three regions that each include two distinct signals (in or near INTS12-GSTCD-NPNT, GPR126 and PTCH1 (refs 10,12)), so that the 49 signals would map to 46 loci. If we use a wider definition of region (1,000 kb either side of the sentinel), there are four regions that each include two distinct signals (in or near INTS12-GSTCD-NPNT, GPR126, PTCH1 (refs 10,12) and TRIP11-RIN3). In addition, the human leukocyte antigen region on chromosome 6 includes three distinct signals (in or near ZKSCAN3-NCR3-AGER 10,12,13 ) within 3.8 Mb. Furthermore, we have shown evidence of an additional signal in the TGFB2 region, and the new lung function signal in LTBP4 lies 179 kb away from a known COPD signal 32 . These findings are consistent with reports from very large studies of height and lipids 53,54 , which report multiple signals in associated regions, and highlight the importance of taking into account LD between variants to improve our understanding of known regions. Multiple signals within known regions are likely to explain some of the hidden heritability of these traits.
To identify pathways relevant to lung function, we undertook additional analyses using MAGENTA, which have implicated pathways for platelet-derived growth factor signalling and chromatin-packaging and -remodelling. Independent analyses undertaken in a concurrent study by the UK BiLEVE consortium, which focused on the extremes of the lung function distribution 55 , highlight the histone subset of the chromatin-packaging and -remodelling pathway. The TGFb signalling pathway has now been implicated by three independent loci: an FEV 1 /FVC signal explained by a missense variant in LTBP4, which encodes a protein that binds TGFb; an FEV 1 signal upstream of TBX3, which is involved in the TGFb1 signalling pathway 34 ; and a previously reported signal downstream of TGFB2 (ref. 17). In addition, a pathway involving fibulin-5 has been implicated by two of the novel loci (LTBP4 and TRIP11). The identification, through different approaches, of pathways which appear to be involved in determining lung function should help focus future functional studies.
In agreement with previous findings for other lung function loci 12,13 , none of the 16 new associations seem to be driven by either smoking behaviour or by a gene-smoking interaction. One variant showed association with smoking behaviour that met a Bonferroni correction for 16 tests in UK BiLEVE. This variant also had an effect in never smokers in stage 1, and the allele associated with increased lung function was also associated with increased risk of smoking, which does not suggest an association with lung function mediated by smoking behaviour. Variants in five out of the 16 loci associated with lung function in this study have also shown associations with height [44][45][46] . However, the variants associated with height were either independent of those associated with lung function, or if they were correlated, the alleles associated with increased height, were associated with decreased FEV 1 or FVC. If the association with lung function was driven by an effect on height, we would expect consistent direction of effect between these two traits. Therefore, the associations identified for lung function in these regions are not likely to be driven by associations with height.
This study had a large follow-up stage, which included 54,550 individuals, of which 48,943 were contributed by the UK BiLEVE study. UK BiLEVE is a particularly powerful study since it has sampled UK Biobank individuals from the extremes of the lung function distribution, and it has spirometry performed in a uniform way across individuals. Had these data been available when we undertook the discovery stage of this study, their addition would have greatly improved the discovery power. Nevertheless, incorporating these data into the follow-up stage improved power to provide replication and deal with potential winners' curse bias. Another strength of the current study design was the increased coverage of common and low-frequency variants obtained through the imputation to 1000 Genomes Project 18 reference panel. This enabled us to detect two low-allelefrequency variants (with MAF of 1.5 and 2.4% and stage-1 effect sizes of 0.17 and 0.16 s.d. units, respectively) that have an effect on lung function. No associations with lower allele frequency variants have been detected in this study, despite having power 480% in discovery to detect associations (Po5 Â 10 À 6 ) for variants with MAF of 0.5 and 1%, and effect sizes above 0.3 and 0.2 s.d. units, respectively. The poorer imputation quality for low-allelefrequency variants coupled with the strict criteria we used to select variants for follow-up (N effective Z70%) have probably affected our ability to detect rare variants. For instance, a variant representing an additional signal for FEV 1 /FVC in the GSTCD-INTS12-NPNT region, reported by the UK BiLEVE study, where it was directly genotyped 55 , would have been detected in this analysis had we used a more lenient threshold (N effective 460%). Imputation quality for rare variants will improve as larger imputation reference panels become available. In summary, 16 new association signals for lung function have been identified in this study, including two signals explained by non-synonymous low-frequency variants. These findings highlight new loci not previously connected with lung function or COPD, and bring new insights into previously detected loci. This study also highlights the added value of imputing to new reference panels as they become available. Understanding the molecular pathways that connect the newly identified loci with lung function and COPD risk has the potential to point to new targets for therapeutic intervention.

Methods
Study design. The study consisted of two stages. Stage 1 was a meta-analysis of 17 GWAS in a total of 38,199 individuals of European ancestry. Supplementary Table 1 gives the details of these studies. Fifty-six variants selected according to the results in stage 1 were followed up in stage 2 in 54,550 European individuals.  Supplementary Table 1a for the definitions of all abbreviations). All participants provided written informed consent and studies were approved by local Research Ethics Committees and/or Institutional Review Boards. Measurements of spirometry for each study are described in the Supplementary Note. The genotyping platforms and quality-control criteria implemented by each study are described in Supplementary Table 1b. Imputation. Imputation to the all ancestries 1000 Genomes Project 18 Phase-1 reference panel released in March 2012 was undertaken using MACH 61 and minimac 62 or IMPUTE2 (ref. 63) with pre-imputation filters and parameters as shown in Supplementary Table 1b. Specific software guidelines were used to impute the non-pseudoautosomal part of the X chromosome. The pseudoautosomal part of the X chromosome was not included in these analyses. Variants were excluded if the imputation information, assessed using r2.hat (MACH and minimac) or .info (IMPUTE2), was o0.3.
Data transformation and association testing in stage 1. Linear regression of age, age 2 , sex, height and principal components for population structure was undertaken on FEV 1 , FEV 1 /FVC and FVC separately for ever smokers and never smokers. The residuals were transformed to ranks and then transformed to normally distributed z-scores. These transformed residuals were then used as the phenotype for association testing under an additive genetic model, separately for ever smokers and never smokers. For X chromosome analyses, residuals for males and females were analysed separately and dosages for males were coded 0 for 0 copies of the coded allele and 2 for 1 copy of the coded allele. The software used was specified in Supplementary Table 1b. Studies with related individuals analysed ever smokers and never smokers together adjusting the regression for ever-smoking status and used appropriate tests for association in related individuals, as described in the Supplementary Note.
Meta-analysis of stage-1 data. Quality-control checks on the stage-1 data were undertaken using GWAtoolbox 64 and R version 3.0.2 (see URLs). All meta-analysis steps were undertaken using inverse variance-weighted fixed-effects meta-analysis. Effect estimates were flipped across studies so that the coded allele was the reference allele in the 1000 Genomes Project 18 reference panel. For each study with unrelated individuals, autosomal chromosomes results were meta-analysed between ever smokers and never smokers. After that, all study-specific standard errors were corrected using genomic control 19 . Study-specific genomic inflation factor estimates are shown in Supplementary Table 1a. Finally, effect-size estimates and s.e. were combined across studies, and genomic control 19 was applied again at the metaanalysis level. For the X chromosome, studies of unrelated individuals meta-analysed smoking strata estimates within sex strata and then meta-analysed pooled sex strata estimates. After that, genomic control 19 was applied to each study and results were meta-analysed across studies. Genomic control 19 was applied again after the metaanalysis. To describe the effect of imperfect imputation on power, for each variant we report the effective sample size (N effective), which is the sum of the study-specific products of the sample size and the imputation quality metric. Meta-analysis statistics and figures were produced using R version 3.0.2 (see URLs).
Selection of variants for stage 2. Variants with N effective o70% were filtered out before selecting variants for follow-up (8,916,621 variants remained after filtering). Independent regions (±500 kb from the sentinel variant) were selected for FEV 1 , FEV 1 /FVC and FVC if the sentinel SNP or indel had Po5 Â 10 À 6 . If the same variant was selected for different traits, it was followed up for all the traits. If two different variants were selected for different traits within the same region, or if any of the regions selected had already been identified in previous GWAS 10-14 but the sentinel variant was different from that previously reported, conditional analyses were undertaken to assess whether the signals within the same regions were distinct. If previously reported sentinel SNPs for a region were strongly correlated (r 2 40.9), we only conditioned on the SNP that had shown the strongest association. If two variants were selected for different traits within the same new region, both variants were taken forward if their P-value conditioning on the other variant was o5 Â 10 À 6 ; if not, only the variant with the most significant P value was taken forward. Variants within known regions were only taken forward if their P value conditioned on the previously reported variant was o5 Â 10 À 6 . Conditional analyses were undertaken using GCTA 65 , and B58C data were used to estimate LD. In total, 56 variants (49 SNPs and seven indels) were taken forward for follow-up, two of which were distinct signals within previously reported regions 10,12,13 . These variants are listed in Supplementary Table 2. Previously reported signals [10][11][12][13][14] were not followed up.
Stage-2 samples. The 48 SNPs and seven indels on autosomal chromosomes were followed up in up to 54,550 individuals from four studies with in silico data: ECRHS, PIVUS, TwinsUK and UK BiLEVE (see Supplementary Table 1a for the definitions of all abbreviations). All participants provided written informed consent and studies were approved by local Research Ethics Committees and/or Institutional Review Boards. One SNP in the chromosome X was followed up in 52,359 individuals from PIVUS, TwinsUK and UK BiLEVE. Measurements of spirometry for each study are described in the Supplementary Note.
Meta-analysis of stage-2 data. All stage-2 studies undertook linear regression of age, age 2 , sex, height, ever-smoking status and principal components for population structure, if available, on FEV 1 , FEV 1 /FVC and FVC, then the residuals were transformed to ranks and to normally distributed Z-scores. These transformed residuals were then used as the phenotype for association testing under an additive genetic model. For the X chromosome analyses, allele dosages for hemizygous males were coded as 2. Effect sizes were flipped to be consistent with the stage-1 estimates, using the reference allele in 1000 Genomes Project 18 as the coded allele. Genomic control 19 was applied for studies that undertook the analysis genomewide. Effect estimates and s.e. were combined across the stage-2 studies using an inverse variance-weighted meta-analysis.
Combination of stage 1 and 2 and multiple testing correction. A meta-analysis of stage-1 and stage-2 results was undertaken using inverse variance-weighted meta-analysis. We take into account the multiple tests undertaken by describing an association as genome-wide significant if it has Po5 Â 10 À 8 . In addition, we assessed whether any of the findings achieved independent replication in stage 2 using a threshold corrected for the number of variants followed up (0.05/ 56 ¼ 8.93 Â 10 À 4 ). NATURE COMMUNICATIONS | DOI: 10.1038/ncomms9658 ARTICLE Functional characterization of novel loci. A series of analyses were undertaken to provide insights into the expression of genes within the 16 loci (defined as ± 1 Mb either side of the sentinel variant) represented here. Blood 21 and lung tissue 20 eQTL analyses were undertaken for variants in these loci that were in LD (r 2 40.3) with the sentinel variant in the region. We assessed whether variants within these loci that were strongly correlated with the sentinel variants (r 2 40.8) were in DNase hypersensitivity sites as defined by ENCODE 22 for cells potentially relevant to lung function. We also carried out conditional analyses, using GCTA 65 , of sentinel variants conditioning on functional variants within the loci to assess whether the association signals were explained by functional variants (P value of the sentinel variant conditioned on the functional variant, conditional P, 40.01). Functional variants were defined using SIFT 66 , PolyPhen-2 (ref. 67), CADD 68 and GWAVA 69 databases. Additional analyses were undertaken for a subset of priority genes within the 16 loci (description of the selection is given in the Supplementary Methods). These included RNA-seq analyses to confirm messenger RNA expression in a lungrelevant cell (bronchial epithelium) and detect novel splice isoforms; assessment of differential expression across pseudoglandular and canalicular stages of human fetal lung development using gestational age as a continuous variable in linear regression 25 , and assessment of differences in expression levels in bronchial brushings between COPD cases and smoking controls 70 . Details for all these analyses are provided in the Supplementary Methods.
Associations with other traits. The association of the 16 sentinel variants with the following traits was assessed: lung function in children undertaking the same analysis as for adults in the ALSPAC data set 42 ; gene by smoking interaction by undertaking a Z-test comparing the effect of a given variant in ever smokers and in never smokers using stage-1 results; smoking behaviour by undertaking a logistic regression analysis with heavy-versus never-smoking status as an outcome in the UK BiLEVE data set. In addition, the GWAS catalog 43 was queried for variants in 2-Mb regions centred on the sentinel variant for the 16 loci. Variants that were genome-wide significant (Po5 Â 10 À 8 ) in the GWAS catalog 43  Additional analyses. Heterogeneity tests were undertaken for the 16 sentinel variants in stage 1. We undertook stepwise conditional analyses as performed by GCTA 65 in each locus to identify additional signals. Full methods and results are described in the Supplementary Notes.