Whole genome sequence analysis of pulmonary function and COPD in 19,996 multi-ethnic participants

Chronic obstructive pulmonary disease (COPD), diagnosed by reduced lung function, is a leading cause of morbidity and mortality. We performed whole genome sequence (WGS) analysis of lung function and COPD in a multi-ethnic sample of 11,497 participants from population- and family-based studies, and 8499 individuals from COPD-enriched studies in the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program. We identify at genome-wide significance 10 known GWAS loci and 22 distinct, previously unreported loci, including two common variant signals from stratified analysis of African Americans. Four novel common variants within the regions of PIAS1, RGN (two variants) and FTO show evidence of replication in the UK Biobank (European ancestry n ~ 320,000), while colocalization analyses leveraging multi-omic data from GTEx and TOPMed identify potential molecular mechanisms underlying four of the 22 novel loci. Our study demonstrates the value of performing WGS analyses and multi-omic follow-up in cohorts of diverse ancestry.


Phenotype Harmonization
The NHLBI Pooled Cohorts Study (PCS) 1 harmonized and pooled data from nine large US epidemiologic cohorts that conducted lung function assessments over the last four decades. Additionally, data on self-administered questionnaires, with detailed questions regarding tobacco consumption, past medical history, medications, respiratory symptoms, lipids, renal biomarkers, etc., were also harmonized. Data on CLRD events was harmonized using either adjudicated CLRD hospitalizations or ICD data for all hospitalizations occurring over follow-up.
Among the cohorts included in the NHLBI PCS, the follow cohorts were also included in our TOPMed WGS analyses, for which we utilized harmonized data sets from NHLBI PCS: Atherosclerosis Risk in Communities (ARIC) study; Cardiovascular Health Study (CHS); Framingham Offspring Cohort (FHS-O); Jackson Heart Study (JHS), the Multi-Ethnic Study of Atherosclerosis (MESA). For the purpose of phenotype harmonization in TOPMed, harmonized data sets for each of the participating cohorts were provided by the NHLBI Pooled Cohorts Study for analyses in the current WGS analyses. We note that a subset of the ARIC participants were later recruited into JHS. For TOPMed purposes, participants in the ARIC-JHS overlap group were not included as part of Exam 4 in ARIC. For JHS, ARIC recruits were not excluded. For TOPMed studies not included in the NHLBI Pooled Cohorts Study (the Cleveland Family Study [CFS], Genetic Epidemiology of COPD [COPDGene], and Boston Early-Onset COPD Study [EOCOPD]), phenotype harmonization was conducted separately by each of the participating studies, following as closely as possible with the NHLBI Pooled Cohorts Study variable definitions and procedures.
For studies with multiple longitudinal spirometry measures, we worked with investigators from each study to determine the most practical way to construct a cross-sectional subset of data. All spirometry data utilized for this effort were obtained as prebronchodilator measures. Details are provided in the cohort descriptions below.

Cohort Descriptions: Population-and Family-based Cohorts
The Atherosclerosis Risk in Communities Study (ARIC) ARIC study is a prospective cohort study designed to evaluate the etiology of atherosclerosis and its clinical sequelae in a general population based sample of adults. 2  At each visit, spirometry testing protocols were standardized across the four ARIC field centers, calibration checks were performed daily, and the standardization of data collection and management was coordinated across field centers by a single pulmonary function reading center. Each participant's best FEV 1 and FVC of three acceptable maneuvers, based on the centralized expert review, was used for analysis. 3 For the current cross-sectional WGS analysis, data from the most recent spirometry exam were utilized for participants having multiple longitudinal measures, as these measures were viewed to be the most consistent with those available for the other more contemporary cohorts.

The Cardiovascular Health Study (CHS)
CHS is a population-based cohort study of risk factors for coronary heart disease and stroke in adults ≥65 years conducted across four field centers. 4 The original predominantly European ancestry cohort of 5,201 persons was recruited in 1989-1990 from random samples of the Medicare eligibility lists; subsequently, an additional predominantly African-American cohort of 687 persons was enrolled for a total sample of 5,888. Forty one European ancestry CHS participants that had been selected for inclusion in the second phase of the TOPMed sequencing program were used included in our discovery analyses.
Blood samples were drawn from all participants at their baseline examination and DNA was subsequently extracted from available samples. CHS was approved by institutional review committees at each field center and individuals in the present analysis had available DNA and gave informed consent including consent to use of genetic information for the study of cardiovascular disease. Pulmonary function testing was conducted at the baseline visit and follow-up visits in years four and seven. 5,6 Spirometry technicians were centrally trained and certified prior to recruitment of participants. A standard spirometry system, including a Collins Survey I water-seal spirometer (Collins Medical, Inc., Braintree, Massachusetts) and software from S&M Instruments (Doylestown, Pennsylvania), was used by technicians at all four recruitment centers. Stringent quality assurance procedures for spirometry testing exceeded ATS recommendations. 5 For the current cross-sectional WGS analysis, data from the most recent spirometry exam were utilized for participants having multiple longitudinal measures, as these measures were viewed to be the most consistent with those available for the other more contemporary cohorts.

The Cleveland Family Study (CFS)
CFS is a family-based longitudinal study that includes participants with laboratory diagnosed sleep apnea, their family members and neighborhood control families followed between 1990 and 2006. Four examinations over 16 years provided measurements of sleep apnea with overnight polysomnography, anthropometry, and other related phenotypes. 7 At each exam, forced vital capacity (FVC) and forced expiratory volume at one second (FEV 1 ) were obtained using a calibrated spirometer (Multi-Spiro). While seated, participants were encouraged to perform between 5-8 maneuvers to obtain 3 curves that met ATS standards for acceptability and reproducibility. For the current cross-sectional WGS analysis, data from the most recent spirometry exam were utilized for participants having multiple longitudinal measures, since the most recent exam provided the largest number of participants with nonmissing spirometry data.

The Framingham Heart Study (FHS)
The Original Cohort of the Framingham Study was established between 1948 and 1952 as a random sample of 5,209 adult residents of the town of Framingham, Massachusetts. Between 1971 and 1975, the Framingham Study was expanded to include a second generation, the Offspring Cohort, comprising 5,124 adults who were the offspring, or spouses of the offspring, of Original Cohort participants. 8 The Offspring Cohort has returned for examinations approximately every 4 years since enrollment, and spirometry data are available for the 3rd, 5th, 6th, 7th, and 8th examinations.
Spirometry for the Offspring Cohort 3rd examination (1983-87) was performed with a Collins Survey II spirometer interfaced with an Eagle II microprocessor (Warren E. Collins, Inc., Braintree, MA). Spirometry for the 5th (1991-95), 6th (1995-98), and 7th (1998-2001) examinations were performed with a Collins Survey II spirometer interfaced with a personal computer equipped with software developed by S & M Instruments (Doylestown, PA) and adapted for use in epidemiologic studies. Spirometry for the 8th (2005-08) and 9th  examinations was performed with the Collins Comprehensive Pulmonary Laboratory (CPL) system with Collins 2000 Plus/SQL Software (Nspire Health, Inc., Longmont, CO). Spirometry was performed in accordance with contemporaneous guidelines of the American Thoracic Society. For the current cross-sectional WGS analysis, data from the earliest spirometry exam were utilized for participants having multiple longitudinal measures. We used this strategy to construct a cross-sectional subset because there were changes in the spirometers used in FHS over time, and selecting the earliest available spirometry measures for each participant was a practical way to create an internally consistent data set within this cohort.

The Jackson Heart Study (JHS)
JHS is a large, population-based observational study evaluating the etiology of cardiovascular diseases and related disorders among African Americans residing in the three counties (Hinds, Madison, and Rankin) that make up the Jackson, Mississippi metropolitan area. 9,10 Data and biologic materials have been collected from 5,301 participants, including a nested family cohort of 1,498 members of 264 families. The age at enrollment for the unrelated cohort was 35-84 years; the family cohort included related individuals >21 years old. During a baseline examination (2000)(2001)(2002)(2003)(2004) and two follow-up examinations (2005-2008 and 2009-2012), participants provided extensive medical and social history, had an array of physical and biochemical measurements and diagnostic procedures, and provided blood for genomic DNA. 11 The study population is characterized by a high prevalence of diabetes, hypertension, obesity, and related disorders. Annual follow-up interviews and cohort surveillance are ongoing. For the current cross-sectional WGS effort, only baseline spirometry data were available and utilized for analyses.

The Multi-Ethnic Study of Atherosclerosis (MESA)
MESA is a longitudinal study of subclinical cardiovascular disease and risk factors that predict progression to clinically overt cardiovascular disease or progression of the subclinical disease. 12  St. Paul, Minnesota; Chicago; and Los Angeles. Exclusion criteria were clinical cardiovascular disease, weight exceeding 136 kg (300 lb.), pregnancy, and impediment to long-term participation. The MESA Lung Study performed spirometry at Exams 3, 4 and 5 following the 2005 ATS/ERS guidelines in a subset of the MESA Study. 13 All participants provided informed consent and the protocols of MESA were approved by the IRBs of collaborating institutions and the National Heart, Lung and Blood Institute. For the current cross-sectional WGS analysis, data from the earliest spirometry exam were utilized for participants having multiple longitudinal measures, as the measures from Exams 3 and 4 were viewed to be more internally consistent with each other, and only a smaller subset of participants had measures available at Exam 5.

Boston Early-Onset COPD (EOCOPD) Study
The Boston Early-Onset COPD (EOCOPD) study was designed to study genetic factors for early-onset and severe COPD. 14 Probands were selected to be physician-diagnosed COPD cases with FEV 1 ≤ 40% predicted and age ≤ 53. Subjects with severe alpha-1 antitrypsin deficiency and other chronic lung diseases (except asthma) were excluded. All subjects completed a questionnaire and spirometry testing before and after bronchodilator administration. Blood samples and written informed consent were obtained for each study subject. A subset of the most severe unrelated probands from this study were sent for whole-genome sequencing through the TOPMed project. For the current cross-sectional WGS effort, only baseline spirometry data were available and utilized for analyses.

Genetic Epidemiology of COPD (COPDGene)
COPDGene 15 is a multi-center observational cohort for epidemiologic and genetic study of over 10,000 subjects (2/3 non-Hispanic White and 1/3 African Americans) with at least 10 pack-years of cigarettes smoking with and without COPD. All subjects underwent extensive phenotyping, including lung function, CT phenotypes (including emphysema and expiratory gas trapping). Pre-and post-bronchodilator spirometry measures were obtained using a standardized protocol and spirometer (ndd EasyOne Spirometer, Zurich, Switzerland). All study sites obtained local IRB approval to enroll participants and all subjects provided informed consent. For the current cross-sectional WGS effort, only baseline spirometry data were utilized for analyses, as the baseline data were complete and only a subset of participants had follow-up data.

Quality control of samples and variants included in our TOPMed pooled cohorts WGS analysis
For the pooled cohort, we first removed subjects who failed sample-level quality control. The filters included checking for pedigree errors, discrepancies between self-reported and genetic sex, and concordance between prior SNP array genotypes and WGSderived genotypes (see https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-project-freeze-5b-phases-1-and-2). In addition, duplicated subjects were identified using ~250k independent markers. Only one subject from each pair of duplicates was kept. There were 19,996 subjects of the pooled cohort that passed the filtering.
For site-level quality control, variants were removed based on Mendelian discordance, a support vector machine (SVM) quality filter and excess heterozygosity filter (see https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-project-freeze-5b-phases-1-and-2). After filtering on variant level quality control and expected heterozygosity count > 30, there were 28,740,775 variants remaining for analysis.

Comparison of TOPMed WGS calls with genotypes imputed by GWAS in MESA
We examined the R-squared of genotypes for the novel associated variants using variant calls from TOPMed Freeze 5b compared to genotypes obtained using imputation of genome-wide genotypes in MESA to various reference panels including the 1000 Genomes Phase 1 16 , 1000 Genomes Phase 3 17 , the Haplotype Reference Consortium 18 , and the TOPMed reference panel. R-squared between TOPMed called genotypes and imputed genotype dosage values was computed as the square of the Pearson correlation.
Genome-wide genotyping in MESA: Participants in the original MESA cohort, the MESA Family Study and the MESA Air Pollution Study who consented to genetic analyses were genotyped in 2009 using the Affymetrix Human SNP array 6.0. Genotype quality control for these data included filter on SNP level call rate < 95%, individual level call rate < 95%, heterozygosity > 53%. The cleaned genotypic data was deposited with MESA phenotypic data into dbGaP as the MESA SHARe project (study accession phs000209, http://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000209.v7.p2); 8,224 consenting individuals (2,685 White, 2,588 non-Hispanic African-American, 2,174 Hispanic, 777 Chinese) were included, with 897,981 SNPs passing study specific quality control (QC).
Genome-wide imputation in MESA: IMPUTE version 2.2.2 was used to perform imputation for the MESA SHARe genotypes to the cosmopolitan 1,000 Genomes Phase 1 v3 March 2012 reference set. 16 The University of Michigan imputation server was used for imputation of the MESA SHARe genotypes to the 1,000 Genomes Phase 3 integrated variant set 17 and the Haplotype Reference Consortium Release 1 data 18 .

Replication studies
For those novel WGS variants identified in TOPMed for association with quantitative lung function traits (FEV 1 , FVC and FEV 1 /FVC ratio), we pursued replication for the same traits in multiple independent cohorts (described below). Variants identified for associations with moderate-to-severe or severe COPD in TOPMed were also examined for association with FEV 1 /FVC ratio in the replication cohorts. Variants were reported as demonstrating statistically significant replication if they achieve Bonferroni-corrected statistical significance under a family-wise error rate of 0.05, after accounting for the number of variants under consideration for each trait. For variants demonstrating evidence of replication in relation to lung function, we further examined their association with measures of smoking behavior in order to determine whether the replicated associations reflected primary association with lung function, or indirect association due to smoking.

UK Biobank
The UK Biobank project 19 is a prospective cohort study of approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. Deep phenotypic information, health data, and biological samples have been collected for all participants. These include questionnaire on socio-demographic, lifestyle and health-related factors, physical measurements, blood, urine, and saliva samples at recruitment. Eye measures, electrocardiograph test, arterial stiffness and hearing test were performed during assessment visit. All subjects provided electronic signed consent for follow-up through linkage to their health-related records.
Spirometry phenotypes were cleaned and the resulting data were used to conduct a genome-wide association study in participants from the UK Biobank. 20 Replication of lung function signals was examined using imputation to the Haplotype Reference Consortium (HRC) 18 in the 321,047 UK Biobank European samples and in 4,350 African UK Biobank samples. African ancestry samples were identified in UK Biobank by kmeans clustering of the first two principal components of ancestry. 20 Residuals from linear regression of each trait (FEV 1 , FVC, FEV 1 /FVC) against age, age2, sex, height, smoking status (ever or never) and genotyping array were ranked and inverse-normal transformed, giving normally distributed Z-scores, and genetic association testing was run using an additive genetic model implemented in BOLT-LMM v2.3 21 , using a linear mixed model to account for relatedness and fine-scale population structure. Lung function signals were also tested for association with smoking behaviour traits, including smoking initiation (SI), smoking cessation (SC) and heaviness of smoking index (HSI) in up to 447,062 European and 7,702 African UK Biobank samples. Genetic association testing of smoking behaviour traits was run under an additive genetic model using BOLT-LMM v2.320 21 . For binary smoking behaviour phenotypes, age, age squared, sex and genotyping array were used as covariates. For the quantitative HSI phenotype, residuals from linear regression against age, age squared, sex and genotyping array were ranked and inverse-normal transformed to obtain adjusted, normally distributed Zscores. For replication analysis, we filter variants based on imputation R-squared > 0.3 and effective heterozygosity count > 15.

Hispanic Community Health Study / Study of Latinos (HCHS/SOL)
HCHS/SOL is a community-based cohort study of 16,415 self-identified Hispanic/Latino persons aged 18 to 74 years recruited from four U.S. communities. [22][23][24] Institutional Review Boards at each field center approved study protocols, and written informed consent was obtained from all participants. For the current replication effort, we examined results from a published GWAS of pulmonary function in 11,822 participants from the HCHS/SOL 25 , performed using imputation to the 1000 Genomes Phase 1 v3 16 . Analyses of FEV 1 , FVC and FEV 1 /FVC ratio used linear mixed models, stratified by Hispanic/Latino ancestry group, and adjusted for age, age squared, sex, height, height squared, study center, smoking status, pack-years, sampling weights, and the first five PCs, as fixed effects. We used random effects for genetic relatedness (kinship) and household and community (block unit) to account for environmental correlation. Results from each Hispanic/Latino ancestry group were then meta-analyzed using the MetaCor method. 26 Replication results are reported in the current manuscript after filter on imputation R-squared > 0.3 and effective minor heterozygosity count > 15.
Colocalization analysis eQTL in GTEx v7: eQTL data used for the analyses described in this manuscript were obtained from the GTEx Portal on March 22, 2019 and represented sample sizes ranging from 70 to 491 per tissue, with 383 samples for lung. As the summary statistics for GTEx v7 were in human genome assembly hg19, the liftOver tool was used to map the coordinates to the same assembly. eQTL and mQTL in MESA: RNA-seq was performed for PBMCs from MESA and profiling of genome-wide methylation was performed for whole blood using the Illumina EPIC array. eQTL and mQTL mapping was performed using tensorQTL 27 in ~900 individuals from MESA Exam 1 and Exam 5 data for variants with MAF ≥ 1%. The mapping window was set to +/-1Mb of the TSS for eQTLs and +/-500kb of the CpG site for mQTLs. We used 11 genotype PCs as covariates to control for population effects, and we used PEER factors 28 to control for both technical and biological variation. The optimal number of PEER factors to use was determined to maximize the cis-eGene and cis-mProbe discovery.
Colocalization analysis: Colocalization analysis of novel GWAS variants and eQTL/mQTL was performed using the R/coloc v3.1 package 29 with default priors (p1 = 1e-04, p2 = 1e-04, and p12 = 1e-05). Using the molecular QTL resources described above, we conducted colocalization analysis on all variants within +/-500kb of the novel WGS variants (reported in Supplementary Data 2), as well as those identified in conditional analysis (Supplementary Data 11). We kept only results where the lead WGS variant was a significant e/mQTL and posterior probability of colocalization (PP4) > 0.5. We further focused on results where the model of a single shared causal variant driving both association signals (PP4) was strongly preferred over a model of two distinct causal variants (PP3) -PP4/(PP3 + PP4) ≥ 0.9. In addition, we required adequate power for these results to detect colocalization, which we quantified using a cutoff of PP3 + PP4 ≥ 0.8.

Follow-up of selected methylation sites:
Those methylation sites demonstrating colocalization with TOPMed WGS were examined further to determine whether measured methylation from MESA Exam 1 was associated with the corresponding lung function traits in MESA which were based on follow-up measures from MESA Exam 4. The association of measured methylation with lung function was tested in linear regression of the following form: PFT ~ methylation + age + age^2 + sex + height + height^2 + weight (for FVC only) + current smoking + former smoking + pack-years of smoking + genetic PCs of ancestry + methylation PEER factors. We applied inverse normal transformation on both PFT trait and methylation levels of the colocalized methylation sites. We report methylation-lung function associations for those results demonstrating Bonferroni-corrected significance accounting for the number of methylation markers tested.
Overlap with pathways previously implicated by GWAS: We examined genes in the following categories for overlap with pathways previously implicated by GWAS: (1) novel genes supported by eQTL colocalization, as indicated in Tables 2 and 3, (2) genes supported by findings in our TOPMed analysis that overlap previously reported GWAS signals, and (3) novel variants identified in our study that are located within gene introns. We selected for examination those gene ontology (GO) terms represented as enriched among genes implicated by GWAS in Supplementary Table 15 of the recent lung function GWAS paper by Shrine et al. 20 We then used the database of GO term inclusion provided at http://amigo.geneontology.org/ to report the overlap of our identified genes with the selected GO terms.

Supplementary Note 1 Study Specific Acknowledgments: Population-and Family-based Cohorts
The Atherosclerosis Risk in Communities Study SJL is supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (ZO1 ES043012). The Genome Sequencing Program (GSP) was funded by the National Human Genome Research Institute (NHGRI), the National Heart, Lung, and Blood Institute (NHLBI), and the National Eye Institute (NEI). The GSP Coordinating Center (U24 HG008956) contributed to cross-program scientific initiatives and provided logistical and general study coordination. The Centers for Common Disease Genomics (CCDG) program was supported by NHGRI and NHLBI, and whole genome sequencing was performed at the Baylor College of Medicine Human Genome Sequencing Center (UM1 HG008898 and R01HL059367).
The Atherosclerosis Risk in Communities study has been funded in whole or in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services (contract numbers HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700004I and HHSN268201700005I). The authors thank the staff and participants of the ARIC study for their important contributions.

The Cardiovascular Health Study
This Cardiovascular Health Study (CHS) research was supported by NHLBI contracts HHSN268201200036C, HHSN268200800007C, HHSN268200960009C, HHSN268201800001C N01HC55222, N01HC85079, N01HC85080, N01HC85081, N01HC85082, N01HC85083, N01HC85086; and NHLBI grants U01HL080295, U01HL130114, R01HL087652, R01HL105756, R01HL103612, R01HL085251, and R01HL120393 with additional contribution from the National Institute of Neurological Disorders and Stroke (NINDS). Additional support was provided through R01AG023629 from the National Institute on Aging (NIA). A full list of principal CHS investigators and institutions can be found at CHS-NHLBI.org.

The Cleveland Family Study
The Cleveland Family Study and SR were supported by NIH grants HL 046389, HL113338, and 1R35HL135818. BC is supported by the NIH grant K01 HL135405 and an American Thoracic Society Foundation Unrestricted Grant (Sleep) (http://foundation.thoracic.org).

The Framingham Heart Study
The Framingham Heart Study (FHS) acknowledges the support of contracts NO1-HC-25195, HHSN268201500001I and 75N92019D00031 from the National Heart, Lung and Blood Institute and grant supplement R01 HL092577-06S1 for this research. We also acknowledge the dedication of the FHS study participants without whom this research would not be possible. Dr. Vasan is supported in part by the Evans Medical Foundation and the Jay and Louis Coffman Endowment from the Department of Medicine, Boston University School of Medicine.

Genetic Epidemiology of COPD (COPDGene)
The project described was supported by Award Number U01 HL089897 and Award Number U01 HL089856 from the National Heart, Lung, and Blood Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health.

COPD Foundation Funding
The COPDGene ® project is also supported by the COPD Foundation through contributions made to an Industry Advisory Board comprised of AstraZeneca, Boehringer Ingelheim, GlaxoSmithKline, Novartis, Pfizer, Siemens and Sunovion.

Study Specific Acknowledgments: Replication cohorts
The    The upper plot shows the TOPMed WGS results for FEV 1 based on analysis of combined White samples. The lower plot shows the corresponding local genetic association results for methylation levels at cg06249499 from MESA Exam 1. Note that colocalization was observed in comparing WGS results from combined White samples with genetic association for methylation in whole blood for the same site cg06249499 at both MESA Exams 1 and 5. Here, we focus on the results from MESA Exam 1 for the purpose of display.