Lung disease in cystic fibrosis (CF) varies even among patients homozygous for the same genetic mutation (Phe508del), which accounts for ~70% of CF alleles in Caucasian populations.1 Using genome-wide association studies (GWASs) of 6,365 subjects, the International Cystic Fibrosis Gene Modifier Consortium (hereafter, the Consortium) has identified genetic loci associated with CF lung disease severity, including an intergenic region on chr11p13 (refs 2,3). The region is flanked by genes of high potential relevance to CF lung disease, including EHF, a transcription factor involved in epithelial differentiation,4,5 and APIP, a dual-function protein with roles in apoptosis and methionine salvage.6,7 The gene PDHX is also in the region and shares a promoter with APIP. Interestingly, the association signal in this region was stronger in 4,139 subjects homozygous for the most common CF mutation (Phe508del) than in an analysis that included CF individuals with any severe CFTR genotype.2

Targeted resequencing is a viable approach to obtain a detailed genetic variant map to facilitate post-GWAS mechanistic studies. Toward this end, targeted resequencing (termed “ReSeqChr11”) between the 5′ end of EHF and the 3′ end of PDHX (human reference genome assembly hg19, chr11:34,641,749–35,017,674) was conducted using National Heart, Lung, and Blood Institute (NHLBI) Resequencing and Genotyping Services (rsng.nhlbi.nih.gov) (Figures 1a–c and Supplementary Figure 1). Resequencing was conducted in 377 homozygous Phe508del subjects selected from the larger Genetic Modifier Study cohort (University of North Carolina at Chapel Hill/Case Western Reserve University) that was recruited based on extremes of lung disease severity.8 Samples selected were balanced for gender, lung disease severity (KNoRMA; equal numbers >0.4 and <0.4),9 and genotype at rs12793173, the first reported3 single-nucleotide polymorphism (SNP) with the lowest P value, which was found to be in strong linkage disequilibrium (LD) with the GWAS1+2 (ref. 2) top SNP, rs10742326 (Supplementary Table 1 and Supplementary Figures 1 and 2).2,3 All library preparation, enrichment, and sequencing were performed at the University of Washington Genome Center at Seattle using a custom NimbleGen SeqCap probe library. Resequencing covered ~81% of the entire region, with an average depth of 259× in the sequenced regions. Gaps in coverage were due to highly repetitive features not compatible with the NimbleGen capture platform (Figure 1c).

Figure 1

Summary of ReSeqChr11 and cystic fibrosis (CF) lung disease association testing. Annotation information and CF lung disease association test results were converted into either BED or BEDGRAPH format with hg19 coordinates, and displayed as custom tracks on the University of California Santa Cruz (UCSC) genome browser with other relevant public annotations. The sections are: (a) scale bar and genome coordinates on chr11 of UCSC hg19 reference genome; (b) UCSC genes annotation showing EHF, APIP, and PDHX genomic structure; (c) tiled region of NimbleGen probes used to enrich the local genomic DNA to be resequenced; (d) CF lung disease severity association P values for imputed single-nucleotide polymorphisms (SNPs)/indels among Phe508del homozygous patients; (e) CF lung disease severity association P values in non-Phe508del homozygous patients; (f) CF lung disease severity association P values for microsatellite length polymorphisms in Phe508del homozygous patients; (g) scale bar and genome coordinates on chr11 of UCSC hg19 reference genome for zoom-in region; (h) zoom-in view of genomic locations of SNPs/indels with the most significant CF lung disease severity association (P value <10⁻⁷); the red bar marked with a red asterisk (*) represents the top GWAS1+2 association SNP rs10742326 (ref. 2), and the blue asterisk (*) two SNPs down denotes the top SNP in this study (rs374869483); (i) genomic locations of all SNPs/indels in the region with minor allele frequency (MAF) >0.05, colored by linkage disequilibrium (LD) R² values (R²>0.8, red lines; R²=0.6–0.8, blue lines) compared with rs10742326; (j) DNase I hypersensitivity peak assignments from pHTE cells; (k) DNase I hypersensitivity peak assignments from K562 leukemia cell line; and (l) summary of transcription factor binding motifs from ENCODE chromatin immunoprecipitation sequencing (ChIP-seq) track.
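As an illustration of the track conversion mentioned in the caption, the following minimal Python sketch turns per-variant association P values into bedGraph lines scored as −log10(P). It is not the authors' script: the function name `to_bedgraph`, the track name, and the input tuple layout are assumptions made for this example. The only format detail relied on is that BED-family files use 0-based, half-open intervals, so a 1-based variant position `pos` is written as `pos-1` to `pos`.

```python
import math

def to_bedgraph(records, track_name="CF_lung_assoc"):
    """Convert (chrom, pos_1based, p_value) records to bedGraph lines.

    Hypothetical helper for illustration only. Each variant becomes a
    1-bp interval (0-based start, exclusive end) scored as -log10(P).
    """
    lines = [f'track type=bedGraph name="{track_name}"']
    for chrom, pos, p in sorted(records, key=lambda r: (r[0], r[1])):
        if not (0.0 < p <= 1.0):
            continue  # skip missing (NaN) or invalid P values
        # p <= 1 so log10(p) <= 0; abs() also avoids printing "-0.000" for P = 1
        score = abs(math.log10(p))
        lines.append(f"{chrom}\t{pos - 1}\t{pos}\t{score:.3f}")
    return lines

# Toy input with made-up positions and P values (not study results).
toy = [("chr11", 34800000, 1e-7), ("chr11", 34790000, 0.01)]
track = to_bedgraph(toy)
```

The resulting lines can be joined with newlines and uploaded as a custom track; plotting −log10(P) rather than P keeps the strongest associations as the tallest bars.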

The paired-end 49 bp sequence reads were mapped to reference genome hg19 using BWA v0.5.9-r16 (ref. 10). SNP and insertion/deletion (indel) calls were made at the University of Washington using an automated software pipeline based on GATK toolkits v1.0-6125 (refs 11,12). The initial resequencing variant calls contained 4,800 variants, including SNPs/indels, with National Center for Biotechnology Information (NCBI) dbSNP134 annotation. Results for 94 previously genotyped SNPs were 99.98% concordant. SNP/indel calls were manually reviewed using both quality information from the variant call format (VCF) files and selected spot checks of sequencing read alignments from the binary alignment map (BAM) files, using the Integrative Genomics Viewer (IGV) genome browser.13 On manual inspection in IGV, only 40% of indels called by the GATK toolkit were verified as expected.14 The final, manually reviewed SNP/indel calls from the 377 patient samples contained 2,991 variants over the resequenced region. In addition, 101 polymorphic microsatellites were called using GenoTan v0.1.5 (ref. 15). We also identified a 113-bp deletion, corresponding to rs78669256 (115 bp deletion on hg19; 0.34 minor allele frequency (MAF) in our resequenced samples), that is part of a LINE element and whose allele frequency and validation status were unknown in the single-nucleotide polymorphism database (dbSNP; Supplementary Figure 3).

The 2,991 variants were updated with NCBI dbSNP141 annotation by chromosomal location on hg19, using ANNOVAR (ref. 16) and University of California Santa Cruz (UCSC) genome databases; 946 of them were novel variants that tended to be rare (Supplementary Figure 4, black dots), with only 1 deletion (rs535729750 in dbSNP142) with MAF >0.1 and only 23 deletions with MAF >0.01.

For association with CF lung disease severity, imputation to the remainder (n=6,365) of the Consortium patient cohort (ref. 2) was performed using MACH and Minimac2, as previously described,17 except that reviewed variants from ReSeqChr11 were used as the reference set instead of 1000 Genomes variants. For microsatellite repeat polymorphisms, imputation was performed using Beagle v4.0 (ref. 18) because of its ability to impute multiple alleles at a single locus. Of the total 3,160 imputed variants over the resequenced region, 1,485 were of sufficient MAF (>0.01) and imputation quality (R²>0.3; Supplementary Figure 4, vertical purple dashed line) to provide reliable tests of association with CF lung disease severity. Association testing was performed as previously described2 with genotype principal components (PCs) and sex as covariates. Briefly, the Consortium’s quantitative lung phenotype9 was regressed on imputed allele dosages using linear regression within each Consortium cohort,2 followed by a meta-analysis to obtain the final reported random-effects P value. Microsatellite repeat variants were coded with the most common allele as the reference and all other alleles as alternative alleles before association testing.
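To make the final step concrete, the sketch below combines per-cohort regression estimates into a single random-effects P value. The text does not say which random-effects estimator was used, so the DerSimonian-Laird method here is an assumption, as are the function name `random_effects_meta` and the toy inputs. Each cohort is assumed to contribute an effect estimate (beta) and its standard error from the linear regression of the lung phenotype on imputed allele dosage.

```python
import math

def random_effects_meta(betas, ses):
    """DerSimonian-Laird random-effects meta-analysis (assumed estimator).

    betas, ses: per-cohort effect estimates and standard errors.
    Returns (pooled_beta, pooled_se, two_sided_p, tau2).
    """
    if len(betas) != len(ses) or len(betas) < 2:
        raise ValueError("need matching beta/SE lists from at least two cohorts")
    w = [1.0 / se ** 2 for se in ses]            # inverse-variance weights
    sw = sum(w)
    beta_fixed = sum(wi * b for wi, b in zip(w, betas)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate
    q = sum(wi * (b - beta_fixed) ** 2 for wi, b in zip(w, betas))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-cohort variance, floored at zero
    tau2 = max(0.0, (q - (len(betas) - 1)) / c)
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    sw_re = sum(w_re)
    beta_re = sum(wi * b for wi, b in zip(w_re, betas)) / sw_re
    se_re = math.sqrt(1.0 / sw_re)
    z = beta_re / se_re
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided normal P value
    return beta_re, se_re, p, tau2

# Toy per-cohort estimates (made up; not Consortium results).
pooled_beta, pooled_se, p_value, tau2 = random_effects_meta(
    [0.10, 0.12, 0.08], [0.02, 0.03, 0.025]
)
```

When the cohorts agree closely, the estimated between-cohort variance is zero and the result matches a fixed-effect inverse-variance analysis; heterogeneity between cohorts widens the pooled standard error.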

Overall, the association P values are highly significant and reflect published GWAS results2,3 (Table 1, Figure 1, and Supplementary Figure 5). The top associated SNPs (P<10⁻⁷) for Phe508del homozygous subjects (n=4,139) were located between chr11:34,776,532 and 34,819,022, with the top SNP identified as rs374869483 (Figure 1d and blue asterisk, Figure 1h), an indel located 580 bp downstream from the published top SNP (rs10742326; ref. 2) (red asterisk, Figure 1h). Importantly, no significant associations (P<10⁻⁵) were identified when the CF cohort was limited to non-Phe508del homozygous subjects (n=2,226) (Table 1 and Figure 1e), and the interaction between the top SNP (rs374869483) and Phe508del homozygous status was significant at P=0.046. This is consistent with previous results that reported a reduction in significance of the associations for this region in the entire CF cohort compared with Phe508del homozygous subjects2 (Figure 1d). To evaluate association signals independent of the SNP with the lowest P value, the imputed dosage of the conditioning SNP (rs10742326; the top SNP in GWAS1+2, which is in LD with rs374869483 (R²=0.987, D′=0.995)) was included in the statistical model as a covariate (Supplementary Figure 6). No independent evidence of association was identified. In addition, SNP-set Kernel Association Test (SKAT) and burden tests19 applied to the rare variants (MAF <0.01) in EHF, APIP, PDHX, and several regulatory features in the 377 selected samples did not identify any significant associations (Supplementary Table 2). This analysis was limited to these subjects because rare variants cannot be imputed to the entire population.
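For readers less familiar with the two LD measures quoted above, the following sketch computes R² and D′ between two biallelic sites using their standard definitions. It assumes phased haplotypes with alleles coded 0/1; the function name `ld_stats` and the toy data are illustrative only and do not reproduce how the reported values were obtained.

```python
def ld_stats(haplotypes):
    """Return (r2, d_prime) for two biallelic sites.

    haplotypes: list of (a, b) tuples, one per phased chromosome,
    with each allele coded 0 or 1 (an assumed input layout).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # freq. of allele 1 at site A
    p_b = sum(b for _, b in haplotypes) / n          # freq. of allele 1 at site B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # raw disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    r2 = d * d / denom
    # D' scales |D| by its maximum possible value given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max

# Toy chromosomes: the two sites differ on 10 of 100 haplotypes.
toy = [(1, 1)] * 30 + [(0, 1)] * 10 + [(0, 0)] * 60
r2, d_prime = ld_stats(toy)
```

In the toy data only three of the four possible haplotypes occur, so D′ reaches 1 while R² stays below 1; R² approaches 1 only when the two sites also have nearly equal allele frequencies, as is the case for a conditioning SNP that absorbs essentially all of the signal at a second variant.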

Table 1 Variants associated with CF lung disease severity with nominally more significant Phe508del homozygous P values than the GWAS1+2 (ref. 2) top SNP (rs10742326)

Together, the analysis suggests that the causal mechanism is driven by one or more common variants in strong LD (Figure 2). Indeed, Haploview software (default settings)20 identified strong LD among all highly associated variants (Figures 1g–i and 2a–d). The top five haplotypes containing the 111 highly associated SNPs are shown in Figure 2e, with the three most common haplotypes representing 38, 24, and 22% of all haplotypes. The fact that variants associated with disease are often in high LD and occur in the context of local haplotype structure raises the possibility that the mechanism of regulation acts through multiple sites.
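The haplotype summary can be pictured with a short sketch that tabulates haplotype frequencies. It assumes the haplotypes are already phased and supplied as allele strings, which sidesteps the statistical haplotype estimation that tools such as Haploview perform on unphased genotypes. The function name `top_haplotypes` and the three-site toy strings are hypothetical.

```python
from collections import Counter

def top_haplotypes(phased, k=5):
    """Return the k most common haplotypes with their frequencies.

    phased: list of haplotype strings, one per chromosome (assumed
    to be phased already; no haplotype inference is done here).
    """
    counts = Counter(phased)
    total = sum(counts.values())
    return [(hap, n / total) for hap, n in counts.most_common(k)]

# Toy data over three sites, not the 111 SNPs from the study.
toy = ["ACG"] * 38 + ["ATG"] * 24 + ["GCG"] * 22 + ["GTA"] * 16
summary = top_haplotypes(toy, k=3)
```

A few haplotypes accounting for most chromosomes, as in the toy data, is the pattern expected when the associated variants are in strong LD with one another.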

Figure 2

Linkage disequilibrium (LD) and haplotype structure around top cystic fibrosis (CF) lung disease-associated region. The upper panels (a–d) show the entire CF lung disease severity association region, whereas the lower panel (e) indicates the LD structure and the top 5 haplotypes of single-nucleotide polymorphisms (SNPs) with the most significant association P value (<10⁻⁷). The sections are: (a) scale bar and genome coordinates on chr11 of University of California Santa Cruz (UCSC) hg19 reference genome; (b) CF lung disease severity association P values; (c) LD plot SNP locations with respect to the genome coordinates in (a) (upward tick marks) that are then mapped to the LD plot in (d) (slanted lines); (d) LD plot generated by Haploview; (e) haplotype structure with allele genotypes and frequencies; the first SNP (rs77939918) and last SNP (rs7120959) are labeled in black font with black arrows at the bottom; the SNP of highest significance (rs10742326) is labeled in red font with a red arrow. Asterisks (*) indicate common alleles observed in the top five haplotypes.

The intergenic nature of this region suggests it acts through complex gene regulatory functions. Regulatory features downloaded from SeattleSeq Annotation 138, the UCSC genome browser, ENCODE (http://www.genome.gov/10005107; refs 21–25), and Roadmap Epigenomics servers, together with annotation information over the resequenced region on chr11, were collated by chromosomal location on hg19 (Supplementary Table 3).26 Careful examination reveals multiple complex, cell-type-specific regulatory features; for example, DNase I hypersensitive sites have been documented from human tracheal epithelium27 that are distinct from those identified in K562 cells (Figure 1j vs. k). The two DNase I hypersensitive sites in human tracheal epithelium (11.2524 and 11.2525; ref. 27) contain a large number of transcription factor binding sites (including four FOXA1 binding sites), as identified in chromatin immunoprecipitation sequencing assays (Figure 1l and Supplementary Table 4). In addition, epigenetic markers found in the Calu3 cell line (goblet cell model from human lung adenocarcinoma) point to potential transcription enhancer activity overlapping the two FOXA1 binding segments (Supplementary Table 3). These features are potentially relevant given the known roles for FOXA1 in mucin production.28,29 Furthermore, the highly significant SNP rs10742325 (Table 1) has been found to show an allele-specific DNase I footprint (false discovery rate <0.05).26

Although our catalog of observed variants is still incomplete because of gaps in sequence coverage and present limitations of alignment and variant calling from short sequence reads, these results suggest that there is no obvious single sequence variant that is driving the association in the region. The mechanism ultimately must explain how the regulatory features operate in the context of cell-specific effects and explain the Phe508del homozygous-specific association observed in this region.