Main

The 9.7-kb SFTPB (GeneID: 6439; Locus tag: HGNC 10801; MIM 178640) encodes a 79-amino-acid hydrophobic protein that is critical for function of the pulmonary surfactant (1). Functional pulmonary surfactant, a phospholipid-protein mixture that lines alveoli at the air-liquid interface, maintains alveolar patency at end expiration and is required for successful fetal-neonatal pulmonary transition. Studies in human newborn infants with rare recessive loss-of-function SFTPB mutations have demonstrated that genetic disruption of SFTPB expression is completely penetrant and lethal due to dysfunction of the pulmonary surfactant (2,3). Studies in conditionally regulated murine lineages and human infants indicate that >75% reduction in SFTPB expression is sufficient to cause surfactant dysfunction and respiratory distress (4,5). To provide a catalogue of SFTPB variants [single nucleotide polymorphisms (SNPs) or in/dels] for use in statistical and functional studies of SFTPB regulation, we used high-throughput, comprehensive resequencing of SFTPB in a cohort of sufficient size (n = 1116) to detect low-frequency variants. We report an excess of low-frequency variation, high rates of intragenic recombination, and a lack of common damaging exonic variants. Our results suggest that comprehensive resequencing will likely be advantageous over tagSNP genotyping approaches in genetic association analysis of SFTPB.

METHODS

Automated amplification and sequencing.

We extracted genomic DNA from 1116 Guthrie cards collected for newborn screening by the Missouri Department of Health and Senior Services (DHSS) (6). We linked each DNA sample anonymously to clinical characteristics in a vital statistics (birth-death certificate) database maintained by the Missouri DHSS to determine ethnicity. Using small amplicons [<500 base pairs (bp)], robotic high-throughput automated processes, and BigDye terminator sequencing chemistry (7), we bidirectionally sequenced SFTPB, including 1.8 kb of the promoter region, 1.1 kb of exonic sequence (all 10 translated exons), and 5.9 kb that includes all intervening intronic sequence except 380 bp (genomic position 1649–2028) in intron 4. We omitted part of intron 4 due to the inability of BigDye terminator sequencing chemistry to resolve variable numbers of dinucleotide repeats in this region (8). We also omitted one untranslated exon (exon 11) and its preceding intron (intron 10). All amplification and sequencing primers and conditions are available at http://genome.wustl.edu/activity/med_seq/primers.cgi. We used software applications (Phred, Phrap, PolyPhred, and Consed) to call bases, assemble contigs, and scan sequencing chromatograms for variation (http://www.phrap.org/phredphrapconsed.html).

To assess overall sequence quality, we used a quality-averaging program (J. Sloan, University of Washington) to quantify the Phred score at each base across SFTPB (Fig. 1). Because of variation in trace file quality, analysts reviewed and confirmed or edited, in each individual, all polymorphic sites identified by PolyPhred, sites with in/dels, and all sites previously identified as polymorphic in dbSNP. After manual polymorphism validation, we extracted genotypes for each DNA sample at the confirmed polymorphic sites for analysis. An average of 90% of genotypes were called in each individual using a minimum Phred score of 20.

Figure 1

Average Phred score by genomic location in SFTPB. The Phred score was calculated and averaged at each site across the finished SFTPB sequence. Sequence quality in intron 4 was low due to the inability of BigDye sequencing chemistry to resolve multiple CA repeats.

False-positive and -negative rates of SNP discovery.

Because of the high proportion of sequence variation attributable to rare polymorphic sites, we were concerned that SNP detection errors might bias our analysis. Systematic comparison of the results from two independent analysts identified 0.99% of calls as discrepant (452/45,505 genotypes): 67% of these were judged as false-positive calls (301/452) in low-quality (Phred score <20) data, and all discrepant calls were classified as missing data. Using an independent genotyping method, Taqman (9), we compared genotypes at five high-frequency polymorphic sites in 558 individuals to the genotypes called from sequence data and found 27 discrepant calls in 2790 genotypes, with 10 confirmed Taqman heterozygotes, for a false-negative heterozygote detection rate of 0.36%. Next, we reamplified and resequenced all heterozygous sites identified in fewer than three individuals (41 genotypes in 49 individuals) with different primer sets and confirmed genotypes at all these sites. Finally, we examined base calls and sequence quality (Phred score) at 42 sites polymorphic in other cohorts but not in this cohort (45,780 genotypes). Of the 41,555 genotypes with high-quality (Phred score >20) sequence, we found no rare alleles missed by chromatogram analysis (0%). We could not call the remaining 5317 genotypes (11.6%) due to low-quality chromatograms in those specific samples. These results suggest false-positive and -negative rates of less than 1%.
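The percentages quoted above follow directly from the reported counts. The short Python sketch below reproduces that arithmetic as a reader's check; the helper name `rate_percent` is illustrative and is not part of the study's analysis pipeline.

```python
def rate_percent(events: int, total: int) -> float:
    """Return events/total expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * events / total


# Counts taken from the paragraph above.
analyst_discrepancy = rate_percent(452, 45_505)   # calls that differed between two analysts
false_positive_share = rate_percent(301, 452)     # discrepant calls judged false positive
missed_heterozygotes = rate_percent(10, 2_790)    # Taqman heterozygotes missed by sequencing

print(f"{analyst_discrepancy:.2f}% {false_positive_share:.0f}% {missed_heterozygotes:.2f}%")
# prints: 0.99% 67% 0.36%
```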

LD, haplotype estimation, recombination rate, and hot spot location determination.

LD is a measure of the allelic correlation between two SNPs. Several LD statistics are available (10); D′ is the ratio of the observed LD to the strongest possible LD given the allele frequencies of the SNPs. |D′| = 1 when there is no detectable recombination between SNPs. Haplotypes are patterns of alleles across multiple SNPs along a single chromosome. We used PHASE (v. 2.1) to infer haplotypes computationally from genotypes within each racial group (11,12). To assess whether haplotypes of common variants [minor allele frequency (MAF) >5%] can predict genotype at low-frequency SFTPB alleles, we used HAPLOVIEW (v. 3.31) (http://www.broad.mit.edu/mpg/haploview/) in aggressive mode to select a minimal set of tagSNPs such that all other SNPs were strongly correlated (r² ≥ 0.8) with either a tagSNP or a haplotype of several tagSNPs (13). We used PHASE to estimate background recombination rate, determine hot spot location, and compute Bayes factors (BFs) as previously described (14) for either intragenic SNPs with MAF >5% or for HapMap SNPs (MAF >5%) within 50 kb of SFTPB (data release #21 as of July 2006) (http://www.hapmap.org). BFs are likelihood ratios: the probability of the observed data assuming a recombination hot spot divided by the probability of the data assuming uniform recombination across the region. A BF of 10 suggests that the haplotype data at a genomic location are 10 times more likely to be consistent with the presence of a hot spot than with its absence, and a BF of >10 is substantive evidence for the presence of a recombination hot spot.
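To make the two LD statistics concrete, the following Python sketch computes D, D′, and r² for a pair of biallelic SNPs from haplotype and allele frequencies. It uses the standard textbook definitions and is not the HAPLOVIEW or PHASE implementation; the function name `ld_stats` and the example frequencies are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A (SNP 1) and allele B (SNP 2)
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    d = p_ab - p_a * p_b
    # The largest |D| attainable depends on the sign of D and on the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical example: a rare allele (10%) found only on the background of a
# common allele (50%). D' is 1, yet r^2 is far below the 0.8 tagging threshold.
d, d_prime, r2 = ld_stats(p_ab=0.10, p_a=0.10, p_b=0.50)
print(f"D'={d_prime:.2f} r2={r2:.2f}")
```

The example illustrates why a common SNP can be in complete LD (|D′| = 1) with a rare allele and still be a poor tag for it: r² also depends on how closely the two allele frequencies match.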

Molecular evolution.

Discovery of genomic regions under selective pressure may help inform genetic association studies because evolutionarily constrained sequences are presumably functional. We used three statistical strategies to screen SFTPB for selective pressure. To assess whether genetic variation in regions of SFTPB was consistent with neutral evolution, we used two statistical tests of observed sequence diversity against theoretical predictions for neutral sequence, Tajima's D (15) and the Fu and Li D* (16). Tajima's D compares two descriptive statistics (θ and π) for sequence diversity: θ is based on the number of chromosomes screened and the number of polymorphisms observed in SFTPB (17), whereas π is based on the number of chromosomes screened and the average allele frequency of the polymorphisms identified (18,19). We used SLIDER (http://genapps.uchicago.edu/slider/index.html) to calculate Tajima's D. The Fu and Li D* compares π with a third sequence diversity statistic derived from the number of singleton polymorphisms observed (SNPs with the rare allele observed only once in the data) (19).
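As an illustration of how Tajima's D contrasts the two diversity estimates, the Python sketch below implements the standard published formula (15) from per-site minor-allele counts. It is a minimal sketch, not the SLIDER implementation used in this study; the function names and the toy allele counts are hypothetical, and it assumes complete genotype data at every site.

```python
import math


def pairwise_diversity(minor_allele_counts, n):
    """Average number of pairwise differences (pi) among n chromosomes."""
    pairs = n * (n - 1) / 2
    # At a site where k of n chromosomes carry the minor allele, k*(n-k) pairs differ.
    return sum(k * (n - k) for k in minor_allele_counts) / pairs


def tajimas_d(minor_allele_counts, n):
    """Tajima's D from minor-allele counts at each segregating site."""
    s = len(minor_allele_counts)  # number of segregating sites
    if n < 4 or s == 0:
        raise ValueError("need at least 4 chromosomes and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = pairwise_diversity(minor_allele_counts, n)
    theta_w = s / a1  # diversity estimate based on the number of segregating sites
    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: 10 chromosomes, 5 segregating sites.
print(tajimas_d([1, 1, 1, 1, 1], n=10))  # all singletons: negative D
print(tajimas_d([5, 5, 5, 5, 5], n=10))  # intermediate frequencies: positive D
```

An excess of rare alleles lowers π relative to θ and drives D negative, which is the pattern a sliding-window scan looks for.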

We also characterized selection pressure by using the ratio of nonsynonymous to synonymous substitution rates (dN/dS) calculated from the observed SNPs using SNAP (Synonymous/Nonsynonymous Analysis Program) (http://www.hiv.lanl.gov/content/hiv-db/SNAP/WEBSNAP/SNAP.html) (20,21). A dN/dS ratio >1 suggests more nonsynonymous substitutions than expected under the neutral model and is evidence of positive selection, whereas a dN/dS ratio <1 is evidence of purifying selection against some amino acid replacement mutations.

The third statistic that we used was the McDonald-Kreitman test (22), which compares the within-species dN/dS ratio for polymorphism in our sample against the between-species ratio for fixed differences (23) (http://www.ebi.ac.uk/clustalw/).
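The McDonald-Kreitman comparison reduces to a test of independence on a 2 × 2 table (polymorphic vs fixed, nonsynonymous vs synonymous). The Python sketch below shows one way to compute it, assuming a Pearson χ² statistic with 1 degree of freedom and no continuity correction; the text does not state which variant of the test was applied, and the function names and example counts are hypothetical.

```python
import math


def chi2_p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def mk_chi_square(poly_nonsyn, poly_syn, fixed_nonsyn, fixed_syn):
    """Pearson chi-square and p-value for the 2 x 2 McDonald-Kreitman table."""
    n = poly_nonsyn + poly_syn + fixed_nonsyn + fixed_syn
    denom = ((poly_nonsyn + poly_syn) * (fixed_nonsyn + fixed_syn)
             * (poly_nonsyn + fixed_nonsyn) * (poly_syn + fixed_syn))
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (poly_nonsyn * fixed_syn - poly_syn * fixed_nonsyn) ** 2 / denom
    return chi2, chi2_p_value_1df(chi2)


# Hypothetical counts, for illustration only.
chi2, p = mk_chi_square(poly_nonsyn=8, poly_syn=4, fixed_nonsyn=40, fixed_syn=35)
print(f"chi2={chi2:.2f} p={p:.2f}")
```

With small cell counts, an exact test may be preferable to the χ² approximation. As a consistency check on the helper, `chi2_p_value_1df(0.43)` evaluates to about 0.51.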

Statistical methods and Human Studies Committee approval.

We analyzed all data using Statistical Analysis System (v. 9.3.1) (SAS, Inc., Cary, NC). The Human Research Protection Office at the Washington University Medical Center and the Institutional Review Board at the Missouri DHSS reviewed and approved this study.

RESULTS

Genetic variant discovery.

We were unable to screen 380 bp of intron 4 due to a highly polymorphic CA repeat region. In the remaining sequence, we found 86 polymorphic sites including 81 SNPs and five small in/dels (9.8 polymorphic sites per 1000 bp of SFTPB reference sequence), with similar frequencies in the promoter (eight per 1000 bp), introns (10 per 1000 bp), and exons (12 per 1000 bp) (χ² analysis, p = 0.7) (Table 1). The overall SNP density was 9.2/1000 bp. The Phred scores within 10 bp of each polymorphic site (37 ± 6) (mean ± SD) were excellent, suggesting that sequence quality did not limit genetic variant discovery (Fig. 1). The average number of polymorphic sites per individual was greater in African-Americans than other races (all p < 0.01) (Table 1). The race-specific relative genotype frequencies at each polymorphic site did not differ significantly from Hardy-Weinberg prediction (all p > 0.05). The majority of variant sites in SFTPB are low frequency: 67 of 86 sites had MAF <5%. Potentially disruptive variants were also rare: eight of nine nonsynonymous variants and six of seven intronic SNPs within 20 bp of an intron-exon junction were rare. To determine whether nonsynonymous SNPs might disrupt surfactant protein B function, we used two homology-based software tools, SIFT (Sorting Intolerant from Tolerant) (24) and PolyPhen (25). We found that eight of nine sites were not classified as intolerant or damaging. One site (genomic position 2558) in exon 5 that encodes either glycine or glutamic acid (G183E) was classified as probably damaging by PolyPhen, but tolerated by SIFT, and is rare (MAF 0.1%). The lack of definitively damaging or intolerant SNPs in this large cohort suggests strong, purifying, selective pressure against rare variants that encode dysfunctional surfactant protein B, likely due to the critical role of the encoded protein in successful fetal-neonatal pulmonary transition (26).
Despite evaluating a much larger cohort size (1116 versus 90 individuals from the Polymorphism Resource Discovery panel), these estimates are considerably lower than estimates of damaging exonic variants in 213 environmental genes (27).
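The Hardy-Weinberg comparison reported above can be sketched as a goodness-of-fit test on genotype counts at a single site. The Python example below assumes a χ² test with 1 degree of freedom; the text does not specify which Hardy-Weinberg test was used, and for rare alleles an exact test is generally preferred. The function name and genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square (1 df) comparing genotype counts with Hardy-Weinberg expectation."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail, 1 degree of freedom
    return chi2, p_value


# Hypothetical genotype counts at one site.
chi2, p_value = hwe_chi_square(n_aa=25, n_ab=50, n_bb=25)
print(chi2, p_value)  # counts in exact Hardy-Weinberg proportions
```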

Table 1 SFTPB polymorphic sites by race in the Missouri cohort

To determine whether variants at intron-exon junctions might disrupt expression, we used a neural network application (http://www.fruitfly.org/seq_tools/splice-instrucs.html) trained to recognize potential human splice sites based on a large training set of known human splice sites. We found that the only common intron-exon junction SNP (genomic position 4550, rs893159) was predicted to alter RNA splicing by creating a second acceptor site for exon 8. The score for a second acceptor site increased from 0.47 to 0.78 when the minor allele was substituted, whereas the score for the predicted exon 8 acceptor site is 0.65. This finding suggests that RNA splicing may be altered by this SNP.

To validate experimentally a published mathematical simulation of the number of haploid genomes required to detect SNPs with minimum allele frequency greater than a given frequency (28), we performed 1000 race-stratified sampling iterations for SFTPB (Table 2). Our data for SFTPB confirm the theoretical prediction based on the standard neutral model of population genetics, show that a cohort size of ≤48 haploid genomes will miss 11%–18% of SNPs with frequencies of ≥1%, and provide direct evidence of the influence of population history on estimates of cohort size necessary to detect rare SNPs.
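The dependence of SNP detection on cohort size can be illustrated with a short Python sketch. It shows (i) the probability of observing a minor allele at least once among independently drawn chromosomes, and (ii) a simple resampling estimate of the fraction of known polymorphic sites recovered in a subsample. This is a simplified illustration, not the race-stratified procedure or the neutral-model simulation (28) used here; the function names and the `chromosomes` input format (one tuple of 0/1 alleles per chromosome) are assumptions.

```python
import random


def detection_probability(maf, n_haploid):
    """Chance of seeing the minor allele at least once in n_haploid independent draws."""
    return 1.0 - (1.0 - maf) ** n_haploid


def resampled_detection_rate(chromosomes, n_haploid, iterations=1000, seed=0):
    """Mean fraction of sites whose minor allele (coded 1) appears in a random
    subsample of n_haploid chromosomes drawn without replacement."""
    rng = random.Random(seed)
    n_sites = len(chromosomes[0])
    total = 0.0
    for _ in range(iterations):
        sample = rng.sample(chromosomes, n_haploid)
        seen = sum(1 for j in range(n_sites) if any(c[j] for c in sample))
        total += seen / n_sites
    return total / iterations


# A single SNP with MAF exactly 1% is seen in 48 haploid genomes about 38% of the time.
print(round(detection_probability(0.01, 48), 3))
```

The per-SNP probability at exactly 1% is lower than the cohort-level detection rate quoted above because the latter averages over all SNPs with frequencies of ≥1%, most of which are more common than 1%.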

Table 2 Detection rate for SNPs with a given minimum allele frequency in SFTPB

LD.

Statistical power of genetic association studies may be increased, and genotyping costs decreased, by identifying highly correlated tagSNPs. LD is a statistical measure of allelic correlation between polymorphisms. Using common genotypes (MAF >5%), we detected weak LD across SFTPB despite its small genomic size (Fig. 2). In view of the effect of cohort size on LD, we randomly selected European-American cohorts similar in size to the African-American cohort and found similar results (29,30). Using the tagger function in HAPLOVIEW, we were unable to capture rare variants when using common markers as tagSNPs. Using the Genome Variation Server maintained by Seattle SNPs (http://gvs.gs.washington.edu/GVS), we found weak LD within SFTPB. Weak LD suggests that the genomic region that includes SFTPB spans a recombination hot spot (14).

Figure 2

VG2 plot of LD (D′) within SFTPB using common genotypes (MAF >5%) in African-American and European-American infants. Weak LD is present in both populations across the entire gene.

Haplotype diversity, estimation of background recombination rate, and recombination hot spot determination.

We used PHASE with common genotypes (MAF >5%) to infer haplotypes (Fig. 3) and observed high haplotype diversity consistent with intragenic recombination. To determine whether SFTPB includes a recombination hot spot, we estimated recombination parameters using PHASE and calculated BFs, a measurement of the strength of the evidence of a recombination hot spot (14). In the resequencing data alone, the intragenic recombination rate over background (Fig. 4A) and BF values (5.9 in European-American, 2.2 in African-American) did not suggest a recombination hot spot. However, when we calculated the recombination rate and BFs for a 107-kb window flanking SFTPB in HapMap data, we found a 20- to 80-fold increase in the recombination rate within SFTPB (Fig. 4B), and BF values of 1353 in both populations. As suggested by comparison of BFs with background recombination rates in each of these cohorts (Fig. 5), the high intragenic recombination rate was not detected in the resequencing data because the recombination hot spot spans most of the resequenced region.

Figure 3

VH1 plot of common SFTPB haplotypes in African-American and European-American infants. Using 17 SNPs (African-American) and 10 SNPs (European-American), we estimated computationally 82 unique African-American haplotypes and 80 unique European-American haplotypes. Most haplotypes (59/82 African-American haplotypes and 59/80 European-American haplotypes) were rare (<1%). Arrows indicate haplotypes with frequencies >5%, and individual haplotype frequencies for common haplotypes are provided.

Figure 4

Changes in recombination rate vs background recombination rate in the Missouri cohort within SFTPB (A) and in the HapMap Project Yoruban and European cohorts (http://www.hapmap.org) near (±50 kb) SFTPB (B). (A) Within SFTPB, little change in recombination rate is detectable; positions of translated exons are shown in numbered blue boxes. (B) Across the flanking region, a 20- to 80-fold increase in recombination rate is present within SFTPB.

Figure 5

Strength of evidence of a recombination hot spot within SFTPB using the Missouri cohort (SFTPB genomic region) and the HapMap cohort (http://www.hapmap.org/) (100-kb region that includes SFTPB). EA, European-American (Missouri cohort); AA, African-American (Missouri cohort); Yri, Yoruban population (HapMap Project); Ceph, European descent population (HapMap Project, data release #21 as of July 2006).

Molecular evolution.

To test whether SFTPB variation is consistent with predictions from the neutral theory of molecular evolution, we used Tajima's D and the Fu and Li D* (Table 3). Both measures were consistently negative for both African-Americans and European-Americans, suggesting an excess of low-frequency variation in SFTPB, although this trend was not significant. Using a sliding window approach (Fig. 6) (19), we found that the genomic region that encodes mature surfactant protein B (exons 6 and 7) had the most negative values, consistent with negative selection against variation in these exons. To evaluate conservation across species, we compared dN/dS in this cohort with SFTPB in Mus musculus (GenBank number NM_147779). The overall dN/dS ratio for this cohort was 2.0 (eight nonsynonymous and four synonymous sites). In a human-mouse comparison, SNAP determined the dN/dS ratio to be 0.94 (two nonsynonymous and 2.12 synonymous) across these two species, consistent with neutral evolution over time. The McDonald-Kreitman test was also consistent with neutral evolution (χ² = 0.43, p = 0.51). These results suggest that although much of the variation in SFTPB is selectively neutral, the excess of low-frequency variation near the exons containing mature SFTPB may be attributable to the presence of a modest number of mildly deleterious polymorphisms subject to negative selective pressure.

Table 3 Nucleotide diversity and neutrality tests in European-Americans and African-Americans for coding and noncoding regions and for synonymous and nonsynonymous SNPs

Figure 6

Comparison of Tajima's D (A) and the Fu and Li D* (B) across SFTPB for the Missouri cohort using a 900-bp sliding window. EA (♦), European-American infants; AA (▪), African-American infants; positions of translated exons shown in numbered blue boxes.

DISCUSSION

Because neonatal respiratory distress syndrome is unambiguously associated with rare recessive SFTPB mutations and is observed when SFTPB expression is reduced by >75% (2–5), SFTPB is a candidate gene for neonatal respiratory distress syndrome. Previous studies using unrelated case-control designs or family-based association tests with genotypes at high-frequency polymorphic sites have suggested an association between genotypes or haplotypes and neonatal respiratory distress (31–33). To inform studies of genetic regulation of SFTPB, we adapted production-level, polymerase chain reaction–based sequencing technology for comprehensive genetic variant discovery (7). We found high SNP density (28,34), weak LD, and, using data from the HapMap Project, strong evidence of a recombination hot spot within SFTPB. The coincidence of high SNP density, excess low-frequency sites, and high recombination rate has been observed at other loci in Drosophila and humans (35–37), consistent with an increased mutation rate within recombination hot spots. These characteristics suggest that use of common SFTPB haplotypes or tagSNPs will not capture statistically robust associations with disease-causing alleles in unrelated, genetically diverse case-control cohorts (38). Genetic bottlenecks in small populations will increase LD, but typically do so only for rare subsets of SNPs. LD between the higher frequency SNPs will not be substantially altered by bottlenecks in founder populations. Thus, at SFTPB, comprehensive resequencing in large case-control cohorts is advantageous for genetic association studies of neonatal respiratory distress syndrome because the elevated mutation rate enhances the frequency of rare deleterious mutations, whereas the high recombination rate makes LD between common SNPs too low for useful tagSNP selection.

In view of the lack of common damaging exonic SNPs observed in SFTPB, association studies of neonatal respiratory distress syndrome will need to focus on regulatory variation. For example, our data using a neural network application trained to recognize potential human splice sites suggest that the intron-exon junction SNP at genomic position 4550 (rs893159) may alter RNA splicing, resulting in misprocessed or misdirected surfactant protein B and disrupting surfactant function. Our results also suggest the value of mechanistic studies in the genetic pathogenesis of SFTPB mutations. A second SNP in intron 2 (SNP 1013, rs3024798) may affect recombination rates within SFTPB because it disrupts a motif in intron 2 (CCTCCCT > CCTCCAT) that has been associated with recombination hot spot activity (39). Recombination rates correlate positively with mutation rates, so high recombination rate alleles may be more prone to the de novo SFTPB mutations seen in severe neonatal respiratory distress syndrome.