Introduction

The advent of massively parallel sequencing technologies has resulted in thousands of exomes and sequenced genomes, creating a catalog of genetic variation for comparison purposes that is likely to grow substantially in the coming years. The 1000 Genomes Project (1kGP) has published more than 1,000 complete human genomes representing healthy individuals.1 The National Heart, Lung, and Blood Institute (NHLBI) Grand Opportunity (GO) Exome Sequencing Project (ESP5400) sequenced whole exomes of 5,400 individuals ascertained for phenotypes related to heart, lung, and blood disorders.2 These genomes can act as controls for mutation significance in rare Mendelian disorders in which previous studies were often limited to sequencing a few hundred controls to determine if a detected variant is rare and normal or unique to a syndrome.

Orofacial clefts are common birth defects affecting ~1 in 1,000 individuals worldwide. Although most orofacial clefts are nonsyndromic, 30% are designated as syndromic due to the presence of additional physical or cognitive abnormalities. Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/omim) identifies more than 500 associations with clefting, including more than 100 Mendelian syndromes and/or chromosomal deletion/duplication associations, as well as many sporadic or family-specific associations. Of these, Van der Woude syndrome (VWS, OMIM 119300) is among the most common, accounting for ~2% of all clefts.3,4 VWS is broadly characterized by clefts of the lip and/or palate and congenital paramedian lower lip pits.5 There is considerable variability in the phenotypic expression of VWS, which can range from lip pits alone to bilateral cleft lip and palate.6 VWS is one of two dominant allelic disorders caused by heterozygous mutations in IRF6.7 The popliteal pterygium syndrome (PPS, OMIM 119500) is characterized by the clinical features of VWS (clefts of the lip and/or palate and lip pits), with additional features that include webbing of skin behind the knees, genital anomalies, syndactyly, oral adhesions, and other anomalies.8 The IRF6 gene is one of a family of nine IRF genes that code for transcription factors that share a highly conserved helix-turn-helix DNA-binding domain (DBD) and a less conserved protein-binding domain.7 Since its identification as the gene mutated in VWS and PPS, hundreds of mutations in IRF6 have been reported.7,9,10,11,12,13,14,15,16,17,18,19,20

Given the rarity of VWS and PPS (1/35,000 and 1/300,000, respectively),21,22 rates of mutations in functional elements of IRF6 in any individuals sequenced as normal controls or as part of unrelated disease-based cohorts would be expected to be low. Therefore, the resources of the 1kGP and the ESP5400 represent a control cohort that is larger than any previously available. Approximately 300 pathogenic variants in IRF6 in individuals with VWS or PPS have been identified. To determine if any of these variants could be extremely rare but normal variants in the general population, we compared the list of previously reported IRF6 variants to the 1kGP and the ESP5400 databases.

The work by de Lima et al.9 described the distribution of IRF6 mutations with the goal of identifying the exons most likely to carry mutations. This was clinically useful for prioritizing mutation discovery efforts and suggested broad genotype–phenotype relationships, but categorizing the mutation distribution by exon does little to refine the regions of IRF6 important for function. Since then, additional mutations have been reported, and we were able to carefully characterize the distribution of IRF6 variants. This allowed us to identify the residues whose disruption is likely to be damaging (as the etiologic cause of VWS or PPS) and to further define the domains of the protein most critical for IRF6 function in craniofacial development. This is biologically significant because it allows us to prioritize mutations for functional studies and offers insight into structure–function relationships for IRF6 and other members of this highly conserved family of transcription factors. In addition, examining the spectrum of IRF6 variation present in VWS or PPS and the whole-exome databases provides a benchmark for clinically interpreting IRF6 variants from future whole-exome or whole-genome sequencing projects.

Materials and Methods

Compilation of mutation data

To identify published IRF6 mutations, we performed a PubMed search using the following terms: “IRF6,” “Van der Woude syndrome,” “VWS,” “popliteal pterygium syndrome,” and “PPS.” Additional mutations were obtained from the clinical sequencing database at GeneDx (Gaithersburg, MD) or reported from research sequencing in our laboratory. Control variants were obtained from the 1kGP (1,091 individuals, February 2012 data release) and the NHLBI ESP (5,379 individuals, ESP5400). Variants from the 1kGP were annotated using the SeattleSeq SNP annotation software (Build 134, http://snp.gs.washington.edu/SeattleSeqAnnotation134/). Several mutations in exons 1 and 2, which make up the 5′ untranslated region, have previously been reported to cause VWS. These mutations create an alternate upstream start codon and are predicted to create truncated IRF6 proteins. We categorized these with missense mutations at position M1 as mutations that “alter the start codon.” No sequencing data are available for exons 1 and 2 from whole-exome sequencing due to the limitations of the exome capture technique, so we restricted analysis to mutations in the seven protein-coding exons of IRF6.

Sequencing

For 23 previously unreported cases of VWS or PPS, exons 1–8 and the protein-coding part of exon 9 were amplified using primers previously published by de Lima et al. PCR products were sent for sequencing on an ABI 3730XL (Functional Biosciences, Madison, WI). Chromatograms were transferred to a Unix workstation, base called with PHRED (v. 0.961028), assembled with PHRAP (v. 0.960731), scanned by POLYPHRED (v. 0.970312), and viewed with the CONSED program (v. 4.0).

Bioinformatics

Polyphen2 and sorting intolerant from tolerant (SIFT) were used to predict the damaging effects of missense mutations.23,24 IRF6 protein sequences from human and 17 other species were obtained from the Ensembl database (human (NP_006138.1), chimpanzee (ENSPTRP00000003274), gorilla (ENSGGOP00000006076), macaque (ENSMMUP00000029056), bushbaby (ENSOGAP00000009580), mouse (NP_058547.2), rat (NP_001102329), guinea pig (ENSCPOP00000015286), rabbit (ENSOCUP00000003494), cow (ENSBTAP00000054388), cat (ENSFCAP00000014042), dog (ENSCAFP00000017624), elephant (ENSLAFP00000001156), pig (ENSSSCP00000016552), chicken (ENSGALP00000039479), lizard (ENSACAP00000005847), frog (ENSXETP00000040424), and zebrafish (ENSDARP00000061534)). Sequences were aligned using ClustalW and viewed in Jalview (version 2.7),25 which also provided a numerical conservation score based on the chemical properties of the amino acids in the alignment.

Statistical analysis

Frequencies were calculated as the number of variants per 100 bp to account for differences in exon size (ranging from 72 to 393 bp), using only exons 3–9, which encode the IRF6 protein. To visualize the distribution of variants, we performed a sliding window analysis using a 33–amino acid (99 bp) sliding window with a step size of 1 amino acid (3 bp). The expected number of variants per 100 bp was calculated from the total number of variants evenly distributed along the length of the IRF6 cDNA (1,404 bp). Normalized variant counts per exon were compared with expected counts using a 1 degree of freedom χ2 test. To account for multiple comparisons, we established a Bonferroni significance threshold of P = 0.007 (i.e., 0.05/7). The 2 × 7 tables showing the distribution of variants in the exons of IRF6 were analyzed using χ2 or Fisher’s exact test. We also compared the distribution of mutations in the known IRF6 protein domains (DNA binding, protein binding, and other) in the same manner. A Wilcoxon rank-sum test was performed to determine the difference in conservation scores between missense variants in cases and controls using STATA (version 12.0; StataCorp, College Station, TX).
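The sliding-window count and the per-exon comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the χ2 statistic is formed by contrasting one exon with the rest of the cDNA (one plausible reading of a 1 degree of freedom test), and the P value uses the identity that the upper tail of a χ2 distribution with 1 degree of freedom equals erfc(√(χ2/2)).

```python
import math

def sliding_window_counts(residues, protein_length, window=33, step=1):
    """Count variants whose residue number falls inside each window.

    residues: 1-based amino acid positions, one entry per variant.
    Returns one count per window start (1, 1 + step, ...).
    """
    counts = []
    for start in range(1, protein_length - window + 2, step):
        end = start + window - 1
        counts.append(sum(1 for r in residues if start <= r <= end))
    return counts

def exon_chi_square(observed, exon_bp, total_variants, total_bp):
    """1-df chi-square for one exon versus the remainder of the cDNA.

    The expected count assumes variants are spread evenly per base pair.
    """
    expected_in = total_variants * exon_bp / total_bp
    expected_out = total_variants - expected_in
    observed_out = total_variants - observed
    chi2 = ((observed - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 df
    return chi2, p_value

# Bonferroni threshold for seven protein-coding exons, as in the text.
BONFERRONI_ALPHA = 0.05 / 7
```

An exon would be flagged as enriched or depleted when its P value falls below `BONFERRONI_ALPHA`; the direction follows from whether the observed count is above or below the expected count.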

Results

Prevalence of IRF6 mutations

There were 295 distinct mutations identified in 549 families with VWS or PPS (Supplementary Table S1 online). Missense mutations were the most common (51.7%), whereas a large portion (40.5%) of the remaining variants resulted in a truncated IRF6 protein (nonsense, frameshift, and altered start codons) (Table 1). For the families in which the syndrome was specified, we compared the types of mutations causing VWS with the types causing PPS (Supplementary Table S2 online). Every category of mutation was represented among VWS families, whereas PPS mutations were limited to missense, nonsense, and splicing mutations.

Table 1 Distribution of IRF6 mutations in Van der Woude syndrome/popliteal pterygium syndrome by type of mutation

CpG dinucleotides are common mutation hotspots due to methylation and spontaneous deamination of cytosine to thymine. There were 24 CpG dinucleotides in the protein-coding exons of IRF6, and C→T or G→A transitions at these dinucleotides could create 35 different missense or nonsense changes (Supplementary Table S3 online). In VWS and PPS cases, 9 of the 24 CpG dinucleotides had C→T or G→A transitions, totaling 15 different mutations. Although these mutations are a small fraction (5.3%) of the total number reported in IRF6, they are responsible for VWS or PPS in ~30% of families (P = 1.6 × 10−5). Missense mutations at CpG dinucleotides were responsible for half of PPS families (Supplementary Table S2 online), and the majority of these were either R84C or R84H.
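To illustrate how the possible coding changes at CpG dinucleotides can be enumerated, the following Python sketch scans an in-frame coding sequence, applies a C→T transition at the C and a G→A transition at the G of each CpG, and classifies the result as missense or nonsense. The function name and the toy sequences in the usage note are assumptions for illustration; the actual IRF6 coding sequence is not reproduced here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def cpg_transitions(cds):
    """List coding changes caused by C>T or G>A at each CpG.

    cds: uppercase coding sequence starting in frame.
    Returns tuples (residue_number, ref_aa, alt_aa, kind);
    synonymous changes are skipped.
    """
    changes = []
    for i in range(len(cds) - 1):
        if cds[i:i + 2] != "CG":
            continue
        # C>T on the coding strand; G>A reflects C>T on the other strand.
        for pos, new_base in ((i, "T"), (i + 1, "A")):
            codon_start = pos - pos % 3
            codon = cds[codon_start:codon_start + 3]
            if len(codon) < 3:
                continue  # trailing partial codon
            offset = pos % 3
            mutated = codon[:offset] + new_base + codon[offset + 1:]
            ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
            if ref_aa == alt_aa:
                continue
            kind = "nonsense" if alt_aa == "*" else "missense"
            changes.append((codon_start // 3 + 1, ref_aa, alt_aa, kind))
    return changes
```

For a toy arginine codon CGC, `cpg_transitions("CGC")` yields an R→C and an R→H change, the same pair of substitutions seen at a residue such as R84; an arginine codon CGA instead yields a stop codon and a glutamine.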

We analyzed the distribution of mutations in the seven protein-coding exons of IRF6, normalized to 100 bp to account for differences in exon size. The average frequency was 25 mutations per 100 bp. The frequency was significantly higher in exon 4 (P = 5.07 × 10−8) and lower in exon 5 (P = 4.07 × 10−5) (Figure 1, Table 2). Overall, the distribution of mutations in the protein-coding exons was nonrandom (P = 1.37 × 10−4). We observed a similar pattern when accounting for independent mutations (i.e., the same variant in two families counted once) (Supplementary Tables S4–S6 online).

Figure 1

Sliding-window analysis of IRF6 variants. A window size of 99 bp was used with a step size of 3 bp. Variant counts included or excluded the 15 variants at CpG sites. Red and purple represent counts in Van der Woude syndrome/popliteal pterygium syndrome families; blue and green represent counts in 1kGP/ESP5400 controls. Dashed lines indicate the expected number of variants per 100 bp if the total number of variants were distributed evenly across IRF6 (expected cases = 25; expected controls = 2).

Table 2 Distribution of variants in protein-coding exons of IRF6

Approximately 90% of the mutations causing VWS or PPS are missense or truncation mutations, and the remaining 10% include those predicted to create new start codons or affect splicing. Excluding mutations at CpG dinucleotides, the frequency of missense mutations was significantly increased in exon 4 (P = 6.41 × 10−11) (Table 3). There was also a significant decrease in missense mutations in exon 5 (P = 2.27 × 10−4) and exon 6 (P = 6.73 × 10−4). Overall, missense mutations were nonrandomly distributed (P = 7.21 × 10−7). However, truncation mutations were evenly distributed across IRF6 (P = 0.49).

Table 3 Distribution of VWS/PPS missense and truncation mutations in protein-coding exons of IRF6

Distribution of mutations in IRF6 domains

IRF6 contains an N-terminal DBD, encoded in exons 3 and 4, and a C-terminal protein-binding domain, encoded in exons 7–9. The frequency of mutations was increased in the DBD (P = 4.09 × 10−4) but not in the protein-binding domain (P = 0.37) (Table 4). The overall distribution was only marginally nonrandom when considering the number of families with mutations in each domain (P = 0.05).

Table 4 Distribution of variants in domains of IRF6

Prevalence of IRF6 variants in controls

We analyzed sequences from 6,470 individuals from the 1kGP and the ESP5400 for variants in IRF6 (Supplementary Table S7 online). Although 22 different missense variants were identified, we excluded V274I (rs2235371) from analyses because it is a common polymorphism, observed in 3% of European and 30% of Asian individuals. The remaining 21 missense variants were distributed among only 33 individuals (0.5% of controls). Seven were C→T or G→A transitions at CpG dinucleotides. The average rate was two variants per 100 bp, and these were evenly distributed throughout IRF6 (Table 2, P = 0.91). Overall, the distribution of missense variants was significantly different between cases and controls (P = 3.4 × 10−4).

Known pathogenic variants in controls

We compared variants reported in VWS and PPS cases with those from the 1kGP and the ESP5400. There were two variants in common between these data sets. The first, R45Q, was reported by Kayano et al.26 in a father with lip pits and his daughter, who had cleft lip and lip pits. It was found in a Luhya female from the 1kGP. This variant is highly conserved and is predicted to be damaging by both Polyphen2 and SIFT. The second variant, D354N, was first reported by Jehee et al.19 in an individual with cleft palate and her unaffected mother. D354N was also found in two individuals with VWS sequenced at GeneDx. Jehee et al.19 showed that this variant reduces the GeneSplicer-predicted splicing score from 4.40 to 1.75. The aspartic acid residue is conserved among primates and placental mammals but is predicted to be benign by both Polyphen2 and SIFT. In addition, this variant was identified in three European Americans in the ESP5400 cohort.
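The comparison itself amounts to intersecting two variant lists. The following Python sketch shows the idea on a toy subset of variants named in this article; the set contents, the variable names, and the choice to key variants by protein change alone are illustrative assumptions (a real comparison would more safely key on genomic coordinate and allele).

```python
# Toy subsets for illustration only, not the full study data.
case_variants = {"L22P", "R45Q", "R84C", "R84H", "D354N"}
control_variants = {"R45Q", "V274I", "D354N"}

def shared_variants(cases, controls, excluded=frozenset()):
    """Return variants seen in both sets, minus known common polymorphisms."""
    return sorted((set(cases) & set(controls)) - set(excluded))

# V274I is excluded as a common polymorphism, as described in the text.
overlap = shared_variants(case_variants, control_variants, excluded={"V274I"})
```

With these toy inputs, `overlap` contains the two variants discussed in this section, D354N and R45Q.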

Bioinformatic predictions for IRF6 missense variants

Many computational tools have been developed to discriminate pathogenic and benign variants from sequence data.27 SIFT and Polyphen2 are two such tools.23,24 We compared a set of missense variants identified from VWS and PPS cases with whole-exome or genome-sequencing databases using both programs. Although Polyphen2 predicted that only 5.9% of the missense variants in VWS and PPS cases are benign, SIFT was more conservative and predicted a higher percentage (42.3%) to be benign (Supplementary Figure S1 online). SIFT similarly predicted more of the 1kGP and ESP5400 variants to be benign (95.2% vs. 76.2% by Polyphen2). Amino acid conservation is another metric for evaluating pathogenicity of rare variants. Missense variants in VWS and PPS had higher conservation scores (average, 10.64) than missense variants in 1kGP/ESP5400 (average, 7.31); this was statistically significant (P = 8.14 × 10−16).
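The conservation comparison rests on a Wilcoxon rank-sum test, which the study ran in STATA. As a rough illustration of what that test computes, the Python sketch below implements the two-sided test with average ranks for ties and a normal approximation with tie correction. The function name is hypothetical, no continuity correction is applied, and the approximation is only reasonable for samples that are not very small and not all identical.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) test, normal approximation.

    Returns (U statistic for x, z score, two-sided P value).
    Tied values receive the average of the ranks they span.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i                   # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j)
                                     if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p_value
```

Here `x` would hold the conservation scores of missense variants from cases and `y` those from controls; a positive z indicates that case scores tend to rank higher.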

Discussion

To date, 237 different IRF6 mutations have been published, not including numerous deletions of IRF6. Here, we report an additional 63 pathogenic variants. In 2009, de Lima et al.9 described the distribution of IRF6 mutations as nonrandom, with more mutations in exons 3, 4, 7, and 9. Not surprisingly, de Lima et al.9 also reported more missense mutations in the DNA-binding and protein-binding domains, encoded by exons 3–4 and 7–9, respectively. We revisited this result with a larger sample and found that mutations causing VWS and PPS are overrepresented only in exon 4, encoding part of the DBD. Several differences in our analysis account for this disparity. First, we normalized frequencies to 100 bp to account for variability in exon size. Exon 7 is the largest at 393 bp; it therefore follows that more opportunities for mutation would result in more variants in this exon. Second, we excluded mutations at hypermutable CpG dinucleotides from the analysis. The increase in mutations in exons 7 and 9 observed by de Lima et al.9 can be attributed to the hotspots R250X (exon 7), R400W (exon 7), and R412X (exon 9). By excluding these mutations, which can be attributed to a specific, high-frequency mechanism, we can look for patterns in distribution of the remaining mutations to gain insight into amino acid residue function for IRF6 and the pathogenesis of VWS and PPS.

Previous work by de Lima et al.9 described five mutational hotspots (R6, R84, R250, R400, and R412) and attributed the high-mutation frequencies to the CpG dinucleotides in these codons. We systematically analyzed the 24 CpG dinucleotides in the coding exons of IRF6 and found that not all of these CpG dinucleotides are mutated. From this data set, we cannot determine why some of these “hotspots” are mutated and others are not, or why some mutations occur more frequently. For example, R84C and R84H have been identified in 70 families with VWS or PPS, but V433I has not been reported. Mutations at R84 are highly associated with PPS; therefore, some ascertainment bias should be taken into consideration regarding the preponderance of R84 mutations in this data set. Although V433I may not cause a phenotype, V433I was predicted to be damaging and has not been found in 500 families with VWS/PPS or in 6,500 controls. It is possible that V433I causes a particular phenotype that has yet to be observed or described. Similarly, it is also possible that this amino acid is critical to the function of IRF6 and mutation of this residue is lethal. Finally, it could be that not all of the CpG dinucleotides are methylated, and these nucleotides are no more mutable than any other nucleotide, possibly explaining its absence in more than 7,000 samples. Although the methylation status of the IRF6 promoter has been studied in human squamous cell carcinoma cells,28 it has not yet been empirically determined within the gene itself or directly from tissues obtained during craniofacial development.

We saw an increased frequency of mutations in the DBD. It has been hypothesized that VWS is caused primarily by loss-of-function mutations and PPS is caused by dominant negative mutations9,29 because mutations affecting DNA binding (i.e., R84C, R84H) are highly associated with PPS, whereas truncation mutations are more commonly found in VWS. We found a similar result: 67% of the PPS families had the missense mutation R84C or R84H. However, some families with R84C mutations have VWS and some families with nonsense mutations have PPS,9 suggesting a more complicated mechanism. If missense mutations causing VWS were primarily protein destabilizing, we would expect them to be distributed evenly between the DBD and protein-binding domain. However, we observed a significant increase in mutations in the DBD, highlighting the importance of this domain. Specifically, this increase is centered in exon 4, where the majority of the residues contacting DNA are located. Exclusion of mutations at residues contacting DNA did not eliminate the enrichment of missense mutations in the DBD (data not shown). The remaining missense mutations in the DBD may be protein destabilizing, but may also prevent DNA binding in another way, or have some as-yet-unknown effect on IRF6 function.

Although our results suggest the DBD should be a focus of further biological investigation, mutation of the protein-binding domain is clearly important for the pathogenesis of VWS or PPS. Even though the exons that encode this domain are not enriched for mutations over what was expected, mutations at hotspots R250, R400, and R412 in these exons account for a significant portion (12%) of the mutations in families with VWS/PPS. Therefore, for PCR amplification–based sequencing for mutation detection, we recommend continuing the tiered approach proposed by de Lima et al.9

Some of the criteria proposed for the classification of pathogenic variants include amino acid conservation, in silico prediction (e.g., Polyphen2 and SIFT), and presence or absence in databases such as dbSNP.30 Polyphen2 and SIFT are two of a host of programs in use for mutation interpretation;27 these programs, although popular, can have low sensitivity and specificity.31 Here, we show that for missense variants in IRF6, Polyphen2 and SIFT predictions were not in perfect agreement. In some cases, such as the mutation L22P, previously shown to abrogate DNA binding in vitro,29 the predictions contradict the experimental evidence. However, only a handful of mutations in IRF6 have been functionally tested; it is therefore impossible to know the true sensitivity and specificity of these in silico prediction programs in this case. In the current era of whole-exome and whole-genome sequencing, it has become important to be cautious in labeling a genetic variant pathogenic, particularly in novel genes. In the case of VWS or PPS, in which it is clear that IRF6 mutations play a role, it is still beneficial to use caution when using in silico programs. Similarly, although we show a significant difference in amino acid conservation between the variants causing VWS or PPS and those found in controls, conservation alone is not sufficient to determine pathogenicity of a variant.

IRF6 variants were found in just 0.5% of controls, suggesting that mutation of IRF6 is uncommon and variants identified in patients are likely to be truly disease causing. Furthermore, when carrier testing and/or prenatal testing are considered, families can have high confidence in the results. However, we identified in controls two variants (R45Q and D354N) that were previously reported as disease causing. Given the frequency with which we found D354N in controls and its relatively low conservation, it may be a rare polymorphism. Coincidentally, this variant is consistent with a deamination mechanism at this CpG site. However, R45Q is highly conserved and segregated in a family with VWS.26 Samples from the 1kGP are anonymous and have no associated phenotype data, although they are presumed healthy. VWS is a rare disorder, with an estimated frequency of 1/35,000; it is therefore unlikely that one of these controls has VWS. However, VWS is a syndrome with marked variable expressivity in which 44% of affected individuals only have lower lip pits,6 which could easily be overlooked without careful examination.

A clear limitation of using the 1kGP and ESP5400 cohorts as controls is the lack of individual phenotypes or family history. As noted above, given the frequency and penetrance of VWS and PPS, it is unlikely that one of these controls would have a mutation and also be unaffected. However, for other disorders, with higher frequency and/or lower penetrance, even more caution must be used in interpreting variants from the 1kGP and ESP5400 cohorts. Another limitation to this study is the racial and ethnic heterogeneity of the VWS and PPS cases and the control cohorts. Mutations causing VWS or PPS have been reported in geographically diverse populations, but previous work found no difference in the distribution of IRF6 mutations.9 Although the 1kGP has sampled populations worldwide, the majority of the controls used in this study come from the ESP5400, which consists of European and African-American populations.

The fact that we identified potentially disease-causing alleles in the samples from the 1kGP and the ESP5400 cohorts is not surprising. Most recently, it was estimated that these individuals carry at least 100 loss-of-function variants.32 Not all mutations in IRF6 cause VWS or PPS, IRF6 mutations are found in only 70% of individuals with VWS/PPS,9 and there are some cases of nonpenetrance.21 Therefore, it is important to have a large reference panel of normal controls before families make decisions on prenatal diagnosis or nonpenetrant carrier status based on finding an amino acid variant that has not been previously reported for VWS or PPS. It is also possible that there are other phenotypes resulting from mutation of IRF6; this is plausible given that Irf6 is expressed in a variety of embryonic and adult tissues, including the placenta, liver, and lung.7 Large-scale whole-exome sequencing may be able to identify mutations responsible for the remaining 30% of VWS/PPS cases and, when applied clinically, may uncover additional phenotypes resulting from IRF6 mutation. Eventually, it will be possible to define amino acid residues that are and are not critical for IRF6 function.

By analyzing the prevalence and distribution of IRF6 variants in individuals with VWS or PPS and controls, we have demonstrated that mutation of IRF6 occurs infrequently. When mutation does occur, particularly in conserved domains, it is likely to result in VWS or PPS. Therefore, we can say that IRF6 does not tolerate a high mutational burden. Further studies of other genes and disorders will be of great importance for interpreting the variants coming from clinical whole-exome or whole-genome sequencing.

Disclosure

S.B. and J.C. are employees of GeneDx. E.J.L., J.S., B.C.S., and J.C.M. declare no conflict of interest.