Introduction

Cleft lip and/or palate (CLP) phenotypes are among the most frequent birth defects, occurring at rates of 1/500–1/2500 births1. A proportion of cases present with syndromic disease (CLP in addition to a spectrum of additional phenotypes), mostly caused by rare mutations in single genes that often show Mendelian patterns of inheritance. However, up to 70% of cases show phenotypes lacking any additional cognitive or craniofacial abnormalities, referred to as nonsyndromic cleft lip and/or palate (NSCLP). Such phenotypes are regarded as genetically complex, arising through the interplay of numerous genetic and environmental factors. Increased understanding of the underlying aetiology of NSCLP phenotypes (both genetic and environmental) is needed to ultimately develop strategies for prevention and improve treatment and prognosis. NSCLP has a significant genetic basis: for example, the first degree relatives of affected individuals have a 30–40 fold elevated risk, and phenotype concordance for monozygotic (MZ) twins is 40–60%, compared to 5% for dizygotic twins1. Genetic studies, including linkage analysis, genome-wide association (GWAS) and GWAS-based meta-analysis, have yielded reproducible evidence for the involvement of several genes and gene regions. Collins et al.2 listed 16 genes and gene regions which have been firmly implicated in NSCLP through linkage and association analysis. Several of these are broad regions where the underlying causal variant(s) have yet to be pinpointed; however, polymorphisms in genes such as IRF6 are strongly associated with NSCLP3 and more minor roles have been established for MSX14,5, PVRL1, FGFR2, PAX7, NOG and SPRY2 among others6.

Exome sequencing presents opportunities to identify rare coding variation that may contribute to risk of NSCLP phenotypes. If NSCLP is entirely multifactorial, the contribution of rarer variants may be largely polygenic and mediated by numerous variants of very small individual effect. In this case, causal genes may only be detectable through the analysis of large numbers of patients using, for example, burden tests7. However, there is growing evidence for involvement of rare variants of larger effect in NSCLP including, for example, truncating mutations in the ARHGAP29 gene8 and mutations in the IRF6 gene, which is also known to contain mutations involved in malformation syndromes that include CLP, such as Van der Woude syndrome9. We consider here a number of NSCLP families with multiple affected individuals and undertake exome sequencing to investigate the contribution of rare variants in genes previously associated with any form of clefting phenotype.

Materials and Methods

Exome sequences of twelve individuals from seven multi-case families (CL1-CL7) with NSCLP phenotypes were obtained. All experimental protocols were approved by the Research Ethics Committee at the Universidad de La Sabana, Bogota; informed consent was obtained for all participants and research was conducted in accordance with the Declaration of Helsinki. Families included between two and six individuals with isolated NSCLP (Fig. 1). Most individuals have unilateral CLP but several individuals have the more severe bilateral phenotype.

Figure 1

Pedigrees of families analysed.

+ symbol indicates that the individual has been exome sequenced (sequenced cases: two families with one family member; two families with parent and offspring; two families with sib pair; one family with avuncular pair).

DNA samples were extracted from blood collected at Operation Smile, Bogota, Colombia and exomes were captured using the Agilent SureSelect v5 (51 Mb) kit and sequenced on a HiSeq 2000. Read depth coverage statistics for all 12 exome sequences are given in Supplementary Table 1 and indicate ~85–97% coverage of exon targets at >20 fold depth across all samples. Orthogonal genotyping was performed for a panel of 24 SNPs to validate sample identity after processing10.

To understand the spectrum of potentially damaging variation, we considered the list of 865 genes previously implicated in any form of CLP phenotype presented by Pengelly et al.11 (Supplementary Table 2). Examining rare variation in genes in this comprehensive list enables evaluation of whether known CLP genes contain variation which may underlie more familial forms of NSCLP. Furthermore, because each exome contains a very large number of putatively damaging variants including those completely unrelated to the clefting phenotypes (including potential incidental findings), this strategy focussing only on genes previously implicated in any form of clefting is a practical route to identifying causal variation in these families. The list is derived in part (363 genes out of the 865) from the professional Human Gene Mutation Database12, using search terms related to clefts and clefting syndromes. The remaining genes in the list were included after corresponding interrogation of OMIM13 and a small number of additional CLP-related genes from the review by Collins et al.2.

We filtered the lists of variants (Fig. 2) found in the exome sequences to identify all non-synonymous (NS), stopgain, stoploss, splicing and indel variants in genes from this list. Following Pengelly et al.11, for NS variants we used the scaled predictive scores from dbNSFP v214 and considered only variants classed as deleterious or damaging by at least one of the following predictive metrics: PhyloP, SIFT, PolyPhen-2, LRT, MutationTaster and GERP++. Grantham scores were also assigned to all NS substitutions. All variants were annotated with the minor allele frequency (MAF) from the ExAC database15 and with combined CADD and Logit scores for deleteriousness, along with a combined overall rank developed from PhyloP, GERP++, CADD and Logit scores. This rank is based on the summed ranks across all four scores, such that a variant with overall rank 1 is predicted as most deleterious. For intronic variants within 10 bp of the exon we utilised MaxEntScan, which quantifies deviation from the expected splicing consensus sequence motif, to evaluate the likelihood of the variant affecting splicing, using a cutoff of a differential score of 316.
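The summed-rank prioritisation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`phylop`, `gerp`, `cadd`, `logit`), the assumption that a higher value means more deleterious for every score, and the average-rank handling of ties are all illustrative assumptions, and the example values in the usage note are invented.

```python
from typing import Dict, List, Sequence

# Hypothetical field names for the four scores combined into the overall rank.
SCORES = ("phylop", "gerp", "cadd", "logit")


def rank_descending(values: Sequence[float]) -> List[float]:
    """1-based ranks with rank 1 for the highest value.

    Tied values share the average of the ranks they span (an assumption;
    the source does not state how ties were handled).
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


def combined_rank(variants: List[Dict]) -> Dict[str, int]:
    """Sum the per-score ranks and return an overall rank per variant id.

    Overall rank 1 is the variant with the smallest rank sum, i.e. the one
    predicted as most deleterious across the four scores.
    """
    per_score = {s: rank_descending([v[s] for v in variants]) for s in SCORES}
    summed = [sum(per_score[s][i] for s in SCORES) for i in range(len(variants))]
    order = sorted(range(len(variants)), key=lambda i: summed[i])
    return {variants[i]["id"]: position for position, i in enumerate(order, start=1)}
```

Calling `combined_rank` on a list of dictionaries, each holding an `id` and the four scores, returns a mapping from variant id to overall rank. Because the four scores are correlated, the summed rank is best read as a prioritisation aid rather than an independent measure of pathogenicity.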

Figure 2

Variant filtering process.

Variants identified in patients were filtered as described in methods. Variant attrition at each step is shown here, with variants remaining after sequential filtering detailed in square brackets.

We excluded variants found in homopolymer/repeat regions that can arise through misalignment between the sequenced reads and reference sequence. Any variants with read depth of <10 or in genes considered to be ‘highly mutable’17 were removed from further consideration. We included all variants not previously listed in the following databases: dbSNP 13518, 1000 Genomes19, the exome variant server20 and our in-house database of ~300 exomes, but did not exclude variants present solely at low frequency in the ExAC database15. In Tables 1 & 2 we included only variants found in all exome-sequenced, affected family members but not shared by more than one family; this was to exclude variants potentially common to the region but not captured in the population resequencing projects. Because samples were not available for all family members, it was not possible to confirm segregation of putatively causal variants for all affected individuals. All variants presented in text were manually visualised using IGV21 to evaluate genotype quality in the raw alignment files; no features consistent with errors were present, yielding high-confidence genotype calls. The full list of rare (<1% in 1000 Genomes) NS variants classed as damaging by at least one predictive score, together with potentially damaging splicing variants, is given in Supplementary Table 3. Whole-exome genotype calls are provided in Supplementary File 4.
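The family-level sharing rule described above (present in every exome-sequenced affected member of a family, absent from all other families) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the pipeline used in the study: the nested-dictionary input layout, the function names, and the choice to apply the read-depth threshold to each sample's call separately are all hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set, Tuple

# Assumed layout: family -> sample -> variant key -> (gene symbol, read depth)
Calls = Dict[str, Tuple[str, int]]
Families = Dict[str, Dict[str, Calls]]


def shared_within_family(samples: Dict[str, Calls], min_depth: int,
                         excluded_genes: FrozenSet[str]) -> Set[str]:
    """Variants passing the depth and gene filters in every sequenced member."""
    kept = None
    for calls in samples.values():
        passing = {key for key, (gene, depth) in calls.items()
                   if depth >= min_depth and gene not in excluded_genes}
        kept = passing if kept is None else kept & passing
    return kept or set()


def family_private_variants(families: Families, min_depth: int = 10,
                            excluded_genes: FrozenSet[str] = frozenset()
                            ) -> Dict[str, Set[str]]:
    """Keep variants shared within a family but seen in only that one family."""
    shared = {family: shared_within_family(samples, min_depth, excluded_genes)
              for family, samples in families.items()}
    # Count in how many families each within-family-shared variant appears.
    family_counts = Counter(key for keys in shared.values() for key in keys)
    return {family: {key for key in keys if family_counts[key] == 1}
            for family, keys in shared.items()}
```

The default `min_depth=10` mirrors the read-depth threshold stated in the text, and `excluded_genes` stands in for the 'highly mutable' gene list. For a family with a single sequenced member, the within-family intersection reduces to that member's passing variants, so the between-family exclusion does most of the filtering.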

Table 1 Protein truncating, splicing and indel variants observed in single families.
Table 2 Non-synonymous variants observed in single families.

Results

Table 1 shows likely protein truncating and indel variants in these seven families, with Table 2 listing 28 missense variants. For a given family, only variants found in all the exome-sequenced family members (Fig. 1) and classed as deleterious by at least one predictive score are given. Table 2 entries are ordered using combined ranks from most to least deleterious by predictive score11. Four of the genes listed in Table 2 (WNT7A, MSX1, CLPTM1 and EVC2, ranked 9, 10, 11 and 23 respectively) have been previously identified as containing variants implicated in NSCLP phenotypes. Family CL1 has the 9th ranked variant, in the WNT7A gene. Members of the WNT gene family have previously been associated with NSCLP phenotypes22,23,24. Specifically, a number of WNT signalling pathway genes including WNT3A, WNT5A, WNT9B and WNT11 have been established as candidates22 and mouse expression studies have shown roles for WNT genes in mid-facial formation and lip and palate development25.

The 10th ranked variant, found in family CL4, is in the MSX1 gene; it is considered damaging by SIFT, PolyPhen-2 and MutationTaster and has high GERP++ and CADD scores. Variants in this gene have been strongly implicated in NSCLP in several studies. Jezewski et al.26 found mutations in 2% of CLP cases and indicated that this has genetic counselling implications where autosomal dominant inheritance patterns are found. Exon 2 of MSX1, in which the p.P260T variant is located, has been found to be highly conserved, with significantly fewer sequence variants compared with exon 1 of this small (two exon) gene26. Functional validation of MSX1 as a candidate is established through a cleft palate and foreshortened maxilla phenotype in knockout mice27. A number of association studies have also indicated involvement of MSX1 in NSCLP4,28,29,30,31. In a study of 94 patients and 93 controls from Operation Smile, Colombia, four MSX1 microsatellite alleles were analysed and an increased risk of CLP was observed with CA polymorphisms in the gene32. An autosomal dominant MSX1 mutation in a family with clefting and tooth agenesis showed a familial pattern of segregating MSX1 mutations5. Diverse evidence establishes that MSX1 promotes growth and inhibits differentiation. Mutations in MSX1 can cause primary or secondary facial clefting in mouse models26.

The 11th ranked variant (from family CL1) is in the CLPTM1 gene (cleft lip and palate-associated transmembrane protein 1), which is situated at 19q13.3. A balanced translocation in this region was found in a multi-case CLP family33 and this region is implicated in NSCLP by linkage and transmission disequilibrium test association studies34. However, a de novo deletion of 0.8 Mb in this region associated with CLP, but not encompassing CLPTM1, has been reported35. As Kohli and Kohli36 indicate, the role of CLPTM1 or other genes in this locus is uncertain.

The 23rd ranked variant is in the EVC2 gene (family CL2) and belongs to the same two megabase chromosomal region as MSX1 (4p16). Ingersoll et al.37 found linkage and association signals in genes in this region. They found suggestive evidence for linkage and association amongst cleft palate trios to EVC2. Mutations in EVC2 can lead to Weyers acrofacial dysostosis38, not usually associated with oral clefts but cases with subtle CLP phenotypes and tooth anomalies have been reported37.

Discussion

Linkage, candidate gene association and genome-wide association (GWAS) have been applied to investigate numerous multifactorial diseases, including NSCLP. As a result of these studies more than 11 genes and gene regions are now known or likely to have an etiologic role in NSCLP39. However, there is increasing evidence that NSCLP is a heterogeneous condition comprising a substantial multifactorial component along with a much smaller proportion of cases showing more Mendelian patterns of inheritance. The Gajdos et al.40 segregation analysis indicated that the complex familial patterns observed in NSCLP are best explained as a mixture of monogenic cases, probably dominantly inherited, combined with others which have a multifactorial aetiology. The conclusions favour analyses of multiple-case pedigrees to reduce heterogeneity and help identify Mendelian sub-forms. Stanier and Moore41 identified significant overlaps between genes underlying syndromic and nonsyndromic forms of CLP, recognising that several genes implicated in syndromic disease, including TBX22, PVRL1, IRF6, P63 and MSX1, can also contribute to ~10% of NSCLP. Scapoli et al.42 point out that the autosomal dominant Van der Woude syndrome (VWS) is only phenotypically distinguished from NSCLP by lower-lip pits and hypodontia, which are only variably present in VWS affected individuals. Mutations in the IRF6 gene, which cause VWS, have been firmly implicated in some NSCLP cases3, supporting heterogeneity within the NSCLP clinical designation. Furthermore, Kerameddin et al.43 found that a tag SNP (rs642961) in IRF6 was associated with the most severe complete bilateral NSCLP phenotype. This suggests multi-case families with bilateral clefts are the most likely to be segregating single gene mutations. This strategy is supported by Vieira et al.44, who indicate that point mutations in several genes contribute to ~6% of NSCLP and these are enriched in cases with bilateral clefting.

In Table 2, we identify a coding variant in the MSX1 gene shared by affected family members in CL4, in which the proband has a bilateral CLP phenotype. Direct sequencing of coding regions has shown rare mutations in MSX1 may account for ~2% of NSCLP. The identified MSX1 variant is present at low frequency in the ExAC database (Table 2). ExAC contains >60,000 exomes from various disease-specific and population genetic studies (http://exac.broadinstitute.org/). Functional studies and analyses of larger cohorts of multi-case NSCLP families are required to establish a possible role for this and other rare variants identified in NSCLP phenotypes. Variants identified here also include candidates in the WNT7A (family CL1), CLPTM1 (family CL1) and EVC2 (family CL2) genes, which should be considered as targets for analysis in additional families.

For investigations aiming to resolve the genetic factors underlying NSCLP in multiple case families, exome sequencing presents a relatively cost-effective approach in which sequencing a small number of affected family members can identify candidate underlying genetic variation. NSCLP provides a particular challenge for genetic studies, with incomplete penetrance and environmental factors hindering the identification of aetiological variance2,39. We have aimed to minimise this effect by careful selection of pedigrees exhibiting clefting in multiple individuals, where we would expect a stronger genetic component. Filtering power would be increased by the inclusion of further members of the pedigrees, however this has not been viable due to the isolated geographic locations for many individuals.

Exome sequencing yields thousands of variants per individual and identification of candidate variants can only be achieved following extensive filtering. We have undertaken filtering to identify variants predicted as damaging by restricting analysis to a list of 865 genes which have been previously associated with any condition involving CLP. Such an approach risks missing causal variants in novel genes not previously linked to NSCLP, but facilitates practicable data interpretation by virtue of the greater prior probability that variants in these genes are associated with NSCLP. The composite score-based rank using PhyloP, GERP++, CADD and Logit (Table 2) has been used successfully to prioritise variants involved in syndromic CLP11. These four scores are closely correlated, although the composite measures are not independent in every case. Further improvements in predictive tools, recognition of more disease variants and understanding of disease pathways will enable future improvements in interpretation of these complex data sets.

Whilst predictive tools are essential for the prioritisation of variants discovered in next generation sequencing (NGS) studies, ultimately functional validation of the effects of variants on protein function is required to confirm their impact. Given the volume of potentially pathogenic variants being identified in NGS studies, routine functional validation is infeasible. In silico protein modelling approaches may also be used to improve throughput, however these require the prior determination of protein structure, which has not been reported in the majority of genes discussed herein. Overall, it is clear that functional validation is a significant bottleneck in NGS studies and one not readily assuaged.

The limitations of exome sequencing include lack of coverage outside gene coding regions thereby excluding regulatory variants, which may influence risk. Technical limitations include poor coverage of some coding regions thereby missing potential causal variants. Whole genome sequencing offers a solution to these coverage issues, but at higher cost and considerably increased analytical complexity. Given the extent of the missing heritability in CLP, it is likely non-coding regions of the genome play a significant role; whole genome sequencing may therefore provide a valuable tool as sequencing costs continue to drop.

In this study we have limited our analyses to 865 genes with a known/suspected involvement in CLP phenotypes. Whilst this will prevent us from identifying novel aetiological genes, 7 families would be underpowered to identify novel causal genes reliably. Large cohort studies are required in order to identify novel CLP genes; to this end we have made our WES data available in Supplementary File 4 for the use of other researchers.

In conclusion, we have undertaken exome analysis in seven Colombian families with NSCLP phenotypes. We find a deleterious variant in the MSX1 gene in family CL4 which is a strong candidate for causality. Deleterious variants in at least three additional genes may be implicated in NSCLP phenotypes in some of the other families. Although NSCLP is primarily a complex multifactorial phenotype, our study adds to the growing body of evidence that Mendelian sub-forms exist and these are best studied in multi-case families particularly where there are more severe phenotypic features such as bilateral clefting.

Additional Information

How to cite this article: Pengelly, R. J. et al. Deleterious coding variants in multi-case families with non-syndromic cleft lip and/or palate phenotypes. Sci. Rep. 6, 30457; doi: 10.1038/srep30457 (2016).