Introduction

Esophageal atresia/tracheoesophageal fistula (EA/TEF) is a rare, complex congenital aerodigestive anomaly with an estimated incidence of 1 in 2500 to 1 in 4000 live births [1, 2]. Almost half of affected infants have associated congenital malformations of other organ systems, most commonly cardiovascular, digestive [1], urogenital, and musculoskeletal [3]. These defects have been observed together as the vertebral defects, anal atresia, cardiac defects, tracheoesophageal fistula, renal anomalies, and limb abnormalities (VACTERL) association [4]. While there have been rare reports of variants in FOXF1 and ZIC3 in patients with VACTERL association [5], the molecular etiology of the majority of VACTERL cases remains unknown. Chromosome anomalies, including aneuploidies and microdeletions, are observed in 6–10% of patients with nonisolated EA/TEF [3, 5]. These anomalies include trisomy 13, 18, and 21, monosomy X [6], and several copy number variants (CNVs). Several monogenic causes of syndromes that include EA/TEF have also been elucidated, including variants in MYCN, SOX2, CHD7, and MID1. Monogenic causes account for only about 5% of EA/TEF cases and are mostly de novo (with the exception of variants in recessive Fanconi anemia-related genes) [5,6,7].

SOX2 has been reported as an important gene for esophagus and anterior stomach development [8]. SOX2 is involved in Wnt signaling by binding β-catenin, a central mediator of the Wnt pathway [9]. Deletion of β-catenin, a downstream mediator of Wnt signaling, leads to lung agenesis and failure of the foregut to separate [10]. Heterozygous loss-of-function variants in EFTUD2 are associated with esophageal atresia and other developmental disorders such as mandibulofacial dysostosis with microcephaly [11,12,13]. EFTUD2 is required for pre-mRNA splicing as a component of the spliceosome [14, 15].

There have been few studies investigating the genetic causes of nonisolated EA/TEF, and it is still widely considered to have a multifactorial etiology. Small-scale twin studies, however, have shown a higher concordance rate in monozygotic twins (67%) than in dizygotic twins (42%), suggesting a genetic contribution [16, 17]. Animal studies have identified genes in several developmental pathways associated with tracheoesophageal anomalies, among them sonic hedgehog pathway genes. Murine models with homozygous deficiencies of SHH and GLI2 exhibit foregut anomalies including EA, TEF, and tracheoesophageal stenosis and hypoplasia [18]. Other genes implicated in foregut development in animal studies include the transcription factor Foxf1, vitamin A effectors (Rarα, Rarβ), homeobox-containing transcription factors and their regulators (Nkx2.1 [19], Hoxc4, Pcsk5), and developmental transcriptional regulators (Tbx4, Sox2) [3, 20].

EA/TEF is identified prenatally in about 50% of cases. When the diagnosis is suspected (usually from sonographic findings of polyhydramnios and a small stomach), prognostic clinical information about associated birth defects is commonly sought. Definitive prognostic information is usually limited unless a chromosomal anomaly is identified. In an effort to identify novel genetic causes of EA/TEF, we performed whole-exome sequencing (WES) of 45 individuals with EA/TEF and their biological parents, none of whom had a family history of EA/TEF. Our goal is to understand the genomic architecture of EA/TEF and to better characterize the syndromes and conditions associated with it. We designed this pilot study to assess whether genomic characterization of EA/TEF would provide more accurate prognostic information and help tailor therapy based on predicted phenotype. We plan to combine these data with those from other congenital malformations to provide a more comprehensive understanding of human development.

Methods

Subject recruitment

Patients with isolated and nonisolated EA/TEF were recruited from two medical centers: Columbia University Medical Center (CUMC) in New York, USA, and Cairo University General Hospital in Cairo, Egypt. Individuals were eligible for the study if they had been diagnosed with a known form of EA/TEF and had no family history of EA/TEF, based upon medical record review. All participants provided informed consent. The study was approved by the Columbia University institutional review board. Blood and/or saliva samples were obtained from the probands and both biological parents. A three-generation family history was taken at the time of enrollment, and clinical data were extracted from the medical records and obtained by patient and parental interview.

Exome sequencing

Exome sequencing was performed at Novogene Genome Sequencing Company (Chula Vista, CA). A total of 1.0 μg genomic DNA was used as input material. Sequencing libraries were generated using the Agilent SureSelect Human All Exon V6 kit (Agilent Technologies, CA, USA) following the manufacturer’s recommendations. Briefly, fragmentation was carried out using a hydrodynamic shearing system (Covaris, MA, USA) to generate 180–280 bp fragments. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities, and enzymes were removed. After adenylation of the 3′ ends of DNA fragments, adapter oligonucleotides were ligated. DNA fragments with ligated adapter molecules on both ends were selectively enriched in a PCR reaction. Captured libraries were enriched in a PCR reaction to add index tags to prepare for hybridization. Products were purified using the AMPure XP system (Beckman Coulter, Beverly, USA) and quantified using the Agilent high sensitivity DNA assay on the Agilent Bioanalyzer 2100 system. The qualified libraries were pooled according to effective concentration and expected data volume and sequenced on an Illumina HiSeq sequencer. Reads were paired-end, 150 bp in length.

Bioinformatics analysis and calling of de novo variants

We used GATK-recommended best practices for calling single nucleotide variants (SNVs) and short insertions and deletions (indels) from exome sequencing data. Specifically, we used BWA-mem [21] to align reads to the human reference genome (GRCh37), Picard Tools to mark PCR duplicates, GATK [22] HaplotypeCaller to call variants jointly from all sequenced samples, and GATK variant quality score recalibration (VQSR) to recalibrate variant quality. We applied multiple heuristic filtering rules to remove potential technical artifacts as previously described [23, 24]. Specifically, we only retained variants that met all the following criteria: GQ ≥ 30, FS ≤ 25, QD ≥ 2 (SNV), QD ≥ 1 (INDEL), ReadPosRankSum ≥ −3 (INDEL), read depth on alt allele ≥ 5, ratio of alt allele depth to total depth ≥ 0.1, VQSR tranche ≤ 99.80 (SNV), VQSR tranche ≤ 99.70 (INDEL), and mappability (based on 200 insert length) = 1.
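The retention criteria above can be expressed as a single predicate applied to each variant call. The following is a minimal sketch, not the pipeline used in the study: the `Call` record and its field names are hypothetical, the comparison directions follow the usual reading of these GATK annotations (higher GQ and QD are better, lower FS is better), and letting an indel with a missing ReadPosRankSum pass that particular check is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    """Hypothetical per-variant record; field names are illustrative only."""
    is_indel: bool
    gq: float                           # genotype quality
    fs: float                           # Fisher strand bias
    qd: float                           # quality by depth
    read_pos_rank_sum: Optional[float]  # may be absent for some calls
    alt_depth: int                      # reads supporting the alt allele
    total_depth: int                    # total reads at the site
    vqsr_tranche: float                 # VQSR tranche of the call
    mappability: float


def passes_filters(c: Call) -> bool:
    """Return True only if the call meets every heuristic criterion."""
    if c.total_depth <= 0:
        return False
    min_qd = 1.0 if c.is_indel else 2.0
    max_tranche = 99.70 if c.is_indel else 99.80
    checks = [
        c.gq >= 30,
        c.fs <= 25,
        c.qd >= min_qd,
        c.alt_depth >= 5,
        c.alt_depth / c.total_depth >= 0.1,
        c.vqsr_tranche <= max_tranche,
        c.mappability == 1,
    ]
    # ReadPosRankSum is only checked for indels; a missing value is
    # treated as passing here (an assumption, not stated in the text).
    if c.is_indel and c.read_pos_rank_sum is not None:
        checks.append(c.read_pos_rank_sum >= -3)
    return all(checks)
```

Writing the criteria as one list makes it easy to see that a variant is kept only when all conditions hold, and that the QD and VQSR thresholds are the only ones that differ between SNVs and indels.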

To call de novo variants, we applied a previously published procedure [23, 24] and used IGV [25] to visualize candidate de novo variants and remove potential artifacts. All nonsynonymous de novo variants were confirmed by Sanger sequencing. In addition, we used PLINK to infer population structure and kinship. We used xHMM [26] to infer large CNVs in order to rule out patients whose EA/TEF was potentially due to chromosomal anomalies.

Annotation and in silico prediction

We used ANNOVAR [27] to annotate variants with population frequencies (from the Exome Aggregation Consortium (ExAC) and the Genome Aggregation Database (gnomAD) [28]), protein-coding consequences, and multiple in silico predictions, including CADD [29] and REVEL [30].

Putative targets of EFTUD2 or SOX2

We obtained putative targets of EFTUD2 based on RNA binding protein (RBP) binding sites profiled by eCLIP in a HepG2 cell line from ENCODE [31] and processed using a recently published pipeline [32]. We selected the genes for which the peak count is equal to or greater than 2. We obtained target genes of transcription factor SOX2 based on ChIP data from glandular mouse stomach [33] curated by ChEA [34].
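The peak-count rule for EFTUD2 targets and the combination with SOX2 targets can be sketched as follows. This is an illustrative sketch only: it assumes each eCLIP peak has already been assigned to a gene symbol, the function and variable names are hypothetical, and it does not cover the mouse-to-human gene mapping needed for the SOX2 ChIP data.

```python
from collections import Counter
from typing import Iterable, Set


def select_targets(peak_genes: Iterable[str], min_peaks: int = 2) -> Set[str]:
    """Keep genes whose number of assigned peaks is at least min_peaks.

    peak_genes holds one gene symbol per peak, so a gene with three
    peaks appears three times.
    """
    counts = Counter(peak_genes)
    return {gene for gene, n in counts.items() if n >= min_peaks}


def combined_targets(eftud2_targets: Set[str], sox2_targets: Set[str]) -> Set[str]:
    """Union of the two target sets, i.e. targets of either regulator."""
    return eftud2_targets | sox2_targets


# Usage with placeholder gene names (not real data):
peaks = ["GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C", "GENE_C"]
eftud2 = select_targets(peaks)
either = combined_targets(eftud2, {"GENE_B", "GENE_D"})
```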

Statistical analysis

For de novo variants, we determined the overall burden of four variant types, namely synonymous, likely gene disrupting (LGD, i.e., stop gain, frameshift, and splice site), missense, and deleterious missense (D-mis, defined by REVEL ≥ 0.5 or CADD Phred score ≥ 25), in all genes and in constrained genes (defined by ExAC [28] pLI ≥ 0.5). We used a less stringent pLI threshold for defining constrained genes because it captures more known haploinsufficient genes [35]. We obtained estimated background mutation rates from previous publications, calibrated for exome sequencing data [36]. The expected number of variants in a gene set was calculated by summing the background mutation rates of the specific variant class over the genes in the set and multiplying by twice the number of cases. We then tested the burden of de novo variants in a gene set with a Poisson test, using the baseline expectation as the mean under the null model. To estimate the proportion of cases that can be attributed to de novo deleterious variants, we divided the difference between the observed and expected numbers of de novo deleterious variants by the number of cases [37].
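The burden calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, the per-gene rates are taken to be per haploid genome per generation for one variant class, and the test is assumed to be one-sided (probability of observing at least as many variants as were seen).

```python
import math
from typing import Iterable


def expected_count(rates: Iterable[float], n_cases: int) -> float:
    """Expected de novo count: summed per-gene rates times 2N."""
    return 2 * n_cases * sum(rates)


def poisson_upper_tail(observed: int, expected: float) -> float:
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Accumulate P(X <= observed - 1) term by term, then take the complement.
    term = math.exp(-expected)
    cdf = term
    for k in range(1, observed):
        term *= expected / k
        cdf += term
    return max(0.0, 1.0 - cdf)


def enrichment(observed: int, expected: float) -> float:
    """Ratio of observed to expected counts."""
    return observed / expected


def attributable_fraction(observed: int, expected: float, n_cases: int) -> float:
    """Excess variants per case: (observed - expected) / number of cases."""
    return (observed - expected) / n_cases
```

Taking the complement of the cumulative sum is adequate for moderate p values; for very small p values, summing the upper tail directly or using a statistics library would be more accurate numerically.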

Results

Exome sequencing data

A total of 45 individuals with EA/TEF were enrolled in the study. Probands were between 1.5 and 55.7 years of age, with a mean age of 10.2 years (Table 1). Thirteen probands had isolated EA/TEF, and 32 probands had neurodevelopmental delay and/or at least one additional congenital defect and were classified as nonisolated. Fourteen of the probands had congenital heart defects, 8 had neurodevelopmental delay, 4 had gastrointestinal defects, 12 had genitourinary defects (nonrenal), 8 had skeletal defects, 2 had craniofacial defects, and 2 had other defects. The majority of probands were of European ancestry (60%), and the remainder were of African-American (15%), Egyptian (15%), and Asian (10%) ancestry. None of the 45 probands reported a family history of EA/TEF.

Table 1 Patient characteristics of 45 patients with esophageal atresia.

Overall burden of de novo variants

We identified 57 de novo variants in 45 probands (Supplementary Table 1). We compared the overall burden of de novo variants in the 45 cases to expectations from a background mutation model [36]. We classified protein-coding variants into four groups: synonymous, missense, D-mis, and LGD. Overall, the frequency of synonymous variants in cases is close to the expectation from the background mutation rate (p value = 0.68, enrichment rate = 1.1×). There is a trend toward enrichment of missense variants (p value = 0.12, enrichment rate = 1.3×) and D-mis variants (p value = 0.06, enrichment rate = 1.6×) in cases compared to expectation (Table 2).

Table 2 Overall burden of de novo heterozygous variants.

Consistent with previous studies of other types of birth defects [24, 38, 39], the enrichment of D-mis variants is more pronounced (p value = 0.003, enrichment rate = 2.6×) in constrained genes that are intolerant of loss of function variants (ExAC pLI ≥ 0.5) (Table 2).

Most genes with deleterious de novo variants are putative targets of EFTUD2 or SOX2

One patient has a de novo frameshift deletion (c.2314del, p.(Gln772ArgfsTer21)) in EFTUD2 (elongation factor Tu GTP binding domain containing 2). The phenotype of the patient includes EA/TEF, bilateral clubfoot, hydrocele, atrial septal defect, and pyelectasis, which overlaps with features of the Guion-Almeida type of mandibulofacial dysostosis caused by heterozygous EFTUD2 variants [13]. De novo variants in EFTUD2 are known to be associated with EA [11, 12]. EFTUD2 encodes a component of the spliceosome complex, which regulates mRNA splicing, and may therefore act as a master regulator affecting the expression of thousands of genes. We hypothesized that genes regulated by EFTUD2 and other master regulators relevant to EA/TEF (such as SOX2 [8]) are more likely to be EA/TEF risk genes and therefore enriched with de novo variants. To test this, we obtained putative targets of EFTUD2 based on eCLIP data in a HepG2 cell line from ENCODE [31] and targets of SOX2 based on ChIP-seq data in mouse stomach [33]. There are 1629 and 4463 targets of SOX2 and EFTUD2, respectively, and the union of the targets comprises 5454 genes. Among 19 genes with D-mis de novo variants, 15 are targets of SOX2 or EFTUD2, substantially more than expected from the background (enrichment rate = 3.34×, p value = 6.6e−05). Overall, the burden indicates that 33% of EA/TEF cases are attributable to deleterious de novo variants in genes that are SOX2 or EFTUD2 targets.
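As a quick consistency check on the target-set sizes reported above, inclusion-exclusion gives the number of genes that are putative targets of both regulators. Here S and E are shorthand introduced for this check (the SOX2 and EFTUD2 target sets, respectively). Assuming the enrichment rate is the ratio of observed to expected counts (an assumption; the text does not define it explicitly), the reported rate also implies the approximate expected number of D-mis variants in target genes under the background model:

```latex
\begin{align*}
|S \cap E| &= |S| + |E| - |S \cup E| = 1629 + 4463 - 5454 = 638, \\
\text{expected} &\approx \frac{\text{observed}}{\text{enrichment rate}} = \frac{15}{3.34} \approx 4.5 .
\end{align*}
```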

Table 3 summarizes the associated clinical features and variants in candidate genes prioritized by intolerance to loss of function variants and by biological pathways implicated in developmental disorders. Seven genes, ADD1, APC2, GLS, SMAD6, RAB3GAP2, PTPN14, and EFTUD2, are OMIM genes associated with Mendelian diseases (Table 3). ITSN1 was recently discovered as a risk gene for autism spectrum disorder [40]. The ITSN1 variant carrier was only 18 months old at the time of enrollment, which is too young for a diagnosis of autism to be made.

Table 3 De novo heterozygous variants in candidate genes.

Discussion

In this pilot study, we report exome sequencing results on 45 proband-parent trios with isolated or nonisolated EA/TEF and no family history of EA/TEF. We identified 22 LGD or D-mis de novo variants. Consistent with previous studies of structural birth defects or developmental disorders, constrained genes are enriched with deleterious variants, likely due to a historical reduction of reproductive fitness by such predicted deleterious variants. The majority of the genes with deleterious de novo variants are putative targets of SOX2 or EFTUD2, two master regulators that are known to cause EA/TEF through haploinsufficiency, which may provide a biological mechanism for the etiology of some EA/TEF cases. Figure 1 shows genes with LGD or D-mis de novo variants and their relationships with EFTUD2 and SOX2. We did not identify any de novo variants in SOX2 in our small cohort. Given the overall high enrichment rate of 3.34, we expect that more than half of the SOX2 or EFTUD2 target genes with de novo predicted pathogenic variants are candidate EA/TEF risk genes [37, 41].

Fig. 1: Genes with LGD or D-mis de novo variants and their relationship with EFTUD2 and SOX2.

Each gene is represented by a circle. Arrows indicate putative TF-target or RBP-target relationships. We did not observe de novo variants in SOX2 (dashed circle) in our cohort. Genes are colored by biological pathways. Only the pathways with at least three genes with LGD or D-mis variants are shown.

Three genes, ADD1, GLS, and RAB3GAP2, are putative targets of both EFTUD2 and SOX2 [31, 33]. Notably, ITSN1, AP1G2, TECPR1, and RAB3GAP2 are involved in membrane trafficking or autophagy [42,43,44,45]. KLHL17, ADD1, CELSR2, PCDH1, and ITSN1 are involved in cytoskeleton organization or cell adhesion [42, 46, 47]. AMER3 and APC2 are both key regulators of Wnt signaling, a pathway known to be implicated in EA/TEF and other birth defects [48]. A few other genes, SMAD6, PTPN14, and PIK3C2G, are involved in signaling pathways that are critical during development [46, 49, 50].

Our current analysis is limited by the sources of the regulatory data: SOX2 ChIP-seq from stomach [33] and EFTUD2 eCLIP from a liver cancer cell line [31]. The availability of data from relevant tissues, e.g., ChIP-seq of SOX2 and eCLIP-seq of EFTUD2 in developing foregut, will enable more precise analysis of de novo and rare variants. In addition, gene expression data from developing esophagus and trachea, especially single-cell sequencing data, will allow us to refine the analysis and improve our ability to identify the most relevant EA/TEF genes.

Finally, it will be important to increase the sample size of future genomic studies in order to estimate the contribution of de novo variants to EA/TEF more precisely, to identify novel risk genes with high confidence, and to relate the genetic factors to clinical outcomes.