Introduction

Anorectal malformations (ARMs) are common birth defects occurring in approximately 1 in 5000 births [1]. The phenotypic presentation of ARMs ranges from mild (anteriorly placed anus) to severe (anal atresia). ARMs can occur as isolated defects, but often present with other structural anomalies (syndromic ARMs) such as those that make up VACTERL association (Vertebral, Anorectal, Cardiac, Tracheo-Esophageal, Renal, and Limb anomalies; MIM# 192350) [2]. Surgery is typically required within the first two years of life, and complete restoration of function is not always possible. Hence, individuals with ARMs may have long-term complications including bowel, urological, gynecological, and sexual difficulties.

The relatively high recurrence risk associated with ARMs suggests that genetic factors are likely to contribute to their development. Specifically, within a cohort of 327 individuals with ARMs, Dworschak et al. calculated a 1500-fold increase in recurrence risk for offspring of an affected parent and a 32-fold increase if a sibling was affected [3]. The existence of genetic syndromes in which ARMs are a common phenotype provides additional evidence of the importance of genetic factors in the development of these disorders. For example, ARMs are associated with chromosomal abnormalities and genomic disorders such as Down syndrome (MIM# 190685) and cat-eye syndrome (MIM# 115470), and with single gene disorders such as Townes-Brocks syndrome 1 (MIM# 107480) which is caused by pathogenic variants in SALL1. In a recent review, Khanna et al. also suggested the SHH, WNT, and FGF signaling pathways play a major role in the development of ARMs [4].

Despite progress in understanding the genetic etiology of ARMs, the underlying molecular cause of most cases cannot be identified. Exome sequencing (ES) is widely used to identify genetic changes in individuals with multiple congenital anomalies, especially in cases where a clinical diagnosis is not clear. ES has recently been used to identify a molecular diagnosis in individuals with syndromic ARMs [5,6,7]. However, ES is not always ordered on individuals with syndromic ARMs for whom other genetic tests have failed to identify a cause. This may be due, in part, to uncertainty regarding the efficacy of ES in individuals with syndromic ARMs.

In this study, we analyzed a clinical database of approximately 17,000 ES results to determine the diagnostic yield of ES in individuals with syndromic ARMs. We then used these data to identify eight phenotypic expansions involving ARMs.

Subjects and methods

Database analysis and clinical review

We searched for individuals whose test indications included “anal stenosis”, “anal atresia”, “imperforate anus”, “anterior anus” or similar descriptive entries in a database of ~17,000 individuals referred for ES to Baylor Genetics. Individuals with an indication of “anal fissure” and those who received a diagnosis on a molecular test other than ES were not included in this study.

Variants reported by Baylor Genetics to be related to clinical phenotypes listed in the indication for ES testing were reanalyzed and classified as pathogenic, likely pathogenic, or variants of uncertain significance (VUS) based on American College of Medical Genetics and Genomics (ACMG) standards for variant interpretation using the most current data available [8]. Each potential diagnosis was then designated as definitive, probable, or provisional based on previously published criteria set forth by Scott et al. [9]. These criteria take into account the ACMG classification of the variant(s), their inheritance pattern, variant configuration (cis vs. trans), the sex of the proband, and the overlap between the phenotypes listed in the indication and phenotypes previously shown to be associated with disorders caused by the affected gene.

Calculating diagnostic yields

The number of cases with a definitive or probable diagnosis was divided by the total number of syndromic ARM cases to determine the diagnostic yield. We repeated this process for individuals with syndromic ARMs who initially met criteria for VACTERL association by having at least three VACTERL component features [10]. Some have argued that individuals with neurodevelopmental phenotypes should not be classified as having VACTERL association [11]. We have chosen to include these individuals in our VACTERL sub-cohort since ARMs, and other phenotypes associated with VACTERL association, are typically identified in the neonatal period before neurodevelopmental phenotypes such as developmental delay, intellectual disability and autism spectrum disorder become apparent. This is also the time point at which genetic testing is most likely to be initiated.
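
The sub-cohort rule described above (at least three VACTERL component features) can be sketched as a simple set operation. This is only an illustration: the feature labels and the function name `meets_vacterl_criteria` are hypothetical and do not reflect how indications were actually coded in the clinical database.

```python
# Hypothetical labels for the six VACTERL component features.
VACTERL_COMPONENTS = frozenset({
    "vertebral", "anorectal", "cardiac",
    "tracheo-esophageal", "renal", "limb",
})


def meets_vacterl_criteria(features, minimum=3):
    """Return True if at least `minimum` distinct VACTERL components are present."""
    present = VACTERL_COMPONENTS.intersection(features)
    return len(present) >= minimum


# Neurodevelopmental phenotypes neither count toward nor exclude the criteria.
print(meets_vacterl_criteria({"anorectal", "cardiac", "renal"}))
print(meets_vacterl_criteria({"anorectal", "cardiac", "developmental delay"}))
```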

Literature and database searches

To identify additional cases of syndromic ARMs associated with various genes/genetic disorders, we performed literature searches to identify reports in which a gene’s symbol, or the name of its associated genetic disorder(s), was found in association with key words such as “anal”, “anus”, “anorectal”, “anorectal malformations”, “anal stenosis”, “anal atresia”, and/or “imperforate anus”.

Machine learning

We have previously developed a machine learning algorithm that integrates knowledge from genome-scale data sources including Gene Ontology (GO), the Mouse Genome Database (MGI), the Protein Interaction Network Analysis (PINA) platform, the GeneAtlas expression distribution, and transcription factor binding and epigenetic histone modifications data from the NIH Roadmap Epigenomics Mapping Consortium to rank genes based on their similarity to a set of training genes known to cause a phenotype of interest [9, 12, 13].

To generate ARM-specific pathogenicity scores for all RefSeq genes, we trained this machine learning algorithm with a set of 26 manually-curated genes that are known to cause ARMs in humans or are the human homologs of genes known to cause ARMs in mice: CDC45L, CDX1, CRIM1, DACT1, DCHS1, EFNB2, FAM58A (CCNQ), FREM1, GLI3, INTU, KMT2D, MED12, MID1, MINK1, NOG, PCSK5, PITX2, RECQL4, RIPK4, SALL1, SALL4, SHH, SPECC1L, TBX3, WNT5A, ZIC3 [4, 14].

Cross validation can be used to demonstrate the performance of a machine learning procedure. In these analyses, a subset of the training genes is used to fit the machine learning procedure, which is then used to evaluate the genes that have been excluded or “left-out”. This approach enables an objective calculation to characterize the computational procedure using a known input training set while avoiding a circular evaluation that conflates the fitting procedure with performance testing [12, 13].

In our cross-validation analysis, the full set of training genes was randomly broken into two subsets of equal size. The machine learning procedure was trained on each respective set and evaluated on the excluded subset. For each cross validated instance, a genome wide evaluation of all genes was performed including the excluded subset of training genes. The percentiles of the excluded genes were then recorded to assess performance. The procedure was repeated, reciprocally, so that all training genes received cross-validated scores.
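
The reciprocal two-fold procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm: `similarity_scores` is a toy stand-in that scores each gene by the closeness of a single numeric feature to the training genes, whereas the actual method integrates several genome-scale knowledge sources. All gene names and feature values are hypothetical.

```python
import random


def similarity_scores(features, training_genes):
    """Toy scorer (assumption): higher score = feature closer to the training mean."""
    center = sum(features[g] for g in training_genes) / len(training_genes)
    return {gene: -abs(value - center) for gene, value in features.items()}


def percentile(scores, gene):
    """Percentage of all other genes scoring strictly lower than `gene`."""
    others = [s for g, s in scores.items() if g != gene]
    return 100.0 * sum(s < scores[gene] for s in others) / len(others)


def two_fold_cross_validation(features, training_genes, seed=0):
    """Split training genes into two halves; fit on one, evaluate the other, then swap."""
    genes = list(training_genes)
    random.Random(seed).shuffle(genes)
    half = len(genes) // 2
    folds = [genes[:half], genes[half:]]
    held_out = {}
    for fit_fold, test_fold in (folds, folds[::-1]):
        # Genome-wide evaluation, including the excluded training genes.
        scores = similarity_scores(features, fit_fold)
        for gene in test_fold:
            held_out[gene] = percentile(scores, gene)
    return held_out


# Hypothetical data: 100 background genes plus 4 clustered training genes.
features = {f"BG{i}": i / 100 for i in range(100)}
training = {"TRAIN1": 0.90, "TRAIN2": 0.91, "TRAIN3": 0.92, "TRAIN4": 0.93}
features.update(training)
cv_percentiles = two_fold_cross_validation(features, training)
```

Because each training gene is scored only by a model that never saw it, the recorded percentiles are not inflated by the fitting step.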

These scores were then plotted to characterize the performance of the procedure by tabulating the fraction of training set genes with score percentiles exceeding each cutoff. This forms receiver operating characteristic (ROC)-style curves in which the effectiveness of the procedure corresponds to the area under the curve and above the diagonal line, which represents the result that would be generated by chance alone. These studies generated positive ROC curves based on data from each knowledge source, and from the average of the scores across all knowledge sources (Fig. 1A). This demonstrated the ability of our scoring procedure to identify ARMs training genes more efficiently than random chance.
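
One way to tabulate such a curve from cross-validated percentiles is sketched below. The function names, the cutoff grid, and the use of "meets or exceeds" at each cutoff are assumptions made for illustration; the axis conventions of the published figure may differ. Under chance, percentiles would be uniformly distributed, so the expected area under the curve is 0.5.

```python
def roc_style_curve(held_out_percentiles, step=1):
    """Fraction of held-out training genes whose percentile meets or exceeds each cutoff."""
    cutoffs = list(range(0, 101, step))
    n = len(held_out_percentiles)
    fractions = [sum(p >= c for p in held_out_percentiles) / n for c in cutoffs]
    return cutoffs, fractions


def area_above_chance(cutoffs, fractions):
    """Trapezoidal area under the curve minus the 0.5 expected by chance alone."""
    area = 0.0
    for i in range(1, len(cutoffs)):
        width = (cutoffs[i] - cutoffs[i - 1]) / 100.0
        area += width * (fractions[i] + fractions[i - 1]) / 2.0
    return area - 0.5


# Hypothetical percentiles for illustration only.
cutoffs, fractions = roc_style_curve([98.0, 95.5, 87.0, 60.0])
excess = area_above_chance(cutoffs, fractions)
```

A positive `excess` corresponds to a curve lying above the diagonal, that is, to training genes being ranked higher than would be expected by chance.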

Fig. 1: Machine learning allows all RefSeq genes to be ranked based on their similarity to genes known to cause ARMs.

A Receiver operating characteristic (ROC) curves were generated in validation studies of our machine-learning scoring approach. In this figure, colored ROC curves were generated using data from a single knowledge source, and the black ROC curve represents an omnibus score generated using the average score of all knowledge sources. The positive area underneath each curve indicates that our scoring approach identified training set genes known to cause ARMs more efficiently than random chance (diagonal dashed line). B After validation, ARM-specific pathogenicity scores were calculated for all RefSeq genes. Box plots were generated based on the ARM-specific pathogenicity scores of (1) training set genes, (2) genes for which there is sufficient evidence to support a phenotype expansion involving ARMs (Table 1), and (3) genes for which there is currently insufficient evidence to support a phenotype expansion involving ARMs (Table 2). The median pathogenicity scores of the genes listed in Table 1 (83.3%) and Table 2 (70.5%) are lower than the median pathogenicity score of the training set (98%) but exceed the median for all RefSeq genes (50%) indicated by the dashed line. This indicates that each of these groups is enriched for genes that are similar to the known ARMs genes in the training set. Epi = epigenetic histone modifications data from the NIH Roadmap Epigenomics Mapping Consortium, Exp = the GeneAtlas expression distribution, GO = Gene Ontology, MGI = the Mouse Genome Database, PINA = the Protein Interaction Network Analysis platform, TF = transcription factor binding data from the NIH Roadmap Epigenomics Mapping Consortium.

Having validated the algorithm, we generated ARM-specific pathogenicity scores for each gene. This was done by determining the centile rank of each gene as compared with all other RefSeq genes using an omnibus score based on the average fit generated using all knowledge sources. Hence, the ARM-specific pathogenicity score for each RefSeq gene ranges from 0 to 100% with a mean and median of 50%.
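
The centile-rank conversion can be illustrated with a short sketch. The mid-rank convention used here (ties share the average rank) is an assumption, since the text does not state how ties or endpoints were handled, and the gene names and omnibus scores are hypothetical. With this convention the ranks have a mean of exactly 50%, consistent with the description above.

```python
from statistics import mean, median


def centile_ranks(omnibus_scores):
    """Convert raw omnibus scores into centile ranks (0-100) across all genes."""
    n = len(omnibus_scores)
    values = list(omnibus_scores.values())
    ranks = {}
    for gene, score in omnibus_scores.items():
        below = sum(v < score for v in values)
        tied = sum(v == score for v in values)  # includes the gene itself
        ranks[gene] = 100.0 * (below + 0.5 * tied) / n
    return ranks


# Hypothetical omnibus scores for four placeholder genes.
ranks = centile_ranks({"GENE_A": 0.1, "GENE_B": 0.4, "GENE_C": 0.7, "GENE_D": 0.9})
print(ranks, mean(ranks.values()), median(ranks.values()))
```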

Statistical analysis

To compare the diagnostic yields between sub-cohorts, two-tailed Fisher’s exact tests were performed using a 2 × 2 contingency table calculator available through GraphPad Quick Calcs (https://www.graphpad.com/quickcalcs/contingency1/). To compare the diagnostic yields between individual ARM phenotypes, chi-square tests were performed using a 3 × 2 contingency table calculator available through Social Science Statistics (https://www.socscistatistics.com/tests/chisquare2/default2.aspx). Box plots were generated using the Alcula.com Statistical Calculator: Box Plot program (http://www.alcula.com/calculators/statistics/box-plot/).
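
For readers who wish to reproduce a 3 × 2 chi-square test without an online calculator, a minimal sketch follows. It is not the calculator used in the study. A 3 × 2 table has (3 − 1)(2 − 1) = 2 degrees of freedom, for which the chi-square p value has the closed form exp(−χ²/2). The counts in the example are hypothetical placeholders, not study data.

```python
from math import exp


def chi_square_3x2(table):
    """Pearson chi-square test for a 3 x 2 table (rows: groups; columns: outcomes)."""
    if len(table) != 3 or any(len(row) != 2 for row in table):
        raise ValueError("expected a 3 x 2 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 2 degrees of freedom.
    p_value = exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical counts: [diagnosed, not diagnosed] for three ARM phenotypes.
statistic, p_value = chi_square_3x2([[20, 10], [10, 20], [15, 15]])
```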

Results

Diagnostic yield of ES

From a cohort of ~17,000 individuals referred for clinical exome sequencing, we identified 130 individuals (including Subjects S1-S61) with imperforate anus/anal atresia, anal stenosis, or anteriorly placed anus, who had at least one additional birth defect or neurodevelopmental phenotype (syndromic ARMs). No cases of non-syndromic ARMs were referred for ES. A definitive (n = 30; 23.1%) or probable (n = 15; 11.5%) diagnosis was made in 45 individuals for a molecular diagnostic yield of 34.6% (45/130). Additionally, a provisional diagnosis was made in 16 individuals. If these were to be included, ES diagnostic yield would increase to 46.9% (61/130). The clinical and molecular data for all subjects in whom a definitive, probable, or provisional diagnosis was made are shown in Supplemental Table S1.
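
The yield arithmetic above can be checked with a few lines of code. The function name `diagnostic_yield` is hypothetical; the counts are those reported in this paragraph.

```python
def diagnostic_yield(definitive, probable, total, provisional=0):
    """Percentage of cases with a qualifying diagnosis."""
    return 100.0 * (definitive + probable + provisional) / total


strict = diagnostic_yield(30, 15, 130)                    # definitive + probable
inclusive = diagnostic_yield(30, 15, 130, provisional=16)  # adds provisional cases
print(f"{strict:.1f}% {inclusive:.1f}%")
```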

Of the 130 individuals with syndromic ARMs, 71 initially met criteria for VACTERL association, defined as having at least three VACTERL component features, and 59 did not. Considering only individuals with a definitive or probable diagnosis, the ES diagnostic yield for the sub-cohort that initially met criteria for VACTERL was 26.8% (19/71). This was significantly lower than the ES diagnostic yield for the sub-cohort that did not initially meet criteria for VACTERL association (44.1%, 26/59; p = 0.0437). If individuals with a provisional diagnosis were also included, the ES diagnostic yield for the sub-cohort that initially met criteria for VACTERL (40.8%, 29/71) was still less than that of the sub-cohort that did not initially meet criteria for VACTERL (54.2%, 32/59), but the difference was no longer statistically significant (p = 0.1586).
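
The sub-cohort comparison can be reproduced with a small two-tailed Fisher's exact test. The sketch below is not the GraphPad calculator used in the study; it assumes the common convention of summing the probabilities of all tables that are no more likely than the observed one, so small differences in the reported p values are possible if a different two-sided convention was used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    low, high = max(0, col1 - row2), min(row1, col1)
    observed = probability(a)
    # Sum every table as likely as, or less likely than, the observed table.
    total = sum(
        p for p in (probability(x) for x in range(low, high + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Rows: met / did not meet VACTERL criteria; columns: diagnosed / not diagnosed.
p_strict = fisher_exact_two_sided(19, 52, 26, 33)     # definitive or probable
p_inclusive = fisher_exact_two_sided(29, 42, 32, 27)  # provisional also included
print(p_strict, p_inclusive)
```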

Recurrently altered genes, and genes associated with ARMs

Putatively deleterious variants in several genes were recurrently identified in our cohort. These genes included ADNP (S4, S5), BBS1 (S11, S12), FGFR3 (S26, S30), KMT2D (S16, S31, S37, S38), LRP2 (S21, S32, S39), NIPBL (S31, S44), and SALL1 (S16, S30).

A subset of individuals in our cohort carried variants in genes that have previously been associated with an increased risk of developing ARMs. These genes included AMER1, ARID1A, BRCA2, CDH1, CHD7, DHCR7, FAM58A, FGFR3, GRIP1, JAG1, KAT6B, KIF7, KMT2D, MID1, MNX1, MYCN, NIPBL, POR, PQBP1, RAD51, SALL1, SALL4, and SPECC1L (Table S1). The remaining subjects in our cohort only had changes in genes not clearly associated with ARMs. These were considered ARM candidate genes.

To determine which of these candidate genes were most likely to contribute to the development of ARMs, we performed a literature review to identify previously published cases in which these genes were mutated in an individual with ARMs, or in which an individual with ARMs was diagnosed with one of their corresponding genetic syndromes. As an additional means of determining the likelihood that a candidate gene could contribute to the development of ARMs, we used a previously published machine learning algorithm to generate ARM-specific pathogenicity scores for all of the candidate genes [12, 13]. These scores represent the percentile rank of the similarity of each RefSeq gene to a set of 26 genes known to cause ARMs in humans or the human homologs of genes known to cause ARMs in mice [4, 14].

Among candidate genes that carried variants from which a definitive or probable diagnosis was made in our cohort, there is sufficient evidence to suggest a phenotypic expansion involving ARMs for ADNP, BBS1, CREBBP, EP300, FANCC, KDM6A, SETD2, and SMARCA4. The evidence for these associations is summarized in Table 1. In contrast, there is currently insufficient evidence to suggest that the other genes that carried variants from which a definitive or probable diagnosis was made are associated with the development of ARMs. The evidence for these genes is summarized in Table 2.

Table 1 Genes with sufficient evidence to support a phenotypic expansion involving ARMs.
Table 2 Genes with definitive or probable diagnoses that do not currently have sufficient evidence to support a phenotypic expansion involving ARMs.

We then compared the ARM-specific pathogenicity scores of the training genes, the genes for which there was sufficient evidence to suggest a phenotypic expansion involving ARMs (Table 1) and those for which there is currently insufficient evidence to support a phenotypic expansion involving ARMs (Table 2). As expected, the training set had the highest median score (98%), followed by the median scores of the genes for which there was sufficient evidence to support a phenotypic expansion involving ARMs (83.8%), and the median of the genes for which current data were insufficient to support a phenotype expansion (70.5%) (Fig. 1B). The medians of the genes listed in Tables 1 and 2 exceeded the median for all RefSeq genes (50%), indicating that each of these groups is enriched for genes that are similar to the known ARM genes in the training set.

Discussion

ES is widely used to identify genetic changes in individuals with multiple congenital anomalies, and the clinical utility of ES has been clearly demonstrated. ES has specifically been shown to be effective in identifying the molecular etiology of syndromic ARM cases [15]. However, uncertainty about its diagnostic yield may explain, in part, why ES is not universally ordered in individuals with syndromic ARMs. Here, we used data from 130 individuals to estimate the diagnostic yield of ES in syndromic ARM cases and to identify new phenotypic expansions.

High diagnostic yield of clinical ES in syndromic ARM

In this study we found that the molecular diagnostic yield of ES in individuals with syndromic ARMs was high: 34.6% (45/130) when considering only definitive and probable diagnoses and 46.9% (61/130) when provisional diagnoses were included. To our knowledge, this is the first study to specifically report on the molecular diagnostic yield of ES in this patient population.

Interestingly, the molecular diagnostic yield in individuals with syndromic ARMs who initially met criteria for VACTERL association was significantly lower than in those who did not meet criteria: 26.8% (19/71) vs. 44.1% (26/59); p = 0.0437. In a 2017 study, Meng et al. identified a similar trend, where the ES diagnostic yield for individuals with congenital heart defects (CHD) who initially met criteria for VACTERL association was relatively low compared to other non-VACTERL phenotypes [16]. Recently, Sy et al. also reported that the efficacy rate of ES in individuals with syndromic esophageal atresia/tracheoesophageal fistula (EA/TEF) who initially met criteria for VACTERL association was lower than that of individuals with EA/TEF who did not initially meet criteria (13% versus 18.2%), although this difference did not reach statistical significance [17]. These data suggest that tests designed to identify monogenic etiologies may have lower diagnostic yields in individuals who initially meet the criteria for VACTERL association.

In considering why we see a lower molecular diagnostic yield, we note that epigenetic factors have been described as possibly contributory to VACTERL association, and de novo epivariants have been associated with congenital anomaly syndromes [18, 19]. Non-genetic considerations such as the maternal risk factors of conception via assisted reproductive technologies, pregestational diabetes, and chronic lower obstructive pulmonary diseases are also associated with an increased risk of having a child with VACTERL association [20]. Further research into the genetic, epigenetic, and environmental factors that contribute to the development of VACTERL association is warranted.

Although the data presented here provide clear evidence that ES can be used to identify a molecular diagnosis in a significant percentage of syndromic ARM cases, we recognize the limitation imposed by the retrospective and deidentified nature of this study. A prospective, clinic-based study may provide confirmation of these findings and may also allow comparisons between the yields of ES and other genetic tests, such as chromosome microarray analysis (CMA), in individuals with syndromic ARMs.

Phenotypic expansions involving ARMs

ADNP

Pathogenic variants in ADNP are associated with Helsmoortel-van der Aa syndrome (HVDAS; MIM# 615873). HVDAS is characterized by intellectual disability, motor delay, autism spectrum disorder, hypotonia, dysmorphic facial features, vision complications, congenital heart disease, and gastrointestinal complications such as gastroesophageal reflux and constipation [21,22,23]. In our cohort, S4 and S5 carried de novo pathogenic frameshift variants in ADNP. S4 presented with anal stenosis, while S5 presented with an anteriorly placed anus. One other individual with HVDAS and an ARM has been described [24]. The identification of three individuals with HVDAS and syndromic ARM combined with ADNP’s high ARM-specific pathogenicity score (82.9%) lead us to conclude that individuals that carry pathogenic variants in ADNP can present with ARMs as part of HVDAS.

BBS1

Bardet-Biedl syndrome (BBS) is a genetically heterogeneous disorder [25]. Pathogenic variants in BBS1 are the cause of Bardet-Biedl syndrome 1 (BBS1; MIM# 209900) and are the most common cause of BBS, occurring in 23.4% of all individuals with this disorder [26, 27]. ARMs have been previously described in individuals with BBS. Specifically, Baheci et al. described an individual with BBS who had congenital anal atresia, and Hedge et al. described a 10-month-old female with BBS and an abnormal site of the anal opening [28, 29]. Unfortunately, molecular diagnoses were not reported for these individuals. Additionally, the Clinical Registry Investigating Bardet-Biedl Syndrome (CRIBBS) database includes a small percentage of individuals with anomalies of the gastrointestinal tract [30]. In our cohort, we identified two individuals, S11 and S12, who carried homozygous pathogenic variants in BBS1. These data, combined with the high ARM-specific pathogenicity score of BBS1 (84.6%), lead us to conclude that ARMs can be a presenting feature of BBS, particularly BBS1 caused by pathogenic variants in BBS1.

CREBBP and EP300

Pathogenic variants in CREBBP and EP300 are associated with Rubinstein-Taybi syndrome 1 (RTS1; MIM# 180849) and 2 (RTS2; MIM# 613684), respectively. RTS is characterized by developmental delay, postnatal growth deficiency, microcephaly, broad thumbs and halluces, and dysmorphic facial features [31]. Pathogenic variants in CREBBP are found in approximately 50–70% of individuals with RTS, while only 5–8% of individuals with RTS have pathogenic variants in EP300 [32]. Enomoto et al. reported one individual with a de novo pathogenic deletion in CREBBP and anal atresia, and Cohen et al. described two individuals with ARMs who carried EP300 variants [33, 34]. In our cohort, S18 carried a de novo pathogenic frameshift variant in CREBBP, and S22 carried a pathogenic frameshift variant in EP300. These data, along with their positive ARM-specific pathogenicity scores (CREBBP = 89.2%; EP300 = 78.1%), suggest that individuals with either RTS1 or RTS2 may present with ARMs.

FANCC

ARMs are a known feature of Fanconi anemia (FA). In our cohort there were five individuals with changes in genes associated with FA (S13 and S14, BRCA2; S24, FANCC; S25, FANCI; S52, RAD51). Biallelic changes in BRCA2 and heterozygous variants in RAD51 have been observed in individuals with ARMs [35, 36]. However, FANCC and FANCI have not been previously associated with ARMs. The diagnostic certainty for S24 was considered definitive as this individual carried a homozygous pathogenic FANCC variant. The ARM-specific pathogenicity score of FANCC is 89.0%. Taken together, these data suggest that FANCC is associated with ARMs. In contrast, the diagnostic certainty for S25 was considered provisional since this individual carried only a single pathogenic variant in FANCI, which is associated with an autosomal recessive form of FA. We also note that the ARM-specific pathogenicity score for FANCI was only 41.5%. Hence, there is currently insufficient evidence to support the association between FANCI and ARMs.

KDM6A

Variants in KMT2D, which are associated with Kabuki syndrome 1 (KABUK1; MIM# 147920), were identified in four Subjects (S16, S31, S37, S38), making it the most commonly affected gene in our cohort. The diagnostic certainty for S37 and S38 was definitive while the certainty for S16 and S31 was provisional. Although ARMs are known to be associated with Kabuki syndrome 1 [37,38,39,40], they are not a common feature of Kabuki syndrome 2 (KABUK2; MIM# 300867), which is caused by pathogenic variants in KDM6A [41]. S33 carried a de novo pathogenic variant in KDM6A and presented with an anteriorly placed anus. One other individual with Kabuki syndrome 2 and an ARM has been reported [42], and KDM6A has a positive ARM-specific pathogenicity score (71.1%). Taken together, these data suggest KDM6A is associated with the development of ARMs.

LRP2

Variants in LRP2, which is associated with Donnai-Barrow syndrome (DBS; MIM# 222448), were identified in three subjects (S21, S32, S39). However, their diagnoses of Donnai-Barrow syndrome were classified as provisional, and all three had variants in at least one other gene included on their ES report. The ARM-specific pathogenicity score for LRP2 is 64.1%. The presence of three ARM cases in our cohort suggests the possibility that LRP2 deficiency contributes to the development of ARMs. However, additional evidence is needed to confirm this association.

SETD2

Pathogenic variants in SETD2 are associated with Luscan-Lumish syndrome (LLS; MIM# 616831) which is characterized by macrocephaly, intellectual disability, speech delay, low sociability, and behavioral problems. Other more variable features include postnatal overgrowth, obesity, advanced carpal ossification, developmental delay, and seizures [43, 44]. In our cohort, S15 carried a de novo, pathogenic missense variant in SETD2 (c.5218C>T [NM_014159.7], p.(R1740W)). This individual presented with an anteriorly placed anus, ventriculomegaly, a cleft palate, congenital heart defects, bilateral 2–3 syndactyly of the hands and feet, a renal cyst, feeding difficulties, respiratory distress, and dysmorphic features. Rabin et al. previously reported two individuals with anteriorly placed anus who carry the same SETD2 pathogenic variant seen in S15 [45]. They suggested that this variant may have a gain-of-function effect and cause phenotypes that are divergent, and more severe, than those associated with LLS [45]. Specifically, they reported brain anomalies, cleft palate, congenital heart defects, abnormalities of the hands and feet, genitourinary anomalies, respiratory abnormalities, feeding difficulties, and dysmorphic features in individuals with the SETD2 c.5218C>T [NM_014159.7], p.(R1740W) variant.

Additionally, Lovrecic et al. reported two individuals with rare 3p21.31 deletions who presented with anal atresia [46]. SETD2 is located in the genomic region of overlap between the deletions identified in these individuals along with 12 other protein-coding genes, and a portion of MAP4 (Supplementary Fig. 1) [47]. The features of these genes are summarized in Supplementary Table 2. Among these genes, SETD2, SMARCC1, and DHX30 are loss-of-function intolerant with pLI scores of 1 in gnomAD [48]. However, only SETD2 has been observed to be independently associated with ARMs. The ARM-specific pathogenicity score for SETD2 is 77.6%. Hence, it is possible that pathogenic single nucleotide variants in SETD2, including the c.5218C>T [NM_014159.7], p.(R1740W) variant, and haploinsufficiency of SETD2 may predispose to the development of ARMs.

SMARCA4

In our cohort, there were 2 individuals who carried variants in genes associated with Coffin-Siris syndrome (S9, ARID1A; S6, SMARCA4). Pathogenic variants in ARID1A have been previously observed in individuals with ARMs, but variants in SMARCA4 have not [49]. S6 carried a de novo likely pathogenic SMARCA4 variant, and the ARM-specific pathogenicity score for SMARCA4 is 85.7%. These data suggest that deleterious variants in SMARCA4 may lead to the development of ARMs.

Clinical practice recommendations

These data suggest that ES should be considered for all individuals with syndromic ARMs in whom genetic testing has failed to identify a molecular diagnosis. Our data also suggest that additional testing aimed at identifying an independent cause of ARMs may not be warranted in individuals with a diagnosis of Helsmoortel-van der Aa syndrome, Bardet-Biedl syndrome 1, Rubinstein-Taybi syndromes 1 and 2, Fanconi anemia group C, Kabuki syndrome 2, SETD2-related disorders, or Coffin-Siris syndrome 4.