Introduction

Autism spectrum disorder is characterized by deviations and delays in the development of reciprocal social interaction and communication, in combination with restricted and repetitive behaviors and interests.1 The prevalence of the broad autism spectrum has recently been estimated to be approximately 1% of the childhood population, whereas the prevalence rate for autism is estimated at approximately 4 per 1000 births.2, 3 Phenotypically, autism is very heterogeneous, with varying degrees of severity and associated intellectual functioning.4 The large variety of neuropathological changes and the variability seen across subjects imply that autism is also etiologically heterogeneous.5

Cumulative evidence from family and twin studies suggests that genetic factors have an important role in the pathology of autism.5, 6 The genetic contribution to autism has been estimated to be as high as 90%.4, 7, 8, 9 Findings of cytogenetic abnormalities and single-gene disorders associated with autism indicate that the disorder is genetically complex, involving multiple (interacting) loci.4, 6, 7 Although few susceptibility loci have been consistently replicated, the overlap in linkage findings from genome scans suggests various regions that harbor autism susceptibility genes. Loci found in at least two independent linkage studies are in the regions 2q, 3q25–27, 3p25, 6q14–21, 7q31–36, and 17q11–21.4 However, each of these loci contains hundreds of genes, several of which have been implicated in autism. Other genes have been identified in independent association studies,7, 8, 9, 10 but no gene has been unequivocally shown to contribute to autism susceptibility.

The results of all molecular genetic studies point to a model of multiple genetic variants that supposedly can interact in various ways with regard to the phenotypic expression of autism.4 Recently, evidence appeared that small cytogenetic aberrations, including copy number variations (CNVs), might have important roles in autism.11, 12 Methods to detect these genome-wide provide a powerful alternative to traditional gene-mapping approaches for discovering susceptibility genes in autism.11, 13 Recent CNV studies suggest that lesions at many different loci can contribute to autism, a result consistent both with the findings from cytogenetic studies and with the failure to find causal variants.11 These CNVs can be recurrent, inherited, and/or arise de novo. A recent study showed that de novo variants were present in approximately 7% of idiopathic families having at least one child with autism spectrum disorder. The mean size of those de novo variants was approximately 5 Mb (median 3 Mb). The mean number of genes encompassed by these rare structural variants was over 30.14 It has been suggested that these structural variants may account for a larger fraction of the overall genetic risk than was previously assumed.15 Recently, rare microdeletions and microduplications were found in autistic individuals on 16p11.2, containing 25 annotated genes or transcripts, of which several could be considered good candidates for driving the phenotype on the basis of their expression in the brain or function in neurodevelopment.16

If these loci do confer susceptibility to autism, the considerable number of genes within each CNV suggests that some of these rare CNVs may represent a contiguous gene syndrome. Williams–Beuren syndrome, a neurodevelopmental disorder attributed to a deletion of 7q11.23, is a prime example of such a contiguous gene syndrome: the deleted region includes more than 20 genes, and it is believed that the characteristic features of this disorder are due to the loss of multiple genes, several of which are presumed to be responsible for a subset of the Williams–Beuren syndrome symptoms.17

Although it may be that within each autism susceptibility locus only a single gene raises the susceptibility to autism, we hypothesize here that for some of these loci autism resembles a contiguous gene syndrome, caused by (de novo) aberrations within multiple (contiguous) genes. Because multiple genes reside within these loci, such rare de novo structural mutations can disrupt multiple biological functions, resulting in phenotypes that can easily be related to the autistic phenotype, but that can also be rather atypical.18

To substantiate the hypothesis of autism as a (partly) contiguous gene syndrome, we sought evidence that certain atypical symptoms co-occur with autism: for a set of loci that have already been implicated in autism12 we systematically investigated all positional candidate genes and determined what symptoms are usually caused by aberrations within each of these genes. This analysis was performed by text mining the Online Mendelian Inheritance in Man (OMIM) database. OMIM contains considerable information on rare monogenic mutations that cause rare, most often serious, diseases. We hypothesized that these rare diseases with serious phenotypes can be informative for complex diseases with a more subtle phenotype, such as autism, that have not been described as extensively within OMIM. Some of the symptoms caused by rare monogenic variants may be found with a more subtle presentation in complex disorders that involve other, less deleterious variants within the same genes. This hypothesis is supported by recently identified genetic variants that influence adult height variation.19, 20, 21 For several of the genes to which these variants map, monogenic mutations are known (and reported within OMIM) that lead to rare syndromes and symptoms that affect skeletal development, such as skeletal dysplasia. Comparable findings now exist for other traits, for example, diabetes and lipid levels. As such, we argue that a study of rare syndromes and symptoms of genes in loci implicated in autism might be useful in identifying clinical features that are common to autism spectrum disorders.

Once we had determined what syndromes and symptoms could be caused by mutations in genes residing in the autism loci, we assessed for each identified symptom whether more than one of the autism susceptibility loci could cause this symptom and determined whether the number of loci that could cause this symptom (through affected genes within these loci) was significantly higher than expected by chance (Figure 1).

Figure 1

Overview of identification of overrepresented symptoms in autism loci. For many loci associated with autism, no genes have been unequivocally shown to be associated. We have assumed that a systematic analysis of symptoms caused by aberrations in positional candidate genes in these loci might reveal symptoms that are present more often than one would expect by chance, and thus might co-occur with autism. First, loci identified through cytogenetic and linkage analysis are used as input (Step 1). In this example three loci have been identified, each of them causing known diseases or syndromes when mutated (Step 2). For each disorder, we subsequently determine the associated symptoms (Step 3). MeSH is then used for the generation of standardized codes, which are hierarchically organized, allowing symptoms to be described both specifically and generically at the same time (eg, ‘spine’ is specific, ‘bone and bones’ is generic) (Step 4). Once all the symptoms have been recoded, it can be determined what symptoms are expressed per locus (Step 5), allowing for the identification of overrepresented ones (eg, ‘bone and bones’) (Step 6).

The identification of certain symptoms, reported more often in these loci than expected, would substantiate this hypothesis, and additionally might help to identify symptoms that have not yet been described to co-occur with autism and which could be relevant for the clinic.

Knowledge of these symptoms is relevant for genetic research as well: if a symptom is expressed in only a subset of patients, it might be worthwhile to condition on this symptom. Although such symptoms can arise because of aberrations within multiple loci, it may be that within these patients additional (shared) susceptibility loci exist that help cause this particular symptom (as for complex diseases, different genetic aberrations can result in identical symptoms and phenotypes through disruptions within the same biological cascades,17 the concept of convergence). By grouping individuals with autism who share such symptoms, it might subsequently be possible to increase statistical power to identify those additional susceptibility loci that are shared among these individuals. For example, in a recent study, three schizophrenia patients were identified with deletions in CNTNAP2, a known epilepsy susceptibility locus; it turned out that these individuals also had epilepsy.22

Taken together, defining subgroups on the basis of clinical presentation (representing disruptions within the same biological pathway) could be useful for follow-up research.23, 24

Materials and methods

Definition of susceptibility loci

Loci for autism were selected on the basis of evidence from both linkage and cytogenetic studies. Of all the linkage studies that we had previously analyzed,12 four studies that had at least one locus with a multipoint logarithm of the odds (LOD) score above 3.0 were included in the analysis.25, 26, 27, 28 Given that no unequivocal method of defining the extent of the region is provided in the literature, and that information about a LOD-1 drop region was not always present, boundaries of the linkage regions were pragmatically defined at both ends of a 20 Mb block centered on the most significantly linked marker in each locus.
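The boundary rule for linkage regions can be illustrated with a minimal sketch. The names `Locus` and `linkage_region`, the marker position, and the chromosome length are hypothetical and chosen only for illustration; clipping at the chromosome ends is an assumption of the sketch, as the text does not state how markers close to a chromosome end were handled.

```python
from typing import NamedTuple

MB = 1_000_000  # one megabase, in base pairs


class Locus(NamedTuple):
    chrom: str
    start: int  # base pairs
    end: int    # base pairs


def linkage_region(chrom: str, marker_pos: int, chrom_length: int,
                   width: int = 20 * MB) -> Locus:
    """Return a block of `width` bp centered on the most significantly linked marker.

    Clipping to [0, chrom_length] is an assumption of this sketch.
    """
    half = width // 2
    start = max(0, marker_pos - half)
    end = min(chrom_length, marker_pos + half)
    return Locus(chrom, start, end)


# Hypothetical marker position and chromosome length, for illustration only.
region = linkage_region("7", marker_pos=120 * MB, chrom_length=158 * MB)
```

With these illustrative values the region spans 10 Mb on either side of the marker.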

Definition of the Cytogenetic Regions of Interest (CROIs) was based on criteria that have been previously described.12 In short, regions on the human genome where multiple overlapping cytogenetic abnormalities co-occurred with an autism phenotype were identified through an extensive literature search. Only CROIs that contained more than five overlapping cases were included for analysis. Cases in which chromosomal mosaicism or a well-described gene mutation was the most likely genetic cause for autism were excluded (for example, patients with fragile X syndrome caused by FMR1 mutations). In total, we defined 13 loci, of which six were based on linkage data and seven were based on cytogenetic data (Table 1). The NCBI V35 assembly was used to physically map all markers, probes, and banding information.

Table 1 Autism susceptibility loci included in the analysis

Identification of syndromes and subsequent symptoms caused by aberrations

The OMIM database catalogs the majority of all known diseases that have genetic components, providing extensive information on both the clinical aspects and the genetic basis of these syndromes. We determined which syndromes were caused by aberrations that were (partly) overlapping with each of the 13 loci (Figure 1). OMIM provides a clinical synopsis describing the core symptoms of each disorder. As this information is both well organized and extensive, we chose this repository as the basis for collecting symptom information for each syndrome. We included only the core clinical manifestation information, and not the entries contained in the ‘miscellaneous,’ ‘molecular basis,’ and ‘inheritance’ sections, because these never describe actual symptoms. For each entry, only the complete text was used to prevent parts of phrases being incorrectly attributed (eg, ‘spot quality assessment’ was taken as a whole, because ‘spot’ on its own can be interpreted as a symptom under ‘Exanthema’).
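The selection of syndromes that (partly) overlap a locus amounts to an interval-overlap test. The sketch below is only an illustration: the function names, the syndrome names, and all coordinates are hypothetical, and in practice the positions would come from the mapped OMIM entries.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals on the same chromosome share at least one base."""
    return a_start <= b_end and b_start <= a_end


def syndromes_in_locus(locus, syndromes):
    """Return the names of syndromes whose mapped region (partly) overlaps the locus.

    locus: (chrom, start, end); syndromes: iterable of (name, chrom, start, end).
    """
    chrom, start, end = locus
    return [name for name, s_chrom, s_start, s_end in syndromes
            if s_chrom == chrom and overlaps(start, end, s_start, s_end)]


# Hypothetical coordinates, for illustration only.
syndromes = [
    ("syndrome_A", "7", 90, 120),   # partly overlaps the locus
    ("syndrome_B", "7", 300, 400),  # same chromosome, no overlap
    ("syndrome_C", "2", 100, 150),  # different chromosome
]
hits = syndromes_in_locus(("7", 100, 200), syndromes)
```

A partial overlap is sufficient for inclusion, which matches the "(partly) overlapping" criterion in the text.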

Subsequently, the Medical Subject Headings (MeSH) vocabulary29 was used to code these symptoms displayed within disorders in a standardized way. This transformation could be applied as the MeSH ontology is hierarchically organized, allowing one to describe specific symptoms (eg, ‘spine,’ MeSH code ‘A02.835.232.834’), but be generic at the same time (‘spine’ is part of the parent MeSH term ‘bone and bones,’ MeSH code ‘A02.835.232’). As such, slightly different but related symptoms (eg, ‘skull,’ ‘spine,’ and ‘thorax’) all share a more generic parent MeSH term (‘bone and bones’), which enabled us to associate these symptoms with each other through a common parent term.
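Because a MeSH tree number encodes its own position in the hierarchy, the more generic parent terms can be derived by dropping trailing segments. The following minimal sketch uses the two tree numbers quoted above; the function names `mesh_ancestors` and `expand_terms` are hypothetical, and MeSH headings that carry several tree numbers are not handled here.

```python
def mesh_ancestors(tree_number, max_levels=None):
    """List the ancestor tree numbers, nearest parent first.

    `max_levels` limits how far up the hierarchy to walk (None = to the root).
    """
    parts = tree_number.split(".")
    ancestors = [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]
    return ancestors if max_levels is None else ancestors[:max_levels]


def expand_terms(tree_numbers, max_levels=2):
    """Return the given terms together with their parents and grandparents."""
    expanded = set()
    for tree_number in tree_numbers:
        expanded.add(tree_number)
        expanded.update(mesh_ancestors(tree_number, max_levels))
    return expanded


spine = "A02.835.232.834"           # 'spine', as quoted in the text
parent = mesh_ancestors(spine, 1)   # nearest parent: 'bone and bones'
generic = expand_terms({spine})
```

Two symptoms coded under different children of ‘bone and bones’ would both contribute the shared parent code after expansion, which is what allows them to be associated with each other.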

To ensure that the automatic assignment of clinical synopsis information to MeSH terms was performed with high accuracy, we also manually assigned all the symptoms for the syndromes contained in the 13 loci to MeSH terms. This manual curation resulted in a conversion table, which maps clinical synopsis entries to known MeSH terms (Supplementary Table S1). This allowed for the automatic extraction of information by text mining30, 31, 32, 33 of OMIM and MeSH, while the conversion table increased the yield of clinical synopsis assignments to MeSH.
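How a curated conversion table can raise the yield of an automatic assignment is sketched below. This is not the actual pipeline: the function names, the whole-phrase matching on a normalized string, the order of the two lookups, and the table entry ‘abnormal vertebrae’ are all assumptions made for illustration; only the two tree numbers are taken from the text.

```python
def normalize(entry):
    """Lower-case an entry and collapse whitespace so whole phrases compare equal."""
    return " ".join(entry.lower().split())


def assign_mesh(entry, mesh_index, conversion_table):
    """Map one clinical synopsis entry to MeSH tree numbers.

    The whole entry is matched, never a fragment of it. An exact match against
    MeSH is tried first; the curated table is the fallback (assumed order).
    """
    key = normalize(entry)
    if key in mesh_index:
        return [mesh_index[key]]
    if key in conversion_table:
        return list(conversion_table[key])
    return []


mesh_index = {"spine": "A02.835.232.834", "bone and bones": "A02.835.232"}
# Hypothetical curated entry, standing in for a row of the conversion table.
conversion_table = {"abnormal vertebrae": ["A02.835.232.834"]}

automatic = assign_mesh("Spine", mesh_index, conversion_table)
curated = assign_mesh("Abnormal  vertebrae", mesh_index, conversion_table)
unmatched = assign_mesh("spot", mesh_index, conversion_table)
```

Entries that match neither source remain unassigned rather than being matched on a fragment, in line with the whole-entry rule described for the clinical synopsis text.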

Analysis of over-represented symptoms in loci

We then traversed all MeSH terms, including those that had been explicitly mentioned, along with their more generic parent and grandparent MeSH terms, and determined in how many loci each MeSH term was reported at least once. Once this was assessed, we determined whether any of these MeSH terms had been described within more loci than expected by performing a 10 000-fold permutation analysis on the data. In each permutation, the 13 loci were shuffled randomly across the genome and the text mining analysis was performed again on these permuted loci. For each MeSH term, the number of shuffled loci in which this term had been described was determined and this number was compared with the original number of loci in which this term had been described. Consequently, after these 10 000 permutations, an empiric P-value could be determined for each MeSH term.
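The per-term permutation analysis can be sketched as follows. This is a simplified illustration under stated assumptions, not the original implementation: annotations are reduced to point positions carrying a set of terms, shuffled loci keep their original length and may overlap each other, ties count toward the P-value, and no pseudocount is added. None of these details are specified in the text, and all function names are hypothetical.

```python
import random
from collections import Counter


def loci_per_term(loci, annotations):
    """Count, per term, the number of loci in which it occurs at least once.

    loci: list of (chrom, start, end); annotations: list of (chrom, pos, terms).
    """
    counts = Counter()
    for chrom, start, end in loci:
        seen = set()
        for a_chrom, pos, terms in annotations:
            if a_chrom == chrom and start <= pos <= end:
                seen.update(terms)
        counts.update(seen)  # each term counts once per locus
    return counts


def shuffle_loci(loci, chrom_lengths, rng):
    """Place each locus, keeping its length, at a random position in the genome."""
    shuffled = []
    for _, start, end in loci:
        length = end - start
        eligible = [c for c in chrom_lengths if chrom_lengths[c] >= length]
        # Weight chromosomes by the number of valid start positions.
        weights = [chrom_lengths[c] - length + 1 for c in eligible]
        chrom = rng.choices(eligible, weights=weights)[0]
        new_start = rng.randint(0, chrom_lengths[chrom] - length)
        shuffled.append((chrom, new_start, new_start + length))
    return shuffled


def empirical_p_values(loci, annotations, chrom_lengths, n_perm=10_000, seed=0):
    """Fraction of permutations in which a term is seen in at least as many loci."""
    rng = random.Random(seed)
    observed = loci_per_term(loci, annotations)
    at_least = Counter()
    for _ in range(n_perm):
        permuted = loci_per_term(shuffle_loci(loci, chrom_lengths, rng), annotations)
        for term, n_observed in observed.items():
            if permuted[term] >= n_observed:
                at_least[term] += 1
    return {term: at_least[term] / n_perm for term in observed}
```

Fixing the random seed makes a run reproducible; a smaller `n_perm` is convenient for testing but gives coarser P-values.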

To identify potentially common symptoms, we only followed up MeSH terms that were present in at least four loci. As our strategy was to determine potentially relevant novel symptoms in autism, we deemed a symptom interesting when its empirically determined P-value was below 0.05. We assessed whether the number of identified symptoms with an empiric P-value below 0.05 was significantly higher than expected by means of a 1000-fold permutation analysis. We shuffled the loci randomly across the genome and determined for each permutation how many terms had a P-value below 0.05, using the same filtering as we had applied for the original loci. This enabled us to empirically determine whether the number of nominally significant symptoms was higher than expected.
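The filtering step and the second-level permutation test can be summarized in a short sketch. All term names and numbers below are invented toy values, the function names are hypothetical, and counting ties toward the P-value is an assumption; how the per-permutation P-values were obtained is not detailed in the text and is left outside this sketch.

```python
def count_hits(loci_counts, p_values, min_loci=4, alpha=0.05):
    """Number of terms present in at least `min_loci` loci with P below `alpha`."""
    return sum(1 for term, n_loci in loci_counts.items()
               if n_loci >= min_loci and p_values.get(term, 1.0) < alpha)


def global_p(observed_hits, permuted_hits):
    """Fraction of permutations yielding at least as many hits as observed."""
    return sum(h >= observed_hits for h in permuted_hits) / len(permuted_hits)


# Toy values, for illustration only.
loci_counts = {"term_a": 5, "term_b": 4, "term_c": 3, "term_d": 6}
p_values = {"term_a": 0.01, "term_b": 0.20, "term_c": 0.001, "term_d": 0.049}
observed = count_hits(loci_counts, p_values)   # term_a and term_d pass both filters
permuted_hits = [0, 1, 2, 0, 1, 3, 0, 1, 0, 2]  # hit counts from shuffled loci
p_global = global_p(observed, permuted_hits)
```

In the toy data `term_c` has the smallest P-value but is dropped because it occurs in fewer than four loci, which shows that the two filters act independently.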

Results

An overview of the 13 selected loci is shown in Table 1, along with the evidence for their inclusion (linkage results or cytogenetic region of interest). To ensure that the loci that were identified through cytogenetic analyses were potentially specific to autism and were not commonly deleted or duplicated, we investigated each locus in the Database of Common Genetic Variations.34 None of these loci were known to contain aberrations in healthy individuals as extensive as the ones observed within autism patients.

Once these loci had been defined, OMIM was assessed to determine which known syndromes are caused by mutations in each of these loci. Subsequent analysis of the clinical synopsis information for each syndrome and mapping to MeSH terms allowed us to extract a standardized set of symptoms. Use of the manually curated conversion table (Supplementary Table S1) resulted in the assignment of over 500 extra symptoms to MeSH terms. Although this increase in assignments was considerable and as accurate as possible, mining OMIM and mapping of symptoms to MeSH terms were sometimes problematic, as outlined in Table 2.

Table 2 Overview of difficulties for text mining in OMIM and using MeSH

Once all syndromes had been processed, we assessed per MeSH term the number of loci in which this term was mentioned. To establish whether any term was over-represented, that is, present in more loci than expected by chance, a permutation analysis was performed, which allowed for the determination of an empiric P-value for each term (Figure 1; Supplementary Table S2). As we had manually translated the clinical synopsis information for the syndromes that mapped within our 13 loci to MeSH terms, we wanted to ensure that clinical information for syndromes residing outside of these loci could also be mapped using this translation table. If a slightly different phrasing of symptoms had been used in syndromes that we had not manually assessed, as they mapped outside of our 13 loci, this could influence the accuracy of the empirically determined P-value. This was, however, not the case, as the results from an analysis that relied entirely on the automatic translation of clinical synopsis symptoms to MeSH terms (Supplementary Table S2) gave comparable results to an analysis that included the manual assignment (Supplementary Table S3).

As autism, Asperger's disorder, and Rett syndrome had already been described (OMIM numbers 209850, 608636, 607373, 300495, 312750, 608638, and 300497) in four out of the 13 loci, this allowed for an initial validation of our method. Symptoms mentioned for these syndromes could be attributed to the MeSH term ‘Child behavior disorders,’ for which the empirically determined P-value was 0.01 (Supplementary Table S3). To prevent a bias toward autism symptoms already described in OMIM, we excluded these syndromes from further analyses, along with autism-related syndromes that were defined within OMIM but mapped outside of our 13 loci (OMIM numbers 606053, 609378, 611015, 611016, 605309, 608049, 610676, 610836, 300425, 610838, 300496, 610908, 300672, 300624, 608631, 300494, 609954, 608781) (Table 3).

Table 3 Significantly over-represented symptoms mentioned in at least four loci

Although 33 over-represented symptoms were observed, it should be noted that many different symptoms had been assessed within this analysis, requiring us to control for multiple testing. To do this, we performed a permutation analysis to determine whether the number of 33 over-represented symptoms was significantly higher than expected. This was indeed the case (empiric P-value=0.037), indicating that some of the reported symptoms are likely to reflect true-positive findings. To ensure the robustness of this analysis, we assessed whether the number of genes within the CROIs differed from the average number of genes in the permuted loci, but did not observe a difference (1508 genes map within the CROIs, as opposed to an average of 1569 genes within the permuted loci, empiric P-value=0.47). Subsequent inspection of the most significantly over-represented symptoms (Table 3) suggests that some of these are related (Table 4). Notable are epilepsy/seizures and craniofacial abnormalities, as these have been previously implicated in autism.35, 36 Furthermore, the results indicate that most of these symptoms affect tissues that are of ectodermal origin (Figure 2a). These tissues develop in the first and second trimesters of pregnancy, and the symptoms affect many organs (Figure 2b).

Table 4 Clustering of significantly over-represented symptoms in at least four loci
Figure 2

Overview of overrepresented symptoms in autism loci. Overrepresented symptoms (empiric P-value < 0.05), present in at least four loci, are shown along with the responsible organs. When possible, symptoms were assigned to a trimester of pregnancy and germ layer. The majority of symptoms are of ectodermal origin, while the majority of affected organs develop in the first to second trimesters.

Discussion

Through text mining of syndromes caused by aberrations in 13 linkage regions and CROIs, this study suggests that various symptoms co-occur with autism that have not yet been widely studied or previously described: We found 33 symptoms that were present in these regions more often than expected by chance (nominal empiric P-value < 0.05). Through subsequent permutation analyses, we observed that this number of 33 over-represented symptoms is higher than expected. These observations support our hypothesis that autism might partly be a contiguous gene syndrome, in which the function of multiple positional candidate genes within the susceptibility loci is affected. This would result in various different clinical manifestations that might be quite atypical, but jointly might also be able to cause autism-like features, which is supported by reports on Xp22.3 deletions, in which patients show the variable association of apparently unrelated clinical manifestations.37, 38 Jointly, the multiple genes with their resulting clinical phenotypes could increase the probability of developing autism. Additional support for autism as a partly contiguous gene syndrome, and the probable existence of different subtypes, comes from CNV studies identifying rare variants covering multiple genes of different function in the etiology of autism.18 For some of the co-occurring symptoms, evidence already exists that they indeed have a role in autism. The most prominent are epilepsy/seizures and craniofacial abnormalities, which have been previously mentioned as possible genetically informative phenotypes in autism.35, 36

Epilepsy is one of the best known and validated associations with autism.39, 40, 41 It is much more common in people with autism than in the general population and, vice versa, it appears that autism and autistic-like conditions are more common in people with epilepsy. Recent studies suggest that more than one-third of the children with autism develop epilepsy.39, 41 About 15–20% of all people with autism had seizures before the age of 3 years.40 The prevalence rates of epilepsy and the types of seizures seem to depend on the level of mental retardation, age, and incidence of regression.39, 41 Not surprisingly, this comorbidity led to researchers proposing that these diseases share common pathophysiological mechanisms.41 The observed over-representation of seizures within this study supports these hypotheses because our method assumes that the same genetic background can yield both autism and other symptoms.

Minor physical anomalies, such as craniofacial abnormalities, in association with autism, have also been mentioned frequently.42, 43, 44, 45, 46 Numerous case reports of thalidomide-induced autism suggest abnormal development very early in the gestation, resulting in craniofacial abnormalities.4, 42, 45, 46 Although most of these physical anomalies are also sometimes observed in other developmental disorders and in normally developing children as well,47 craniofacial abnormalities might be potentially interesting because of their higher frequencies in autistic patients.48

Many parents report gastrointestinal symptoms in their autistic child,49 in line with the digestive system disease symptoms we report. Although gastrointestinal problems are also fairly common in normally developing children, it has been estimated that they affect 46–84% of autism patients.50, 51 Chronic diarrhea, increased bile fluid output, constipation, and increased intestinal permeability are the most frequently mentioned abnormalities in autistic children.49, 50, 51

Limited evidence is available for the involvement of cranial nerves in autism. Although not convincing, a few studies on thalidomide-induced autism have suggested that the cranial nerves could be dysfunctional.45, 52 In this form of autism, individuals showed abnormalities in eye movement and facial expression. Other support comes from the observation that the exposure period for individuals with thalidomide-induced autism lies between days 20 and 24 of gestation. Few neurons form in this period, but the motor neurons of the cranial nerves are a notable exception. Interestingly, these nerves operate the muscles of the ears, jaw, throat, tongue, face, and eyes.45

For other symptoms, such as skin, bone, and urogenital problems, hypokalemia, and hypogonadism, there is little evidence of association with autism. Although skin symptoms, such as eczema,53, 54, 55 and occasionally bone problems56, 57 have been reported, evidence for the presence of other symptoms is not available.

We should emphasize that although we observed a higher number of over-represented symptoms than expected (P=0.037; between 7 and 36 symptoms were found in the 1000 permutations), some of these symptoms are likely to be false positives, as none of them individually attained significance after stringent Bonferroni correction. Taken together, however, some of these symptoms might have a role in autism. They could be considered as inclusion or exclusion criteria for future research, defining etiologically more homogeneous subgroups of autistic patients in whom the same biological pathways are disrupted, whether or not by the same genetic aberration.

Limitations of our study

Although this method has identified various symptoms that are likely to co-occur with autism, we are aware of a number of limitations in our methodology. One important issue is that this study does not unequivocally prove that these symptoms, for which there is no evidence in the literature, are truly associated with autism. It could also be that they have never been studied, as in the clinical setting most attention is usually devoted to a triad of features: social impairments, communication impairments, and restricted repetitive behaviors and interests.

Another issue is how to determine what are the appropriate criteria for including a susceptibility locus. We tried to do this as carefully as possible, but it is possible that some are false positives. Other loci may well have been overlooked. Apart from these statistical power issues, there are no clear definitions on how to determine the exact boundaries of linkage regions and there is no consensus on whether to include only linkage regions that have shown significant linkage, or to also include loci that were suggestive of linkage. The cytogenetic regions of interest show comparable problems: how many overlapping cases are required to consider regions interesting is somewhat arbitrary, and again, how to define the boundaries of these loci accurately is open to discussion.

The use of OMIM and MeSH is not without its own problems: as OMIM was designed to be interpreted by humans, there was no immediate need to use a standardized system for coding phenotypes. Consequently, when performing automated text mining in OMIM, various problems became apparent (as described in Table 2). Although manual curation partly overcame these problems, as it enabled the assignment of a substantial number of extra symptoms to MeSH terms, a more standardized method for describing symptoms in OMIM and in MeSH is desirable, as previously suggested.30, 33, 58

Recently, some studies have been published that also use text mining to associate different types of information, a few of which also take OMIM and MeSH into account: Van Driel et al.33 associated different phenotypes with each other using OMIM and MeSH; Butte et al.30 associated phenotypes with expression data, and Lage et al.31 have associated syndromes with protein complexes through text mining of OMIM and protein–protein interaction studies. However, as far as we are aware, no study has used OMIM and MeSH to assess whether there are any symptoms over-represented in multiple loci that have been implicated in complex diseases, such as autism, to provide leads for the involvement of unreported symptoms in these disorders.

Although much work remains to be performed to validate the actual co-occurrence of these symptoms in autistic patients, this study might be useful in pointing to ways for better characterizing patients, thereby providing new avenues for biologically informative phenotypes, which could lead to the identification of etiologically more homogeneous groups in patients and increase the statistical power to detect genetic associations. In addition, this method can easily be applied to other psychiatric disorders, as the input for our method consists solely of a set of susceptibility loci and an optional OMIM ‘Clinical Synopsis to MeSH term’ conversion table. This method will allow researchers to gain insight into the potential involvement of unreported symptoms associated with other psychiatric disorders as well.