Introduction

Autism spectrum disorder (ASD) is a neurodevelopmental condition that results in behavioral, social and communication impairments. It is currently estimated that 1 in every 88 children in the United States is affected by ASD, with boys five times more likely to be affected than girls.1 ASD has a substantial genetic component,2, 3, 4 with 88–95% monozygotic twin concordance and an estimated heritability of 60–90%.5 A recent study showed that a large proportion of the variance in liability among monozygotic twins can be explained by shared environmental factors (55% for autism and 58% for ASD) in addition to moderate genetic heritability (37% for autism and 38% for ASD).6 Studies indicate that multiple genetic factors have a role in the etiology of autism. Recent findings have provided evidence in support of roles for de novo mutations,7, 8, 9, 10 common genetic variants,11 rare variants12 and copy-number variation.13, 14, 15 Nevertheless, the genetic basis of the majority of ASD cases remains largely unclear.

Contributing to the complexity, ASD linkage studies have uncovered over 70 susceptibility loci across the genome and a large number of gene candidates,16, 17 but most of these findings have not been successfully replicated. The only exceptions to this trend have been linkage peaks on 17q11–17q21 (refs 18, 19, 20, 21) and 7q (refs 22, 23, 24, 25, 26). Yet, linkage and association studies have dominated the approaches to disentangle the genetic etiology of autism for more than two decades, leaving behind a rich legacy of research findings in the biomedical literature. Reports of significant linkage peaks represent an important clue to the genetic cause of autism that should not be ignored, even in the absence of sufficient replication. Aside from the possibility of false positives, absence of replication could be due to several factors such as insufficient sample size, differential recombination rates in the replication population, lower coverage in the replication samples of genetic markers in the linkage peaks or batch effects. However, the mechanistic relevance of the marker should still be determined. For example, a marker may designate collections of genes involved in biological processes or individual genes with mutations of high importance to the susceptibility to autism. Furthermore, once these markers have achieved the minimum significance threshold of a logarithm-of-the-odds score of 3.0 or an association P-value of <0.05 (corrected for multiple testing), they are usually treated as equally important to the etiology of autism. Therefore, despite the fact that markers provide maps, the granularity of those maps is insufficient to direct prioritized experimental follow-up, as every marker, and every gene proximal to that marker, is treated as equally likely to be important. Given that markers have been identified on nearly every chromosome, the utility of linkage studies for providing specific gene leads and directing further experimental research is limited.

In the present study, we have focused on maximizing the value of previously published linkage and association findings using families from the Autism Genetic Resource Exchange (AGRE) project for directing further genetic analysis of autism. Specifically, our aim was to provide finer resolution to published linkage and association studies through a novel analytical strategy focused on marker-to-gene male-specific genetic distance. Our study was loosely predicated on the assumption that genes in tight linkage with a susceptibility locus are more likely to be linked with the phenotype of interest, that is, autism, and was leveraged by the collective understanding that the disorder has a substantial male bias. As such, our work focused on reconstructing the male-specific structure of linkage disequilibrium (LD) surrounding significant autism markers and on assigning genes to sets in tight, medium and distant LD with those markers. We examined the biological signal inherent to each set and measured its expression in peripheral blood and postmortem brain tissue from individuals with autism as compared with controls. This strategy improves the resolution of marker-based findings by pointing to the specific genes that contribute to the linkage and/or association signals and are therefore more likely to have a role in ASD. A large percentage of these genes had not been previously linked to autism but had been implicated in numerous other neurological diseases, including those with overlapping symptoms. Given the ability of this strategy to identify important and novel signals among the rich collection of research findings from various linkage and association studies in autism, we anticipate that it will have broader applications in the study of other complex genetic disorders in which a large collection of samples has been previously genotyped and is not immediately available for modern sequencing.

Materials and methods

Autism marker selection

We first mined the autism literature to identify genetic studies focusing on AGRE families. Owing to the focus on AGRE families, all probands included here were assessed and diagnosed using the same instruments and procedures. We identified 67 reports of significant autism linkage and association signals spanning 18 chromosomes (Table 1). Significance thresholds were a logarithm-of-the-odds score >3, which is suggestive evidence of linkage, or a corrected association P-value <0.01 (depending on the number of markers tested in the study). The search was restricted to studies performed on AGRE families because the same subjects were used to calculate the genetic map around autism markers. This strategy allowed us to capture the true rates of recombination in the studied population and avoid any potential recombination bias. As the linkage and association studies were based on various experimental designs, we developed the strategy described below to enable their meta-analysis.

Table 1 Autism markers identified in AGRE families between 2001 and 2012

Each marker was first mapped to the NCBI human genome build 36.3. Then, a 20-Mb slice flanking that genomic coordinate was retrieved and the single-nucleotide polymorphisms (SNPs) within that region were used for calculating a genetic map using the same subjects’ genotypes.11 The SNP nearest to the autism marker was used as the reference, and recombination rates with all other SNPs were determined with respect to it. We assumed that the recombination rate between the marker and the nearest SNP was negligible, enabling us to designate that SNP as a proxy for the marker. Owing to the heterogeneity in the discovery methods of the various regions (linkage vs association, copy-number variations vs SNPs and so on), we treated each region as equally significant. This enabled us to use an unbiased approach in finding genes and regions that were enriched for autism cases.
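
The window extraction and proxy selection described above can be illustrated with a minimal Python sketch. The function names and the example coordinates are hypothetical and are not taken from the study; the sketch assumes that SNP positions and the marker coordinate are given in base pairs on the same chromosome and genome build.

```python
WINDOW_BP = 10_000_000  # 10 Mb on each side of the marker, i.e. a 20-Mb slice


def snps_in_window(snp_positions, marker_bp, half_window=WINDOW_BP):
    """Return the sorted SNP positions lying within +/- half_window of the marker."""
    lo, hi = marker_bp - half_window, marker_bp + half_window
    return [p for p in sorted(snp_positions) if lo <= p <= hi]


def nearest_snp(snp_positions, marker_bp):
    """Return the SNP closest to the marker (ties go to the lower coordinate)."""
    return min(snp_positions, key=lambda p: (abs(p - marker_bp), p))


# Hypothetical coordinates, for illustration only.
snps = [1_000_000, 14_900_000, 15_200_000, 24_000_000, 30_000_000]
marker = 15_000_000
window = snps_in_window(snps, marker)
proxy = nearest_snp(window, marker)
```

In this toy example the proxy SNP lies 100 kb from the marker; treating such a SNP as the marker relies on the assumption, stated above, that recombination between the two is negligible.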

Calculation of LD structure of autism markers

In order to establish the male-specific LD structure between genes and autism markers, we created genetic maps from a 20-Mb slice of the chromosome flanking each linkage locus. Specifically, we collected and assembled SNPs 10 Mb upstream and 10 Mb downstream of each autism marker using the SNP data for AGRE probands.11 As autism is almost five times more prevalent in males, we filtered out the females from the data set before calculating the genetic map. These filtration procedures followed the logic that an AGRE-specific, male-only genetic map would be the most likely to provide an accurate reflection of the samples contributing to the linkage and association signals reported in the pooled studies.

To create the genetic maps for each autism marker, we estimated fine-scale recombination rates using the LDhat software package.27 This program estimates recombination rates between adjacent SNPs by fitting a Bayesian model based on coalescent theory to analyze patterns of LD in the data. We conducted this analysis for all 67 markers, identifying the male-specific genetic distances between the marker and genes surrounding that marker, measured in cM. For further filtering, we pruned the genetic map to 15 cM around the marker. A process flow for the creation of these LD structure (LDS) sets is depicted in Figure 1.
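
As an illustration of how a marker-to-gene genetic distance can be read from such a map, the following Python sketch accumulates per-interval recombination rates into a cumulative map and interpolates between SNPs. It is not the LDhat procedure itself: it assumes that the rates have already been converted to cM/Mb (LDhat reports population-scaled rates, and that conversion is not shown), and all names and numbers are hypothetical.

```python
import numpy as np


def cumulative_map_cm(positions_bp, rates_cm_per_mb):
    """Cumulative genetic position (cM) at each SNP.

    rates_cm_per_mb[i] is the assumed rate between SNP i and SNP i + 1,
    so it has one fewer entry than positions_bp.
    """
    positions_bp = np.asarray(positions_bp, dtype=float)
    rates = np.asarray(rates_cm_per_mb, dtype=float)
    interval_cm = rates * np.diff(positions_bp) / 1e6
    return np.concatenate(([0.0], np.cumsum(interval_cm)))


def distance_from_reference_cm(positions_bp, rates_cm_per_mb, ref_bp, query_bp):
    """Absolute genetic distance (cM) between the reference SNP and a query position."""
    cm = cumulative_map_cm(positions_bp, rates_cm_per_mb)
    # Linear interpolation between flanking SNPs; positions must be increasing.
    ref_cm, query_cm = np.interp([ref_bp, query_bp], positions_bp, cm)
    return float(abs(query_cm - ref_cm))


# Hypothetical map: four SNPs and three inter-SNP rates.
positions = [0, 1_000_000, 3_000_000, 4_000_000]
rates = [1.0, 2.0, 0.5]
d_gene = distance_from_reference_cm(positions, rates, 1_000_000, 2_000_000)
```

Pruning the map then amounts to keeping only genes whose distance from the reference SNP falls below the chosen cM cutoff.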

Figure 1

Integrative genomics workflow for prioritizing candidate genes for further experimentation. (I) The rich collection of genetic studies performed on Autism Genetic Resource Exchange (AGRE) families between 2001 and 2012 was mined to identify genome-wide significant linkage and association signals. (II) Markers were remapped to the current genome build (NCBI human genome build 36.3) and flanking regions extracted. (III) Single-nucleotide polymorphism (SNP) genotypes of AGRE male probands were compiled to enable male-specific genetic distance calculations in the same subjects. (IV) Regional recombination rates between markers and SNPs were calculated and (V) protein-coding genes within 20 male-specific cM from the markers identified. (VI) The expression profiles of these genes were examined in brain and blood of individuals with autism spectrum disorder (ASD) relative to neurotypical individuals. Genes found to be differentially expressed in both tissues and located within the male-specific vicinity of a significant autism marker are considered prime candidates for further studies. Of 30 genes that satisfy these criteria, 19 were previously implicated in disorders that share symptoms and morbidity patterns with ASD.

Messenger RNA expression data processing

Gene Expression Omnibus data sets GSE6575 (ref. 28) and GSE28521 (ref. 29) were used to examine the expression of genes surrounding significant autism markers in individuals with ASD. The GSE6575 data set consists of 17 samples of individuals with ASD without regression, 18 individuals with ASD with regression, 9 patients with mental retardation or developmental delay, and 12 typically developing children from the general population. In this previous study, total RNA was extracted from whole blood samples using the PaxGene (Qiagen, Germantown, MD, USA) Blood RNA System and run on Affymetrix U133plus2.0 (Santa Clara, CA, USA). For the purposes of our study, we elected to use the 35 individuals with autism and 12 control samples from the general population. Preprocessing and expression analyses were done with the Bioinformatics Toolbox Version 2.6 (for Matlab R2007a+, Mathworks, Natick, MA, USA). GeneChip Robust Multi-array Average was used for background adjustment, and control probe intensities were used to estimate nonspecific binding.30 Housekeeping genes, gene expression data with empty gene symbols, genes with very low absolute expression values and genes with low variance were removed from the preprocessed data set.

The GSE28521 data set consisted of postmortem brain tissue samples from 19 autism cases and 17 controls from the Autism Tissue Project, profiled using the Illumina (San Diego, CA, USA) HumanRef-8 v3.0 expression beadchip panel. Three regions of the brain previously implicated in autism were profiled in each individual: superior temporal gyrus (also known as Brodmann’s area 41/42), prefrontal cortex (BA9) and cerebellar vermis. Raw data were log2 transformed and normalized by quantile normalization. We retained probes with detection P-value<0.05 in at least half of the samples for further analysis, as described previously.29 Raw P-values were generated using the limma Bioconductor package in R (http://www.bioconductor.org/packages/2.12/bioc/html/limma.html), and Benjamini and Hochberg multiple testing correction was applied to obtain adjusted P-values.
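
The normalization and probe-filtering steps can be sketched as follows. This is a simplified NumPy illustration rather than the pipeline used for GSE28521: the function names and input values are hypothetical, the intensity matrix is assumed to be probes by samples, and ties within a sample are broken by sort order instead of being averaged.

```python
import numpy as np


def quantile_normalize(x):
    """Quantile-normalize the columns (samples) of a probes x samples matrix."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each value within its column
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)
    return mean_quantiles[ranks]                  # replace each value by the mean quantile


def keep_detected(detection_p, alpha=0.05, min_fraction=0.5):
    """Mask of probes with detection P < alpha in at least min_fraction of samples."""
    detection_p = np.asarray(detection_p, dtype=float)
    return (detection_p < alpha).mean(axis=1) >= min_fraction


# Hypothetical raw intensities (4 probes x 3 samples) and detection P-values.
raw = np.array([[32.0, 16.0, 8.0],
                [4.0, 2.0, 16.0],
                [8.0, 8.0, 64.0],
                [16.0, 4.0, 256.0]])
normalized = quantile_normalize(np.log2(raw))
mask = keep_detected([[0.01, 0.20, 0.03],
                      [0.50, 0.60, 0.01]])
```

After normalization every sample shares the same distribution of values, which is what makes probe-level comparisons between cases and controls meaningful.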

Gene expression profiles around common autism markers

To examine the importance of genes at varying cM distances, and to examine the level of signal relevant to autism surrounding each autism marker individually, we treated each marker region as an independent hypothesis. We then examined the differential regulation of genes within LDS sets using the messenger RNA expression profiles described above. Our hypothesis is that genes at close genetic distances from autism markers will be more differentially regulated than genes not in LD with the autism markers.

Our tests for significant differential expression deviated from standard analyses of microarray data for the primary reason that each LDS set reflected independent, prior biological knowledge. As such, we treated each LDS set as a separate collection of hypotheses, with the number of hypotheses being tested simultaneously equivalent to the number of genes in the set. To appropriately account for this multiple testing, we adjusted the nominal P-values using the q-value calculation,31 a measurement framed in terms of the false discovery rate.32 All 67 LDS sets were investigated in this way to determine the frequencies of significant, adjusted P-values (q<0.05) surrounding each autism marker.
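
A minimal Python sketch of this per-set adjustment is given below. It substitutes Benjamini-Hochberg adjusted P-values for the q-value calculation; to our understanding these coincide with q-values only when the proportion of true null hypotheses is taken to be 1, so the sketch is, if anything, conservative. The marker and gene names are placeholders, and each set is assumed to be non-empty.

```python
import numpy as np


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values (stand-in for q-values)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


def significant_fraction_per_set(lds_sets, alpha=0.05):
    """Adjust within each LDS set separately; return the fraction of genes below alpha."""
    return {marker: float((bh_adjust(list(genes.values())) < alpha).mean())
            for marker, genes in lds_sets.items()}


# Placeholder LDS sets: marker -> {gene: nominal P-value}.
lds_sets = {
    "marker_1": {"GENE_A": 0.01, "GENE_B": 0.04, "GENE_C": 0.03, "GENE_D": 0.5},
    "marker_2": {"GENE_E": 0.001, "GENE_F": 0.002},
}
fractions = significant_fraction_per_set(lds_sets)
```

Because the correction is applied within each set, the number of hypotheses equals the number of genes in that set, as described above, rather than the number of probes on the array.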

Disease cross-referencing

We mined eight existing gene-disease annotation resources for genes associated with neurological disorders considered to be closely related to autism.33 Diseases included tuberous sclerosis, epilepsy, seizure disorder and many others with established behavioral similarities to ASD. The databases examined included the Genetic Association Database,34 Database of Genomic Variants (http://projects.tcag.ca/variation/), dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/), HuGE Navigator,35 Human Gene Mutation Database (http://www.hgmd.cf.ac.uk/ac/index.php), Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/omim/), GeneCards (http://www.genecards.org/) and SNPedia (http://snpedia.com/index.php/SNPedia). Results from these resources were integrated to create a list of genes and associated gene characteristics, which was used for comparisons with the autism LDS genes.

Results

More than 200 genetic studies were conducted on AGRE families between 2001 and 2012. These were mined to identify 67 genome-wide significant linkage and association signals for ASD (Table 1). Common markers for autism span 18 chromosomes, all with a logarithm-of-the-odds score >3 or a corrected association P-value<0.01. These studies were based on various experimental designs, mostly using multiplex families with affected sib-pairs. We calibrated the positions of significant markers using NCBI human genome build 36.3, and then aggregated all SNPs within a 10-Mb window on either side of the marker to calculate the male-specific structure of LD around each marker. Examining the recombination rates in the same subjects allowed us to build a population-specific genetic map, eliminating any genetic bias that might arise from considering ethnicity-matched controls.

Our calculations of recombination rates and LD between SNPs and common autism markers identified a total of 1426 genes within 25 cM of the markers. Of those, 697 protein-coding genes were within 5 cM, 450 between 5 and 10 cM and 212 between 10 and 15 cM from the nearest autism locus (Figure 2). Both recombination rates and gene densities varied extensively among autism markers (28.1±7.3 cM in the 20-Mb region around markers, spanning 35.4±10.4 genes). There was a strong correlation (rho=0.7) between the size of the genetic map and the proportion of genes at distances >10 cM. The highest density of genes was around RFWD2 and PAPPA2 on chromosome 1, in a copy-number variation-associated region encoding 60 genes within 24 cM. In this region, 48% and 90% of the genes fell within 5 and 10 cM, respectively, indicating that LD was well preserved with increasing distance from the autism locus. In contrast, in the region around a common copy-number variation near UNQ3037 on chromosome 3, 73% of genes lay at a distance >10 cM.
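
The grouping of genes into distance bins can be sketched in a few lines of Python. The bin labels follow the tight, medium and distant LD terminology used in the Introduction; the treatment of boundary values, the cutoff passed as max_cm and the gene names are assumptions made for illustration only.

```python
from collections import Counter


def bin_genes(distances_cm, edges=(5.0, 10.0), max_cm=None):
    """Count genes per distance bin; genes beyond max_cm (if given) are dropped.

    distances_cm maps gene symbol -> genetic distance (cM) from the nearest marker.
    Boundary values are assigned to the closer bin (an arbitrary choice).
    """
    counts = Counter()
    for distance in distances_cm.values():
        if max_cm is not None and distance > max_cm:
            continue
        if distance <= edges[0]:
            counts["tight"] += 1
        elif distance <= edges[1]:
            counts["medium"] += 1
        else:
            counts["distant"] += 1
    return counts


# Placeholder genes and distances.
example = {"GENE_A": 1.2, "GENE_B": 5.0, "GENE_C": 7.5, "GENE_D": 12.0, "GENE_E": 30.0}
counts = bin_genes(example, max_cm=20.0)
fraction_distant = counts["distant"] / sum(counts.values())
```

The fraction of genes in the distant bin is the per-marker quantity that was compared with the size of the genetic map.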

Figure 2

Number of genes within 20 cM of significant autism markers. Genetic distances were calculated using male-only Autism Genetic Resource Exchange (AGRE) proband single-nucleotide polymorphisms (SNPs).11 Genes were grouped into three distance bins indicating the extent of recombination with the autism marker. The figure displays the number of genes in tight linkage with the marker, and therefore the extent of recombination around each locus.

The preceding results indicate that the information content varies by marker and genetic distance, but do not directly demonstrate whether this information is of relevance to our understanding of the genetic etiology of autism. To test directly whether specific markers and/or regions surrounding those markers are more likely to contain promising new gene leads, we examined the regulatory patterns of each LDS set independently in two expression data sets obtained from the Gene Expression Omnibus: a blood-based messenger RNA expression data set from individuals with autism and controls (GSE6575)28 and a transcriptomic analysis of postmortem brain RNA (GSE28521). In the blood-based expression data set, although the large majority of genes showed no change in expression, 27 marker regions (40%) contained at least one gene with significant, multiple test-corrected differential expression (Table 2). More than 50% of the genes around markers on 3p26 (del CNTN4, del UNQ3037), 3q (D3S3045–D3S1763), 2q (rs17420138) and 5p (rs10513025) were differentially expressed in whole blood from individuals with ASD. In all, 79 genes were significantly differentially expressed at q<0.05 across all the marker sets, of which 31 (39%) and 60 (76%) lie within 5 and 10 cM of the nearest autism marker, respectively, further supporting the notion that the genes proximal to the markers represent more viable autism gene leads than genes further away.

Table 2 Differential expression of genes around common autism markers

In the postmortem brain tissue data there was an abundance of signal: 64 of the 67 LDS sets contained at least one gene at q-value<0.05. Regions around 41 markers contained gene sets with significant differential expression, defined as >50% of genes differentially expressed in at least one brain region between individuals with ASD and matched controls at a q-value threshold of 0.05. Of 383 genes showing evidence of differential expression at q<0.05, 205 (53%) and 323 (84%) lie within 5 and 10 cM of the nearest autism marker, respectively.
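
The set-level criterion can be expressed as a short Python sketch. It adopts one possible reading of the definition, namely that a gene counts as differentially expressed if its q-value is below the threshold in any of the profiled brain regions; the alternative reading, in which the 50% requirement is evaluated within a single region, would need a different aggregation. Gene and region names are placeholders.

```python
def differentially_expressed_fraction(q_by_gene, alpha=0.05):
    """Fraction of genes with q < alpha in at least one brain region.

    q_by_gene maps gene symbol -> {region name: q-value}.
    """
    if not q_by_gene:
        return 0.0
    hits = sum(1 for regions in q_by_gene.values()
               if any(q < alpha for q in regions.values()))
    return hits / len(q_by_gene)


def is_significant_set(q_by_gene, alpha=0.05, min_fraction=0.5):
    """True if more than min_fraction of the genes in the set are differentially expressed."""
    return differentially_expressed_fraction(q_by_gene, alpha) > min_fraction


# Placeholder LDS set with q-values in two regions.
lds_set = {
    "GENE_A": {"temporal": 0.01, "frontal": 0.30},
    "GENE_B": {"temporal": 0.20, "frontal": 0.04},
    "GENE_C": {"temporal": 0.60, "frontal": 0.70},
}
fraction = differentially_expressed_fraction(lds_set)
significant = is_significant_set(lds_set)
```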

Four markers were found to reside within a neighborhood of differentially expressed genes in both brain and blood of individuals with ASD. At least 50% of protein-coding genes around rs10513025, D3S3045–D3S1763, del CNTN4 and del UNQ3037 are differentially expressed in both tissues (Table 2). Three of these regions, the 20-Mb windows around del CNTN4, del UNQ3037 and rs10513025, show heavy recombination and contain 73%, 68% and 47% of genes, respectively, at >10 cM. Despite significant recombination within these regions, the genes significantly differentially expressed in both data sets were those closer to the autism marker. Of 30 genes found to be significantly differentially expressed in both blood and brain of individuals with ASD, 11 and 20 were within 5 and 10 cM of the nearest autism marker, respectively.
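
The cross-tissue intersection and the counts by genetic distance can be sketched in Python as follows. The input dictionaries, gene names and values are hypothetical; q-values and marker-to-gene distances are assumed to have been computed beforehand.

```python
def shared_candidates(blood_q, brain_q, distance_cm, alpha=0.05):
    """Genes with q < alpha in both tissues, mapped to their marker-to-gene distance (cM)."""
    blood_hits = {gene for gene, q in blood_q.items() if q < alpha}
    brain_hits = {gene for gene, q in brain_q.items() if q < alpha}
    return {gene: distance_cm[gene] for gene in sorted(blood_hits & brain_hits)}


def count_within(candidates, cutoff_cm):
    """Number of candidate genes lying within cutoff_cm of the nearest marker."""
    return sum(1 for distance in candidates.values() if distance <= cutoff_cm)


# Placeholder inputs.
blood_q = {"GENE_A": 0.01, "GENE_B": 0.20, "GENE_C": 0.03, "GENE_D": 0.04}
brain_q = {"GENE_A": 0.02, "GENE_B": 0.01, "GENE_C": 0.04, "GENE_D": 0.30}
distance_cm = {"GENE_A": 2.5, "GENE_B": 4.0, "GENE_C": 8.0, "GENE_D": 12.0}

candidates = shared_candidates(blood_q, brain_q, distance_cm)
within_5 = count_within(candidates, 5.0)
within_10 = count_within(candidates, 10.0)
```

The counts are cumulative, so genes within 5 cM are also included in the 10 cM total, as in the figures reported above.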

Integrating a decade of genome-wide linkage and association studies, the male bias of ASD and differential expression in both brain and blood of individuals with ASD has identified a set of 30 prime candidates for future experimentation, such as efficient targeted resequencing in very large cohorts.36 Of these, CADPS2, CNTN4, NTRK3, SLC9A9 and SUMF1 have been previously implicated in ASD. Other differentially expressed genes within 20 male-specific cM of common autism markers have been implicated in disorders with shared symptoms and morbidity patterns, but have not yet been implicated in ASD (Table 3).

Table 3 Top candidate genes based on integrating a decade of genome-wide linkage and association studies, the autism male bias and differential expression in brain and blood of individuals with autism spectrum disorder (ASD)

Discussion

Despite the high heritability of autism, efforts to identify its genetic causes have enjoyed only limited success. Numerous susceptibility loci have been identified, yet few have been replicated, supporting the notion that the genetic complexity of this disorder outmatches the proportion of the population with autism that has been sampled to date. Until the sampling adequately covers the diversity of genetic systems underlying ASD, we must develop analytical approaches to make optimal use of existing results. To this end, we focused here on the development of a simple strategy aimed at targeting previously published autism markers, as well as genes genetically proximal to those markers and most likely to be causally related to ASD. By coupling the structure of LD with knowledge of biological process and patterns of gene expression data from individuals with ASD, we were able to identify a set of markers and genes proximal to those markers likely to be most informative to the genetic basis of autism. Specific loci on a few chromosomes including three signals on chromosome 3 and one on chromosome 5 yielded the greatest signal, with a sizable percentage of adjacent genes showing highly significant differential expression in blood and brain data from individuals with autism. In support of their relevance to the genetics of autism, many of the differentially expressed genes closely linked to the markers have already been identified as promising autism gene candidates, such as CNTN4, CADPS2, SUMF1, NTRK3 and SLC9A9. In addition, an even greater percentage of these genes have been linked to neurological diseases with high comorbidity and behavioral similarities to ASD.

Overall, our strategy provides a means for meta-analysis of previous linkage and association studies to prioritize both markers and adjacent genes for further experimental analysis. Although our results corroborate the general rule of thumb that genes close to loci identified via linkage and association studies are likely to be informative to the disease under study, they stress that this rule only applies to specific markers. Given its successful application to autism research, we expect that our analytical strategy could be of general use in the study of other similarly complex genetic diseases, such as Alzheimer’s disease and type 1 diabetes.