UTMOST, a single and cross-tissue TWAS (Transcriptome Wide Association Study), reveals new ASD (Autism Spectrum Disorder) associated genes

Autism spectrum disorders (ASD) is a complex neurodevelopmental disorder that may significantly impact on the affected individual’s life. Common variation (SNPs) could explain about 50% of ASD heritability. Despite this fact and the large size of the last GWAS meta-analysis, it is believed that hundreds of risk genes in ASD have yet to be discovered. New tools, such as TWAS (Transcriptome Wide Association Studies) which integrate tissue expression and genetic data, are a great approach to identify new ASD susceptibility genes. The main goal of this study is to use UTMOST with the publicly available summary statistics from the largest ASD GWAS meta-analysis as genetic input. In addition, an in silico biological characterization for the novel associated loci was performed. Our results have shown the association of 4 genes at the brain level (CIPC, PINX1, NKX2-2, and PTPRE) and have highlighted the association of NKX2-2, MANBA, ERI1, and MITF at the gastrointestinal level. The gastrointestinal associations are quite relevant given the well-established but unexplored relationship between ASD and gastrointestinal symptoms. Cross-tissue analysis has shown the association of NKX2-2 and BLK. UTMOST-associated genes together with their in silico biological characterization seems to point to different biological mechanisms underlying ASD etiology. Thus, it would not be restricted to brain tissue and it will involve the participation of other body tissues such as the gastrointestinal.


Introduction
Autism spectrum disorders (ASD) includes a range of neurodevelopmental disorders (NDDs) with onset in early development that are characterized by deficits in communication and social interactions, as well as by repetitive patterns of behavior and restrictive interests 1 . ASD is a complex genetic disorder, involving both environmental and genetic factors. Although an important part of the genetic architecture of ASD is unknown, it is considered that thousands of genes may be involved even most of them remain unidentified and functionally uncharacterized. Rare genetic variation only explains 3% of ASD genetic risk even if it confers a high individual risk 2 . However, common variation (SNPs; single nucleotide polymorphisms) could explain about 50% of ASD heritability. The most recent and the largest ASD GWAS meta-analysis done until now, including 18,381 ASD cases and 27,969 controls, has reported 93 genome-wide significant markers in three separate loci (top SNP: rs910805; p-value: 2.04 × 10 −9 ) 3 . Another methodological approach for common variation are gene-based association analysis (GBA) methods that employ the p-values for each SNP within a gene to obtain a single statistic at this level.
Thus, MAGMA has identified 15 genes, most of them located near the genome-wide significant SNPs identified in GWAS, but 7 genes have revealed association in four additional loci (KCNN2, MMP12, NTM, and a cluster of genes on chromosome 17) 3 . Additional GBA methods using other algorithms as PASCAL have helped to define the association of other genes located in the same LD region than those found by MAGMA (NKX2-4, NKX2-2, CRHR1-IT1, C8orf74, and LOC644172) 4 .
In addition to GBA, bioinformatic approaches that integrate functional data are increasingly used to highlight new genes underlying GWAS summary statistics. Transcriptomewide association studies (TWAS) have emerged as useful tools to study the genetic architecture of complex traits. Among them, MetaXcan 5 and FUSION 6 are well-known TWAS methods.
UTMOST (unified test for molecular signatures) has been recently reported as a novel framework for single and cross-tissue expression imputation. UTMOST is able to consider the joint effect of SNPs (summary statistics) across LD regions (1000 Genomes) and to integrate tissue expression data (GTeX) creating single and cross-tissue covariance matrices that will help to define the gene-trait associations. UTMOST performance was demonstrated at several levels and its accuracy was also well proved as it was able to identify a greater number of associations in biologically relevant tissues for complex diseases 7 .
The main aim of this paper is to further mine the summary data from the largest ASD meta-analysis using UTMOST. In addition, an in silico biological characterization for the novel associated loci will be carried out using bioinformatic approaches (DEG, pathway, gene network, and an exploratory enhancer analysis).
Overall, our results have demonstrated the association of CIPC, PINX1, NKX2-2, and PTPRE at the brain level and have also revealed the relevance of gastrointestinal tissue in ASD etiology through the association of other genes (NKX2-2, MANBA, ERI1, and MITF).

Datasets
Summary statistics from the latest ASD GWAS metaanalysis were obtained from the public repository available in the PGC website (http://www.med.unc.edu/pgc/resultsand-downloads). The following data set was employed: iPSYCH_PGC_ASD_Nov2017.gz (Grove et al. 3 ) which includes the meta-analysis of ASD by the Lundbeck Foundation Initiative for Integrative Psychiatric Research (iPSYCH) and the Psychiatric Genomics Consortium (PGC) released in November 2017. The data set comprises a total of 18,381 cases and 27,969 controls. Additional information about the genotyping, QC methods, and Ethics Commitees as well as informed consents employed in are available at the PGC website and in the previous Grove et al. 3 study.
Morpheus software (https://software.broadinstitute.org/ morpheus/) was used to display the Z scores of UTMOST significant genes across GTeX Brain tissues. UTMOST significance as a Z score in brain tissues is ∼4.6. Gray squares in the heatmap indicate that the gene weights were not available in the target tissue (Fig. 1).

DEG and gene expression analysis with FUMA
GENE2FUNC, a tool of FUMA 8 (https://fuma.ctglab. nl/), was employed to carry out a gene expression heatmap and an enrichment analysis of differentially brain expressed genes (DEG) using BrainSpan RNA-seq data. Those genes represented in Table 4 (bold: PTPRE, CIPC, NKX2-2, PINX1) were used as an input. Expression values are TPM (Transcripts Per Million) for GTEx v6 and RPKM (Read Per Kilobase per Million). In order to define DEG sets, two-sided Student's t-test were performed for these genes and per tissue against the different tissue types or developmental stages. Those genes with a p-value < 0.05 after Bonferroni correction and a log fold change ≥0.58 are defined as DEG. The direction of expression was considered. The −log10 (p-value) refers to the probability of the hypergeometric test DEG analysis was carried out creating differentially expressed genes for each expression data set (Fig. 2). Heatmaps display the normalized expression value (zero mean normalization of log2 transformed expression), and darker red means higher relative expression of that gene in each label, compared to a darker blue color in the same label (Fig. 3).

GeneMANIA and Metascape analysis
GeneMANIA 9 (https://genemania.org/) was used to build a gene network for the UTMOST-associated genes by the single tissue analysis (brain and gastrointestinal tissues) and by the joint tissue analysis (Table 4). Each gene-network was subsequently analyzed with Metascape (https://metascape.org/) 10 to carry out a pathway enrichment and a protein-protein interaction enrichment using the Evidence Counting (GPEC) prioritization tool. For each given gene list, pathway and process enrichment analysis has been carried out with the following ontology sources: GO biological processes, GO cellular components and GO molecular functions. The enrichment background includes all the genes in the genome. Terms with a p-value < 0.01, a minimum count of 3, and an enrichment factor >1.5 (the enrichment factor is the ratio between the observed counts and the counts expected by chance) are collected and grouped into clusters based on their membership similarities. More specifically, p-values Table 1 UTMOST Single tissue analysis (Brain tissues). GWAS genome-wide association study, SNP single nucleotide polymorphism, TWAS transcriptome-wide association study.
*Marginally significant. Fig. 1 Heatmap representing UTMOST significant genes across weight sets and brain areas. UTMOST significance as a Z score is 4.6. Gray squares indicate that gene weights were not available in the target tissue.
are calculated based on the accumulative hypergeometric distribution, and q-values are calculated using the Benjamini-Hochberg procedure to account for multiple testings. Kappa scores are used as the similarity metric when performing hierarchical clustering on the enriched terms, and sub-trees with a similarity of >0.3 are considered a cluster. We select the terms with the best p-values from each of the 20 clusters, with the constraint that there are no more than 15 terms per cluster and no more than 250 terms in total (Figs. [4][5][6]. The network is    Enhancer analysis (dbSUPER) dbSUPER (https://asntech.org/dbsuper/) was used to perform an exploratory enhancer analysis for the UTMOST-associated genes (single-tissue analysis) and for those tissues (brain and gastrointestinal) available at dbSUPER. The parameters selected were: SEs ranking method (H3K27ac), the peak calling was done with MACS (version 1.4.1) with parameters -p 1e-9,-keep-dup = auto, -w -S -space = 50, the stitching distance was established at 12,5 kb, the TSS exclusive zone was set at ±2 kb and the enhancer gene assignment was done within a 50 kb window.

UTMOST analyses and comparison with previous results
UTMOST single tissue analysis (Brain tissues) showed association of two loci, NKX2-2 and PTPRE, while other two loci, CIPC and PINX1, showed a marginal association according to Bonferroni threshold (Table 1, Supplementary.csv files). NKX2-2 was previously identified as an associated gene by two different GBA algorithms, MAGMA, and PASCAL 3,4 . The results obtained by UTMOST also serve to indicate the nucleus accumbens basal ganglia as the brain area in which these genes may be acting. Although the association of PTPRE was not obtained as such in previous analyses, the association of its neighboring gene, C8orf74, was noted in a previous study and one of the index SNPs in the latest ASD GWAS was located near PINX1 (rs10099100) ( Supplementary  Fig. 1a, b) 3,4 .
NKX2-2 was also marginally associated by UTMOST within non-brain tissue together with other genes. It should be noted that some genes are tissue-specific for gastric and intestinal tissues such as stomach, esophagus and colon (MANBA, ERI1, MITF, and NKX2-2). The association of NKX2-2 in colon is noteworthy because NKX2-2 was previously reported as an ASD risk gene and  now it is highlighted again by UTMOST in brain tissues. It seems that NKX2-2 together with BLK may play a role in ASD etiology but not only at the brain level since UTMOST cross-tissue analysis also found association for both genes (Tables 2, 3, Supplementary Figs. 1a, 2a,

b, Supplementary.csv files)
To evaluate the importance of each brain tissue in ASD etiology, a secondary analysis was performed using Table 5 Top 11 clusters with their representative enriched terms (one per cluster) for the associated genes (CIPC PINX1, NKX2-2, and PTPRE) (Single tissue analysis (Brain)) and their interactors. "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10. Table 6 Protein-protein interaction network and MCODE components identified in the gene lists for the associated genes (CIPC PINX1, NKX2-2, and PTPRE) (Single tissue analysis (Brain)) and their interactors.  "Count" is the number of genes in the user-provided lists with membership in the given ontology term. "%" is the percentage of all of the provided genes that are found in the given ontology term (only input genes with at least one ontology term annotation are included in the calculation). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.
Z scores values for each ASD-associated gene across GTeX brain tissues. The heatmap showed that there is a wide specificity of association in terms of Z score for each gene and brain tissue indicating the importance of conducting tissue specific analyses such as UTMOST (Fig. 1).

DEG analysis with GENE2FUNC tool (FUMA)
DEG Analysis with BrainSpan data (11 general developmental stages of brain samples and 29 different ages of brain samples) for the brain associated set of genes have not shown any significant result (Fig. 2a, b). However, CIPC have shown overexpression across every single developmental stage in comparison with the remaining ASD-associated genes (Fig. 3a, b).

GeneMANIA and Metascape analysis
GeneMANIA was used to find out the possible interactors with the associated genes (Table 4). We have proposed three different analyses. One based on the genes associated in brain tissues; another one based on Table 8 Protein-protein interaction network and MCODE components identified in the gene lists for the GI associated genes (and their interactors. the associated genes at a gastrointestinal level, given the associations pointed out by UTMOST and the previous implications of gastrointestinal abnormalities and symptoms in ASD and the lack of biological knowledge about them. The final analysis is focused on the genes identified by the cross-tissue analysis and their interactors. The general goal is to delineate the biological pathways underlying each group of genes and the differences between them. Gene network, enriched ontology clusters, and PPI interaction analysis for the brain-associated genes (CIPC, PINX1, NKX2-2, and PTPRE) by UTMOST Single tissue analysis and their interactors highlights different biological pathways mainly involved in telomere maintenance and transcription regulation (Tables 5, 6 and Fig. 4). However, associated genes in gastrointestinal tissue (NKX2-2, MANBA, ERI1, and MITF) and their interactors mainly regulate DNA transcription by RNA polymerase II and the fate of oligodendrocytes. This is an interesting finding given the possible involvement of these cells in the enteric nervous system in ASD (Tables 7, 8 and Fig. 5). Finally, NKX2-2 and BLK both associated in the crosstissue analysis, seem to point to very diverse biological pathways such as protein tyrosine kinase activity, B cell receptor complex, regulation of DNA-binding transcription factor activity, and neuron projection morphogenesis, among others (Tables 8-10 and Fig. 6).

Enhancer analysis (dbSUPER)
Given the tissue specificity given by UTMOST associations, we found interesting to perform an exploratory analysis of enhancers. According to the dbSUPER dabase only CIPC could work as a superenhancer in the brain middle hippocampus (SE_06106; chr14: 77562660-77607123, size: 44463 pb) ( Supplementary Fig. 3).
As far as we know, this is the first study that has employed the UTMOST framework combined with the summary statistics of the largest ASD meta-analysis. The main aim was to identify ASD tissue-specific genes in brain and/or other tissues. It should be noted at the outset that gene-level associations identified by UTMOST do not imply causality. However, looking at the regional plots for each loci associated by UTMOST, the results shown at brain and gastrointestinal level seems pretty consistent. Thus, UTMOST has served to identify new ASDassociated genes in brain tissues as PTPRE and CIPC. Furthermore, UTMOST has been useful to confirm and obtain information on the tissue in which other known ASD risk genes such as NKX2-2 and PINX1 may have functional relevance. Altogether, brain-associated genes seems to point to three brain areas: cortex, hippocampus and nucleo accumbens. Previous studies demonstrated the ASD-associated gene expression in brain cortex 11 . In addition, hippocampus underlie some of the featured social memory and cognitive behaviors both crucial aspects in ASD 12 . The nucleus accumbens is a key brain area also related with the social reward response in ASD 13 . It should be also noted some limitations of our study. UTMOST uses GTeX v6 data by default and it should be interesting to re-run the analysis once the tool will be updated. Thus, UTMOST could find novel associated genes when expression data from other relevant ASD brain tissues as the amygdala are included. Another limitation of our results is the small number of SNPs Table 9 Top 11 clusters with their representative enriched terms (one per cluster) fot the associated genes by UTMOST Cross Tissue Analysis (BLK and NKX2-2) and their predicted interactors.
considered by UTMOST to create the statistic in some genes (PTPRE in brain tissues and NKX2-2 in non-brain tissues analysis). However, we have been very conservative in establishing Bonferroni correction for all tissues in an analysis group. Thus, for example, we always chose the maximum number of genes tested in one of the brain tissues and applied this to calcule Bonferroni for the whole group of brain tissues as a whole.
In relation to the biological pathways in which UTMOST-associated genes are involved, our results open possible avenues for future genetic and functional studies. Thus, the functional role in ASD of CIPC, PINX1, NKX2-2, and PTPRE has not yet been characterized in detail. Metascape analysis have provided an insight revealing their involvement in telomerase maintenance and transcription regulation. Thus, PINX1 (PIN2 (TERF1) Interacting Telomerase Inhibitor 1) enhances TRF1 binding to telomeres and inhibits telomerase activity. It was proved that its silencing compromises telomere length maintenance in cancer cells 14 . In addition, it was recently found that children with ASD and sensory symptoms have shorter telomeres, compared to those children exhibiting a typical development 15 . CIPC (CLOCK interacting pacemaker) is a mammalian circadian clock protein. Recently, circadian rithms were pointed out as involved in brain development and they could underly ASD etiology due to its implication in behavioral processes. Circadian rythms are regulated through several transcription factors in different cellular types, fact that might be related with the GO terms associated with transcription regulation in this study 16 . Moreover, the result of CIPC as a predicted superenhancer highlights its possible functional repercussion. UTMOST found association of several genes in nonbrain tissues. However, we found really interesting to study those genes related to gastrointestinal tissues (NKX2-2, MANBA, ERI1, and MITF). There is a wellknown and established comorbidity among gastrointestinal symptoms and ASD 17 . These clinical associations suggest the implication of gastrointestinal populations of neuronal cells. A mutation in NLGN3 (R451C) has been recently identified in two ASD brothers with GI symptoms. Mice models have demonstrated that R451C alter the number of neuronal cells in the small intestine and impact fecal microbes 18 . These evidence suggest that the role of NKX2-2, MANBA, ERI1, and MITF in gastrointestinal tissue should be further studied.
BLK and NKX2-2 are both associated in the cross-tissue analysis. The importance of UTMOST approach is that is able to show the association of two previously known ASD risk genes but not restricted to brain tissue. These findings lead to a difficult question whether autism can be a multisystemic disorder, something that has been recently pointed out by some authors 19 .
In conclusion, UTMOST, a novel single and cross-tissue TWAS, has revealed new ASD-associated genes. These genes have been characterized at the pathway and gene network level using bioinformatic approaches. However, future tissue-specific functional studies will be key to properly determine their role in ASD etiology.