An integrative analysis of non-coding regulatory DNA variations associated with autism spectrum disorder

A number of genetic studies have identified rare protein-coding DNA variations associated with autism spectrum disorder (ASD), a neurodevelopmental disorder with significant genetic etiology and heterogeneity. In contrast, the contributions of functional, regulatory genetic variations that occur in the extensive non-protein-coding regions of the genome remain poorly understood. Here we developed a genome-wide analysis to identify the rare single nucleotide variants (SNVs) that occur in non-coding regions and determined the regulatory function and evolutionary conservation of these variants. Using publicly available datasets and computational predictions, we identified SNVs within putative regulatory regions in promoters, transcription factor binding sites, and microRNA genes and their target sites. Overall, we found that the regulatory variants in ASD cases were enriched in ASD-risk genes and genes involved in fetal neurodevelopment. As with previously reported coding mutations, we found an enrichment of the regulatory variants associated with dysregulation of neurodevelopmental and synaptic signaling pathways. Among these were several rare inherited SNVs found in the mature sequence of microRNAs predicted to affect the regulation of ASD-risk genes. We show a paternally inherited miR-873-5p variant with altered binding affinity for several risk-genes including NRXN2 and CNTNAP2 putatively overlay maternally inherited loss-of-function coding variations in NRXN1 and CNTNAP2 to likely increase the genetic liability in an idiopathic ASD case. Our analysis pipeline provides a new resource for identifying loss-of-function regulatory DNA variations that may contribute to the genetic etiology of complex disorders.


Introduction
Autism spectrum disorder (ASD) is a neurodevelopmental disorder with heterogeneous genetic origins. Recent genome sequencing studies have identified many risk genes from the loss-of-function protein-coding variants, which has driven a move toward analysis of convergent risk pathways [1][2][3][4]. However, less is known about the contribution to risk by the variants in non-coding regulatory regions, which hold the potential to disrupt the finely tuned biological pathways involved in brain development, as demonstrated by a recent study in the influence of the 3′ untranslated regions (3′UTR)-regulatory variants in language impairment [5]. Nevertheless, progress is being made by using whole-genome sequencing of the ASD families. In a recent landmark-autism study, non-coding de novo DNA variations were found to be enriched in the untranslated regions of genes, gene boundaries, and DNase I hypersensitive regions [6]. Similarly, a number of previous exomesequencing studies including our own [7] have shown that there are many DNA variations that occur in the non-coding regulatory regions of genes, albeit of unknown functional significance.
The functional impact of the non-coding variants has been difficult to interpret due to the lack of specific knowledge concerning their contribution to the regulatory regions that control and modulate gene transcription. Even in cases where a variant in a non-coding region is found to be associated with a disorder through linkage analyses, it is a challenge to assess whether such variations have gene regulatory functions, or whether the risk lies elsewhere in the linkage region [8]. The problem is compounded with the de novo mutations, which are unlikely to be represented in public databases, despite their significant contribution to ASD [6,7,[9][10][11]. Much of this difficulty arises from the lack of data integration and functional sequence analysis surrounding transcription control, the molecular interactions, and transactions, of which are poorly understood when compared to the classic set of rules (also known as genetic code) by which genetic information encodes the protein sequences. In this regard, the identification and contribution of the non-coding regulatory variants to ASDassociated biological pathways remains unknown.
The availability of large, genome-wide regulatory resources such as FANTOM5 [12], ENCODE [13], and the BrainSpan developmental brain tissue expression datasets [14] have been instrumental in developing modern-systemsbiology-approaches to evaluate the regulatory loss-offunction mutations associated with human disorders. Moreover, these novel approaches involve predicting whether a variant has the potential to change the dynamics of gene networks by disrupting the interactions between key regulatory molecules, such as miRNAs and transcription factors, and their target functional pathways [8,15,16]. There is now an increased focus on ASD-associated mutations within the regulatory regions, particularly near-known ASD-risk genes [17,18], with studies examining the enhancer regions [19] and microRNA targeting [5]. Together these data provide a vision of how we might assess the potential impact of non-coding regulatory DNA variations associated with ASD.
In this study, we developed a systems-based analysis to identify the non-coding regulatory variants using the previously published whole-exome sequencing (WES) data of the ASD families [7], and focusing on the microRNA genes and the gene-proximal cis-regulatory variants. Single nucleotide variants (SNVs) in the regulatory regions were filtered by rarity and functional score. We compiled a genome-wide resource of regulatory regions covering microRNA genes and their putative target sites, transcription start sites, and transcription-factor-binding sites (TFBS) in promoters, 5′ and 3′ UTRs, to evaluate the relationship of these regulatory SNVs to the synaptic pathways. We further characterized these regulatory mutations using ASD-related databases and functional network analyses to show that rare heterozygous loss-of-function coding and non-coding variations combine as the probable causal elements. We specifically highlight the functional significance of a SNV found in microRNA-873, which affects target binding affinity and regulation of the key ASD-risk genes.

Materials and methods
Annotation and filtering of whole-exome sequencing data The WES data used in this study was from our previous study of Australian ASD families (n = 128; 48 ASD cases, and 80 parents) [7], and we considered the single nucleotide variants (SNVs) for this analysis as there were no large copy number variations (CNVs) detected in this sample (data not shown). The participants were recruited from the Western Australian Autism Biological Registry at the Telethon Kids Institute in Perth, Australia [20] and the blood samples were collected and analyzed with the approval by the Ethics Committee's at the Princess Margaret Hospital for Children, University of Western Australia (1845/EP) and the University of Queensland (2012000269). Informed written consent was obtained from next of kin, caretakers, or guardians on behalf of the minors/children enrolled in our study. In this study, we have mapped our WES data (NimbleGen SeqCap version 3) to the hg19-reference genome with the BWA-MEM algorithm [21]. Duplicate reads were removed using Picard tools v2.60. Indel realignment and base quality score recalibration were performed using GATK v3.5 [22]. The germline SNVs and indels were called using the GATK HaplotypeCaller tool and only high quality SNVs were selected for further analysis (RMSMappingQuality, MQ > 40; QualByDepth, QD > 2). We further selected only SNVs at genomic positions covered by more than 20 reads in at least 90% of the 128 samples. Annotations were made using the ANNOVAR package (2017Jul16) [23] with additional databases (dbSNP150, gnomAD exome and genome collections, CADD version 1.3). The Genome Aggregation Database (gnomAD) was used to obtain minor allele frequencies (MAF) for all quality-passed SNVs (NFE or Non-Finnish European children were used as background population) [24].
As per the previous analysis of the coding variants in the same ASD cohort [7], we used a two-stage filtering process to identify the most interesting SNVs. First a Minor Allele Frequency filter (MAF < 0.01) was applied to select the rare variants, which were thought to have a larger effect in complex disorders such as ASD [7,25,26], then a 'functional' score (CADD score ≥ 15) [27] was used to prioritize the potential loss-of-function SNVs.

Computational analysis of the regulatory regions
We defined the regulatory SNV set as those situated in promoters, transcriptional start sites, enhancers, microRNA genes, putative microRNA regulatory elements (MREs) on the 3′UTRs, and TFBS on the promoters, 5′UTRs and 3′ UTRs of the protein-coding genes.
Promoter regions were defined using FANTOM5 phase 1 TSS (transcription start site) CAGE data (Cap Analysis of Gene Expression) as regions 1000 bp upstream and 300 bp downstream of the annotated CAGE peaks for Human Genome Nomenclature Consortium (HGNC) gene records [12]. The enhancer region coordinates were downloaded from the FANTOM5 phase 1 database [28]. Genomic annotations of the non-coding regions are from the University of California Santa Cruz (UCSC) 'knownGene' annotation (downloaded 21 February 2013) on the hg19human reference genome including all transcript variants.
Regions of motifs with low information density were trimmed out with TAMO (trimming threshold 0.2) [41]. To estimate the effects of the microRNA gene variants in terms of loss or gain of targets, we used miRanda [43] (default parameters, free energy < = −18) to predict the target sites for microRNA sequences incorporating their respective variants across all known 3′UTR sequences available in the TargetScan database v7.0 [46].
We also calculated enrichment of the regulatory SNVs (via hypergeometric distribution one-tailed test in R) in coexpression modules constructed and characterized by Parikshak et al. [2].
Biotinylated microRNA-mRNA pulldown transcriptome analysis Biotinylated microRNA-mRNA pulldown experiments were used to identify bound mRNA transcripts of the miR-873-5p wild-type (WT) and the variant miR-873-5p (Mut). Briefly, biotinylated microRNAs were transfected into the undifferentiated SH-SY5Y cells (a human neuroblastoma cell line), and captured along with their targeted transcripts to magnetic streptavidin beads (Invitrogen Dynabeads M-280).
The pulldown protocol was primarily drawn from Wani and Cloonan [49], as well as Cristino and Barchuk [50], with the following modifications. The samples were transfected in a 75 cm [2] flask with 560 pmol of biotinylated microRNA with lipofectamine 2000 (Thermo Fisher), and grown for 24 h before pulldown. Four biological replicates, from four independent transfection and pulldown experiments, were made for each microRNA (WT and Mut). RNA was purified with RNAeasy columns (QiAGEN), and sequencing libraries were prepared from 100-ng input RNA with Illumina TruSeq Stranded mRNA kit. Libraries were sequenced across two lanes using v4 SBS chemistry on an Illumina HiSeq2000. Cells were grown in DMEM F12 with HEPES media with penicillin/streptomycin at 37°C with 5% CO 2 . The synthetic microRNA duplexes with biotin tags on the 3′ end of the mature microRNA were ordered from IDT (Supplementary Table 1).
The efficiency of transfection was checked quantitatively through staining. Then cells were fixed in 4% PFA, rinsed with PBS, and blocked with 2% BSA in 0.3% Triton X-100 (TX) for 30 min at room temperature for 2 h. The cells were incubated with Alexa Flour 594 streptavidin (Illumina) in BSA/PBS/TX mix for 1 h at room temperature on rotation, followed by 10 min of DAPI incubation.

Annotation, and filtering the genetic variants in the regulatory regions
In this study, we have used the previously published WES data [7] to investigate putative regulatory genetic variation in 128 individual genomes from an Australian ASD cohort (48 affected cases and 80 parents). To examine the captured non-coding regions, we developed a computational pipeline to systematically analyze WES data in non-coding regulatory contexts: proximal promoter regions, transcription start site (TSS), untranslated regions (5′ and 3′UTRs), TFBS, MRE, and miRNA genes (Fig. 1).
We assessed the WES coverage across the different noncoding regions (Supplementary Figure 1). There was sufficient read depth for the call-variants across 20-24% of all predicted TFBS and MREs in the untranslated regions (Table 1). Fortuitously, despite the limitations of exome capture design in the non-coding regions, the actual sequence coverage is highest in the regions proximal to exome-capture gene targets, where the most potentially interesting cis-regulatory elements would be expected (Supplementary Figure 1). The MiRNA genes were wellcovered (48%), and although the promoters had less callable coverage (13%), they still yielded valuable data ( Table 1). The enhancers had very low callable coverage (1%) due to their distance from the captured exonic regions, and were omitted from this analysis (Table 1).
After filtering the SNVs for rare population frequency (Minor Allele Frequency ≤1% based on gnomAD exome and genome collections) and functional annotation (CADD score ≥15), each individual case had an average of 288 protein-coding and 299 regulatory variations (Supplementary Figure 2; Supplementary Table 2). Greater numbers of the SNVs were identified in protein-coding regions and TFBSs in the promoter regions, followed by UTR TFBSs and 3′UTR MREs, reflecting the relative sizes of these regions ( Table 1).
The genome-wide annotation of regulatory regions, and TFBS PWM (Position Weight Matrix) definitions in TAMO format [41]

Association of the rare variants with the ASD-risk genes
We used a standardized score (Z-score) of the binomial distribution test to assess whether the genes with rare   Table 3). Only genes found in the Human Genome Nomenclature Consortium (HGNC) database and the whole proteinprotein interaction network [30] were considered as the background datasets used for Z-score calculations (Supplementary Table 3) protein-coding and non-coding regulatory SNVs found in our ASD cohort were over-represented in the ASDassociated genes. In a previous study [7], we have shown that the rare protein-coding SNVs identified in our ASD cohort were significantly enriched in ASD-associated pathways (AXAS-ASD) [30]. Herein, we have modified our pipeline to identify the putative loss-of-function SNVs in both protein-coding and non-coding regions, and also to reduce filtering steps biases (MAF < 0.01 and CADD score ≥15; Supplementary Table 2). Strikingly, the regulatory SNVs identified in ASD cases were significantly enriched in the ASD-associated datasets (AXAS-ASD, SFARI, M13-Li2014, and GWAS-ASD; Fig. 2 and Supplementary Table 3), but not in the control datasets-coronary artery disease (GWAS-CAD) and Crohn's disease (GWAS-Crohn; Fig. 2 and Supplementary Table 3).
Interestingly, the SNVs identified in the protein-coding regions showed the highest enrichment in SFARI gene list, while the SNVs in the non-coding regulatory regions are highly enriched in the AXAS-ASD-gene network ( Fig. 2; Supplementary Table 3).
Having seen an overall ASD association, we then examined whether the regulatory mutations were overrepresented in particular pathways or regions during neurodevelopment. Parikshak et al. [2] have previously constructed and characterized a set of co-expressed gene modules with respect to ASD. We saw significant enrichment of genes having regulatory mutations in ASD cases against two modules associated with regulation of gene expression (M3) and early synaptic development (M16; Supplementary Table 4), at late fetal and early-postnatal timescales [2]. showing the variant in individual f34-s1, relative to the conserved seed motif generated from predicted miR-873-5p regulatory elements with MEME [45]. b Venn diagram showing the overlap between genes significantly enriched in the miR-875-5p wild-type (WT) and mutant (Mut)-pulldown assays. c Percentage of the SFARI genes in genes specifically enriched in the miR-875-5p wild-type (WT) (n = 905) or mutant (n = 925) pulldownsgenes enriched in both pulldowns are not included. P-value calculated with chi-squared test between the groups, p-value = 0.044, chi-squared statistic = 4.06, and degrees of freedom = 1. Error bars represent 95% confidence interval. Only genes present in both pulldown assays at high enough levels to have been contrasted (p-value calculated) by DEseq2 in both WT and Mutant experiments were included, omitting three genes that were significantly enriched in Mut but uncertain in WT. d Rare protein-coding and regulatory variants found in case f34-s1 overlap the AXAS-ASD network [30] and the miR-873-5p target genes (shown as diamonds, see Supplementary Table 6 for more details). The overlapped network shown is enriched in functional pathways associated with synaptic vesicle cycle and synaptic transmission (Supplementary Table 7). e Family f34 case study showing direct interaction of several synaptic proteins in the neurexin-neuroligin axis, in the context of the maternally inherited putative loss-of-function NRXN1 coding and the paternally inherited miR-873-5p variants in individual f34-s1. Putative target genes of miR-873-5p are shown in blue. (*) SFARI genes Effect of the rare microRNA variants in ASDassociated gene networks We identified a total of four microRNAs with rare SNVs within their mature sequences, which could globally disrupt their mRNA binding specificity (Supplementary Table 5). Of these, one variant was in the conserved "seed" regions of the microRNAs, where there is greater potential for disruption due to the exact target complementarity required in canonical microRNA-mRNA binding (Supplementary Table 5). We hypothesize that miRNA mutations within the mature sequences will affect the microRNA-mRNAbinding affinity and change the regulation of several target genes. We found miR-873 gene had a SNV within the seed region of the mature miR-873-5p (Fig. 3a) sequence. Computational predictions of miRNA binding suggest that the mutant miR-873 loses 76% of the predicted target genes, while also gaining 74% potential new target genes. This is the greatest predicted loss-or gain-of-binding of any of the miRNA variants (Supplementary Table 5). In contrast, the impact of variants that lie in less-conserved positions outside the seed regions (e.g., miR-411-5p, miR-668-3p and miR-323b-5p; Supplementary Table 5) would be less disruptive.
We propose that this rare miR-873-5p seed mutation could be relevant to ASD-associated genes and pathways, due to miR-873's expression and genomic context. The miR-873 gene is embedded in the intronic region of the LINGO2 gene (second intron of LINGO2 variant 2). Examples of LINGO2 copy number variation (CNV) have been seen previously in ASD cases [62][63][64][65], including two instances where CNV knocks out the 5′ region of LINGO2 variant 2 including the miR-873 gene [66,67]. Furthermore, LINGO2 has been shown to exhibit a neural-specific expression pattern in early development in mice [68]. Although the host gene expression is not an absolute indicator of embedded microRNA context, it may be concordant, and miR-873-5p mature sequence itself has neuronal expression in the human brain (Supplementary Figure 3) [69].
Biotinylated microRNA-mRNA pulldown experiments were performed to identify the transcriptome-wide binding profiles of wild-type and mutant miR-873-5p (Fig. 3b) with 3226 and 3249 significantly enriched targeted genes identified respectively (Supplementary Table 6). By contrasting the relative pulldown assay enrichment of the wild-type and mutant miR-873-5p, we identified 2321 overlapping genes putatively targeted by both wild-type and mutant miR-873-5p, while 905 and 928 genes were specifically enriched in the wild-type and mutant microRNA-mRNA pulldown, respectively (Fig. 3b).

Convergence of the rare regulatory and the proteincoding variants in synaptic pathways
We chose to follow-up on the effect of the SNV within miR-873-5p found in individual f34-s1 due to its location in the seed region (Fig. 3a), the microRNA's robustly expressed profiles across all regions of the human brain (Supplementary Figure 3), and functional enrichment analysis of its predicted target genes pointing toward a role in the regulation of nervous system development (Fisher's exact test with Bonferroni step down corrected P-value = 0.002) and signaling (Fisher's exact test with Bonferroni step down corrected P = 0.0003) [60].
However, the miR-873-5p SNV is heterozygous and inherited from an unaffected father, so although not causative on its own, we hypothesized that it could contribute additively through interactions with other small-effect mutations inherited from parents, as well as de novo mutations. In our published whole-exome screen, the family members were assessed for a 'broader autism phenotype' (BAP), reflecting a subclinical expression of ASD phenotypes [7,72]. Among parents of this proband, the mother was recorded as having a BAP [7]. In this study, we focused on the maternally inherited NRXN1 variant, as it is one of the most highly connected proteins in the case f34-s1-gene network (Fig. 3d) and a key ASD-risk gene involved in synaptic transmission [70]. Notably, in case f34-s1, few variant genes overlapped in both the AXAS-ASD and SFARI datasets and are well known players in the synaptic pathways (i.e., NRXN1, CADPS2, CNTNAP2) and regulation of gene expression (i.e., CTNNB1, FOXP1) previously associated with neurological disorders including autism [71,73]. Three of these gene variants (NRXN1, CNTNAP2, and FOXP1) are inherited from the BAP mother, while two are de novo mutations (CADPS2 and CTNNB1). Nevertheless, there will be additional contribution of several other small effects inherited and de novo variations in both proteincoding and regulatory regions of genes involved in the functional pathways associated with synaptic vesicle cycle and transmission (Fisher's exact test with Bonferroni step down corrected P = 0.001 [60]; Supplementary Table 7).
The rare variation in NRXN1 genes (G ≥ A mutation at chr2:50847195; rs78540316) results in a Pro429Ser structural change in LNS domain 2 of NRXN1α (UniProtKB -Q9ULB1 [74]). The extracellular domain structure of NRXN1α and other alpha neurexins is characterized by the linear assembly of six LNS domains, with the L6 LNS domain extended to form an "L" shape; this assembly has been shown to be important for their function as synaptic organizers [61]. Continuous electron density of bovine NRXN1α (PDB: 3QCW) is visible for L2-L6 LNS domains showing a concatenated arrangement of these domains mediated by β11-β12 loops, which play an important role in the interaction between LNS domains [61] (Fig. 4a). The packing of this loop in the interface of L2-L3 LNS domain involves insertion of Pro436 into a hydrophobic pocket of NRXN1α (Fig. 4b). Pro436 is highly conserved (Supplementary Figure 4), supporting its important role for the structural arrangement of NRXN1α. Analysis of the putative effects of replacing this conserved proline residue with serine on the stability of NRXN1α was undertaken with the FoldX force field, which has been shown to accurately predict the effects of mutations on the free energy of unfolding of proteins [75]. This revealed that the mutation is moderately (1.81 kcal/mol) to severely (3.18 kcal/mol) destabilizing, depending on whether Ser436 can hydrogen bond to the disordered lysine residue (Lys336), due to the loss of these stabilizing hydrophobic interactions (Fig. 4c). Given the importance of Pro436 to the correct packing of the L2-L3 LNS domains, the mutation of this conserved proline residue to serine (equivalent to Pro429Ser mutation in the human NRXN1) will most likely result in loss-ofstability of NRXN1α and disruption of the L2-L3-LNS domains interface. Thus, the predicted destabilizing effects of the Pro429Ser mutation provides a plausible structural change that will affect the structural integrity and thus the function of NRXN1α.
Considering this structural evidence, we propose that putative loss of NRXN1α protein function may additively contribute to mutant miR-873-5p dysregulation of the NRXN2 expression (Fig. 3e) and increase the threshold of genetic liability regarding their function in pre-synaptic transmission and synapse development in case f34-s1. Fig. 4 Structural modeling of the Pro429Ser NRXN1α mutation using the crystal structure of bovine NRXN1α (PDB: 3QCW). a The β11-β12 loops comprise of a large part of the L2-L3-LNS domain interfaces in NRXN1α. In the L2 LNS domain, the loop is characteristically bent in a horseshoe type conformation, with two proline residues Pro433 and Pro436 (magenta: equivalent to human Pro426 and Pro429) stabilizing the tightly kinked structure. b The packing of this loop in the interface involves insertion of Pro436 into a hydrophobic pocket of NRXN1α. c The Pro436Ser replacement (equivalent to Pro429Ser in humans) analyzed with the FoldX force field revealed this mutation is moderately (1.81 kcal/mol) to severely (3.18 kcal/mol) destabilizing, depending on whether Ser436 can hydrogen bond to a disordered Lys336 Moreover, there may be further impact at the synapse, as the miR-873-5p-pulldown assay indicated the capability to bind a number of other ASD-risk genes. MiR-873-5p binds to key genes for synapse formation and function including all three SHANK family members (SHANK1, SHANK2 and SHANK3), three Neuroligin family members (NLGN2, NLGN3 and NLGN4X), Discs Large MAGUK Scaffold Protein 4 (DLG4 also known as PSD95), and two DLG associated proteins (DLGAP3 and DLGAP4), which have been linked to ASD by multiple studies [47,76,77] (Fig. 3e, Supplementary Table 6).

Discussion
Our analyses provide a novel and comprehensive genomewide approach for identification and characterization of putative regulatory variations in an exome-sequenced-ASD cohort. By combining our computational TFBS and MRE predictions with cis-regulatory elements resources, we showed there was an overall cohort-level enrichment of the regulatory variants within the ASD-associated genes, derived from several gold-standard ASD datasets (Fig. 2) [30,47,48]. We specifically focused on a rare mutation observed in the miR-873 gene, and characterized its potential impact in the regulation of the synaptic genes.
ASDs have complex and heterogeneous genetic causes, and may arise from combinations of inherited and de novo variation in the protein-coding [4] and non-coding regions [78,79]. To investigate a possible additive contribution of the regulatory SNVs, it therefore makes sense to focus on the clusters of variants, both regulatory and protein-coding, that are in the functional neighborhood of known ASDassociated gene networks and functional pathways [30]. Herein, we identified the rare loss-of-function variants in the regulatory regions potentially disrupting the expression of the "peripheral" genes incurring in non-zero effects on regulation of the "core" genes with more direct and specific roles in disease etiology. Our findings support the omnigenic model of complex disease [80], which proposes that any regulatory variation in the genes expressed in diseaserelevant cell-types can contribute to disease etiology with non-linear effects on risk for that disease/disorder.
There have been several studies that have identified the functional gene modules involved in ASD-associated biological processes [2,3,48]. Two of the genome-scale coexpression modules constructed by Parikshak et al. [2] were enriched in our regulatory variant set; notably one of the modules was the same one (M16) that they had highlighted for the potential involvement in ASD. It is enriched for ASD relevant GO terms such as "synaptic transmission" and "homophilic cellular adhesion" and more specifically expressed in later fetal development and early infancy [2].
The hypothesis of Parikshak et al. [2], that M16 might be targeted by lower-risk inherited variants, in contrast to other modules involved in more fundamental early developmental processes, is supported by enrichment of our dataset of regulatory inherited mutations.
Our study identified four variants found within the miRNA genes that were predicted to disrupt gene regulation. Notably, we identified a rare variation in the seed region (the major factor in binding specificity) of miR-873-5p and functionally verified the changes in binding affinity of the miR-873-5p variant. MiR-873 is of particular interest as it is located within a neural gene LINGO2, which has been investigated with respect to essential tremor and Parkinson's disease [81]. There are also a number of reported CNVs of LINGO2 found in ASD cases [62,63,64,65], including one which overlays miR-873 [66,67]. Furthermore, miR-873 promoter expression has also been found to be significantly enriched in human brain cells and tissues (Fantom5 miRNA Atlas; P-value = 2e-40) [82]. It is still unknown whether these particular rare SNVs on miR-873-5p (rs777143952; MAF < 6.6e-05) will be statistically overrepresented in ASD cases, and only very large wholegenome sequencing (WGS) studies would have the necessary power to assign any statistical association to a specific rare SNV. However, the relevance of the rare variants outside protein-coding regions, for human complex traits and phenotypic variance, has only started to be explored in developmental disorders by recent studies using WGS data from the large samples sets of European ancestry [78,79].
The binding profile of miR-873-5p measured by the pulldown assay includes several ASD-risk genes-all three members of the SHANK family (SHANK1, SHANK2, and SHANK3), several neuroligins (NLGN2, NLGN3, and NLGN4X), and neurexins (NRXN2 and CNTNAP2). This targeting information, combined with the neural expression of miR-873-5p and its host gene LINGO2, strongly suggest a potential regulatory role in synapse development and function. We also found that the miR-873-5p variant binds to the transcription factor FOXP1 transcript, suggesting a dysregulation of FOXP1 expression, which has previously been shown to play a critical regulatory role in CNTNAP2 expression [83], as well as a functional role of striatal pathways important for vocal communication [84]. Despite this miRNA mutation being inherited from an unaffected parent, the measurable loss-of-binding in several ASD-risk genes leads us to postulate that it may contribute additively to ASD risk, particularly when combined with the mutation load from the mother with recorded BAP. Consistent with this idea, is the convergence of functional disruption of neurexins in this individual-specifically an observed partial change in NRXN2 and CNTNAP2 binding affinity from the paternally inherited miR-873-5p mutation when combined with the maternally inherited NRXN1 and CNTNAP2 coding variants. Moreover, this ASD case (f34-s1) carry two additional de novo protein-coding variations in CADPS2 gene that are involved in neurotrophin release and interaction with dopamine receptor type 2 [85], and CTNNB1 gene associated with impaired social interactions and repetitive behaviors [86]. Considering the above evidence concerning pathways and processes associated with ASD (e.g., synapse development), this work showcases the importance of compounded effects of protein-coding and regulatory mutations in a biological systems framework, particularly in the case of de novo mutations, which contribute significantly to ASD risk [10].
We are aware of the limitations of using a human neuroblastoma cell line (SH-SY5Y), in which the transcript levels and alternative splicing isoforms might not necessarily capture the same molecular interactions as CNS neurons, and we are mindful that spatial and temporal gene expressions patterns of various brain cells and tissues have been shown to also vary. Nevertheless, our study is the first to propose an integrated approach to predict and validate a mechanistic hypothesis for a possible causal effect of a single nucleotide variation changing the binding affinity between miRNA and target transcripts. Hence, our approach still represents a significant and novel contribution to systematically evaluate the (under the same SH-SY5Y transcriptome background) loss or gain of affinity effects of the genetic variants in regulatory regions, such as miRNA genes.
We are also mindful that our analyses are focused on rare loss-of-function heterozygous DNA variations that are represented by a minor allele frequency of < 1% in the human population. We are similarly aware of the ongoing debate as to the relative contribution of the rare inherited alleles to ASD and that there exists an uncertainty associated with current estimates [87]. Much of this uncertainty surrounds the contribution of loss-of-function variations found in non-coding DNA that comprises majority of the genome. Arguably, in the absence of detailed wholegenome analysis, estimates concerning the contribution of the rare inherited non-coding variants to ASD remain speculative. The whole-genome sequencing approaches when combined with rigorous clinical and behavioral profiling of family cohorts and functional characterization of DNA variations are now providing the means to further identify and evaluate rare and de novo non-coding regulatory variations that may affect brain development [4,6,88,89], yet we see that even exome-based data can yield useful information. A major confound regarding estimates of heritability surrounds the impact of epigenetic processes in the penetrance of heterozygous alleles. Although it is difficult to the access neural tissue, there are attempts to interpret global epigenetic events to identify patterns of methylation and other DNA modifying chemistries, and assess the regulatory significance of non-coding DNA variations occurring at these sites. Interestingly, Yuen et al. (2016) recently reported that de novo regulatory variations associated with DNMT3A and ADNP are likely to more directly affect the global methylation processes. With regard to the epigenetic mechanisms, any new data resource detailing non-coding sequences that control gene regulation or the stability of transposons and repetitive DNA that affect transcription, can logically be incorporated into this analysis pipeline to more completely account for other regulatory layer of DNA variations associated with ASD.
Public genomic resources are constantly improving, larger populations of SNVs will aid the identification of the rare variants and projects such as psychENCODE [90] and CommonMind consortium [91], aimed at generating detailed datasets of regulation in the brain, can only improve interpretations of the variants in the regulatory regions. The approach described here is equally applicable to the whole-genome sequencing data and to the evaluation of the variants identified in the genome-wide association studies; moreover it provides a generalized template for analysis of complex genetic disorders. When combined, more ASD-cases-sequence data, new genomic resources, and integrated systems-based approaches that examine the regulatory variants in ASD-associated genes and pathways, as identified in this study, should elucidate the role of the non-coding regulatory variants and their additive contribution to the risk of ASDs.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/.