Introduction

The search for genetic factors predisposing to disease has traditionally focused on the study of protein-coding sequences. Nevertheless, increasing evidence indicates that genetic variation in regulatory regions could be a major contributor to phenotypic diversity in human populations.1 In the case of psychiatric disorders, changes in regulatory elements leading to small variations in the dosage of proteins involved in neuronal pathways may have an important role in fine-tuning complex brain functions, and may contribute to the development of these disorders. Recently, microRNAs (miRNAs) have emerged as important genomic regulators with a key role in the developing and adult nervous system, contributing to the correct calibration of neuronal gene expression.2

miRNAs are a large class of single-stranded small noncoding RNAs, 19–25 nucleotides long in their mature form, that act as posttranscriptional regulators of gene expression through either mRNA degradation or translational repression.3 The recognition of target mRNAs is mediated by sequence complementarity between the miRNA and its target mRNA. The most critical region for target recognition, however, consists of nucleotides 2–7 of the miRNA sequence, known as the seed region.4 miRNAs themselves are the final product of a multistep maturation process that starts with the generation of a transcript, the primary miRNA (pri-miRNA), that hosts one or more miRNA precursors with a characteristic hairpin structure. Most pri-miRNAs are transcribed by RNA polymerase II and undergo capping, splicing and polyadenylation like regular mRNAs. miRNA genes can be either intergenic or located within protein-coding host genes, usually in introns, and can be processed from the mRNAs of their host genes.5, 6 Since the discovery of the first two miRNAs, lin-4 and let-7, in Caenorhabditis elegans,7 hundreds of miRNAs in animals, plants and viruses have been identified and annotated in the miRBase sequence database (http://microrna.sanger.ac.uk/). Recent estimates indicate that miRNAs regulate at least 30% of all protein-coding genes, building complex regulatory networks that control almost every cellular process.3 In fact, deregulation of miRNA regulatory pathways has already been implicated in human disorders such as cancer and fragile X syndrome.8, 9

Single-nucleotide polymorphisms (SNPs) located within miRNA target sites have been shown to affect the expression of the target gene and contribute to susceptibility to human diseases.10 Although many reports have corroborated the link between sequence variants in miRNA binding sites of target genes and complex diseases and phenotypes,11, 12, 13, 14, 15 so far only one common functional variant in a miRNA gene has been associated with disease: a C/G polymorphism (rs2910164) located in hsa-mir-146a, which has recently been found to contribute to a genetic predisposition to papillary thyroid carcinoma.16 Indeed, allelic changes and genomic variants involving either miRNAs or their regulatory machinery may be important sources of phenotypic variation and contribute to susceptibility to complex disorders. Although little explored until now, association studies using SNPs in miRNA genomic regions might help to evaluate the involvement of miRNAs in disease. With this aim in mind, we analyzed the genomic distribution and genetic variation of miRNA-containing regions and constructed a panel of SNPs suitable for the study of complex disorders.

Materials and methods

Analysis of the genomic organization of miRNAs

The sequences and genomic coordinates of human miRNAs (miRBase, release 7.1 and miRBase, release 13.0) were obtained from the miRNA registry (http://microrna.sanger.ac.uk). Genomic locations and human genome annotations were obtained from the UCSC human genome browser (March 2006 assembly, build 36, hg18).

Pathways analysis

Enrichment in biological functions, canonical pathways and molecular networks for miRNA host genes was analyzed using the Ingenuity Pathway Analysis (IPA) software, version 6.3 (www.ingenuity.com), and the statistical significance of associations was calculated using the right-tailed Fisher's exact test.
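
As an illustration of the statistic named above, a right-tailed Fisher's exact test for enrichment reduces to an upper-tail hypergeometric probability. The sketch below is not the IPA implementation; the function name and its arguments are hypothetical and serve only to show the calculation.

```python
from math import comb


def fisher_right_tail(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the reference set (hypothetical background)
    K: reference genes annotated to the category of interest
    n: genes in the tested list (e.g. miRNA host genes)
    k: tested genes that carry the annotation
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible tables contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

A small P-value indicates that the annotation is over-represented in the tested list relative to the chosen background; the result therefore depends on how that background is defined.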

SNP selection

For the selection of tagged-SNPs, we used the HapMap project data set (HapMap Data Rel 19/phase II October 2005, on NCBI B34 assembly, dbSNP 125), using the genotypes of the 60 individuals from the 30 CEPH trios of European descent (http://www.hapmap.org). Only SNPs with a minor allele frequency (MAF) higher than 5% were considered for further analysis. Bins of common SNPs in strong linkage disequilibrium (LD), defined by an r² value higher than 0.80, were identified within this data set using Haploview v3.32 software17 and the ‘LD Select’ method to process HapMap genotype dump format data corresponding to the selected regions. A total of 710 tagged-SNPs were defined using the Tagger implementation in Haploview. To saturate the miRNA regions, 58 additional SNPs were selected from dbSNP (dbSNP 125) or Perlegen (http://genome.perlegen.com/) because of their location either within miRNA sequences or within 2 kb of miRNA regions, with no restrictions on MAF or validation status (Supplementary Table 1).
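
The binning step described above can be sketched as a greedy procedure: discard SNPs at or below the MAF cut-off, link every pair whose r² exceeds the threshold, and repeatedly pick as tag the SNP that covers the most still-untagged SNPs. This is a simplified illustration, not the Haploview Tagger code. It assumes unphased genotypes coded as 0, 1 or 2 copies of one allele and uses the squared genotype correlation as a stand-in for haplotype-based r²; all function names are hypothetical.

```python
from itertools import combinations


def maf(genotypes):
    """Minor allele frequency from genotypes coded 0, 1 or 2."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def greedy_tag(snps, min_maf=0.05, r2_threshold=0.80):
    """Return {tag SNP: sorted list of SNPs in its bin}."""
    # Keep only common SNPs (MAF above the cut-off).
    common = {name: g for name, g in snps.items() if maf(g) > min_maf}
    # Every SNP is linked to itself and to each SNP in strong LD with it.
    linked = {name: {name} for name in common}
    for a, b in combinations(common, 2):
        if r_squared(common[a], common[b]) > r2_threshold:
            linked[a].add(b)
            linked[b].add(a)
    bins, untagged = {}, set(common)
    while untagged:
        # Choose the SNP covering most untagged SNPs; ties break by name.
        tag = max(sorted(untagged), key=lambda s: len(linked[s] & untagged))
        bins[tag] = sorted(linked[tag] & untagged)
        untagged -= linked[tag]
    return bins
```

Each SNP ends up in exactly one bin, so genotyping only the dictionary keys captures the remaining common SNPs at the chosen r² threshold.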

DNA samples

DNA samples were obtained from 340 healthy blood donors recruited from the Blood and Tissue Bank of the Catalan Health Service; all were of Spanish origin (Catalonia, in the northeast of Spain) and gave informed consent. Genomic DNA was isolated from peripheral blood lymphocytes using automatic DNA extraction and standard protocols.

Population admixture

To detect population admixture in our control sample, a structured association method was used to further test each sample set for stratification between cases and controls, as previously described.18 No allelic differences among the individuals from the Spanish population were observed and the highest log likelihood scores were obtained when the number of populations was set to 1.

Genotyping of the miRNA SNP panel

The selected 768 SNPs were genotyped using the GoldenGate assay on an Illumina BeadStation 500G (Illumina, San Diego, CA, USA) in accordance with the manufacturer's standard recommendations. This technology is based on allele-specific primer extension and highly multiplexed PCR with universal primers, as reviewed by Syvanen.19 Allele calling was performed using the BeadStudio program (Illumina Inc). A total of 19 HapMap DNA samples, comprising 6 trios and 1 duplicated DNA sample, were genotyped to aid clustering and to serve as a control of the genotyping process. The genotyped controls included 340 individual samples and 2 duplicated DNA samples. All SNPs were examined for standard quality control after genotyping; this evaluation resulted in the elimination of a total of 54 SNPs, of which 31 were excluded because of low signal and 23 because of poor clustering. These exclusions yielded a final cleaned data set of 714 successfully typed SNPs (92.97%). Genotypes for the nonexcluded SNPs were consistent with Hardy–Weinberg equilibrium (HWE), except for two SNPs that were eliminated (Supplementary Table 1). Both genotype concordance and correct Mendelian inheritance were verified; one sample was eliminated because of sex inconsistencies at several SNPs on chromosome X.
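
The text does not state which HWE test was applied. A minimal sketch, assuming a χ² goodness-of-fit test with one degree of freedom on the three genotype counts of a biallelic SNP, is given below; the function name is hypothetical.

```python
from math import erfc, sqrt


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one SNP.

    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value
```

For SNPs with rare genotype classes an exact test is usually preferred over this approximation.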

Analytical methods

Minor allele frequencies were estimated for the genotyped Spanish subjects and compared with those reported for the different HapMap populations (on the basis of 60 European (CEU), 60 Chinese (CHB), 60 Japanese (JPT) and 60 Yoruba (YRI) individuals) using Pearson's χ²-test. Pearson's correlation coefficient, R², was used to measure correlations in allele frequencies between samples, taking into account the sample sizes. A one-sample t-test was also used to test whether the allele frequencies of the sampled Spanish subjects were equal to those published by HapMap. An adjusted P-value threshold of 0.0000712 was used on the basis of 702 independent loci, according to Bonferroni's correction for multiple testing.
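
The adjusted threshold follows from dividing a nominal significance level of 0.05 by the 702 loci tested. The sketch below reproduces that arithmetic and shows one way a per-SNP comparison could be run as a Pearson χ² test on a 2 × 2 table of allele counts. The function names are hypothetical, and no continuity correction is applied, which is an assumption because the text does not specify one.

```python
from math import erfc, sqrt


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni's correction."""
    return alpha / n_tests


def allele_chi2(minor_a, total_a, minor_b, total_b):
    """Pearson chi-square (1 df) comparing allele counts of two samples.

    minor_x: minor-allele count in sample x
    total_x: number of chromosomes typed in sample x (2 per individual)
    Returns (chi2, p_value).
    """
    table = [[minor_a, total_a - minor_a], [minor_b, total_b - minor_b]]
    n = total_a + total_b
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, erfc(sqrt(chi2 / 2))


threshold = bonferroni_threshold(0.05, 702)  # about 7.12e-05
```

A SNP would be flagged as differing between two samples only when its P-value falls below `threshold`.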

Results

Genomic distribution of the whole collection of miRNAs

To select the miRNA genomic regions to be covered by the SNP panel, we first studied the genomic distribution of 325 human miRNA genes (miRBase, release 7.1) with regard to their aggregation in clusters, as well as their location in relation to other transcriptional units. The analysis of miRNA distribution within chromosomes showed that miRNAs have a strong tendency to aggregate, with 111 miRNAs (34%) being located at distances of <1 kb from other miRNAs, and more than half of the miRNAs (169 miRNAs) being <4 kb apart from other miRNAs (Figure 1a). Taking these observations into account, we defined miRNA clusters as genomic regions containing at least two contiguous miRNAs separated by <4 kb. In addition, overriding this criterion, we considered that a miRNA located within 7 kb of an already assigned miRNA cluster also belonged to that cluster (no miRNAs were found at distances from 7 to 10 kb). Finally, we considered that two miRNAs belonged to the same cluster if they were located in the same transcriptional unit, such as the same gene, independently of the distance criteria. Following these criteria, 60% (194 out of 325) of miRNAs were organized into 48 clusters spanning 405 kb of genomic DNA. Conversely, 40% (131 out of 325) were isolated (Figure 1b, Supplementary Table 2). Although the median number of miRNAs per cluster is two, some clusters contain a large number of miRNAs; a remarkable case is that of two large clusters on chromosomes 14 and 19 containing 24 and 43 miRNAs, respectively (Table 1).
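
The two distance rules above can be sketched as a two-pass grouping along one chromosome. This is an illustrative reading of the criteria, not the procedure actually used: it assumes distances are measured from the end of one miRNA to the start of the next, it omits the rule that joins miRNAs sharing a transcriptional unit, and the function name and coordinates are hypothetical.

```python
def cluster_mirnas(mirnas, link_bp=4000, extend_bp=7000):
    """Group miRNAs on one chromosome into clusters.

    mirnas: list of (name, start, end) tuples in base pairs.
    Returns a list of groups (lists of names); isolated miRNAs
    appear as single-element groups.
    """
    ordered = sorted(mirnas, key=lambda m: m[1])

    # Pass 1: chain contiguous miRNAs separated by less than link_bp.
    groups = []
    for m in ordered:
        if groups and m[1] - groups[-1][-1][2] < link_bp:
            groups[-1].append(m)
        else:
            groups.append([m])

    # Pass 2: a lone miRNA within extend_bp of a cluster joins it.
    merged = []
    for g in groups:
        if merged:
            prev = merged[-1]
            gap = g[0][1] - prev[-1][2]
            # Merge only when exactly one side is already a cluster.
            one_is_cluster = (len(prev) >= 2) != (len(g) >= 2)
            if gap < extend_bp and one_is_cluster:
                prev.extend(g)
                continue
        merged.append(list(g))

    return [[m[0] for m in g] for g in merged]
```

For example, three miRNAs at 1.0, 3.0 and 9.0 kb would form one cluster under these rules, because the third lies within 7 kb of the pair formed by the first two, whereas a fourth at 30 kb would remain isolated.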

Figure 1

Genomic localization of the whole collection of miRNAs (miRBase, release 7.1). (a) The number of miRNAs located closer than a given distance from other miRNAs, in kb, is plotted. (b) Distribution of miRNAs according to their location in relation to other transcriptional units and their aggregation in clusters. Clustered and isolated miRNAs are classified depending on their genomic localization in relation to RefSeq genes, mRNAs, predicted genes and ESTs.

Table 1 Summary of the characteristics of miRNA clusters

We also analyzed the localization of miRNAs with regard to other transcriptional units annotated at the UCSC genome browser. We found that 37% of miRNAs (119 out of 325) were located in known protein-coding genes from RefSeq (NCBI), although only 96 were located in the same orientation as the host gene. According to the criterion of inclusion inside known genes, the other 206 miRNAs (63%) could be considered intergenic; however, when taking into account other transcriptional units annotated at the UCSC genome browser, such as mRNAs from GenBank, AceView or Ensembl-predicted genes and ESTs, only 99 of the 325 miRNAs were in fact purely intergenic (30%; Figure 1b). Of the 96 miRNAs located in the same orientation as their 77 host genes, most were located in introns and only 3 miRNAs (hsa-mir-22, hsa-mir-155 and hsa-mir-198) were at exon–intron boundaries or in untranslated gene regions (Supplementary Table 2). We analyzed the association of the set of host genes containing miRNAs with a given biological process or pathway using the IPA software. The program was interrogated for enrichment in biological functions, canonical pathways and molecular networks, and the statistical significance of associations was calculated using the right-tailed Fisher's exact test. As shown in Figure 2, some of the most significant associations of miRNA host genes with biological functions were found with several disorders related to neurological, psychological or nutritional disease; significant associations (−log(P-value)>3) were also found with carbohydrate metabolism, molecular transport and small molecule biochemistry. When the analysis was repeated using data from the latest miRBase release (13.0, March 2009), the results obtained were in general very similar and, notably, the strength of the associations increased for neurological and psychological disorders (Supplementary Table 3).
As far as canonical pathways are concerned, the most significant associations were found with pathways involved in pantothenate and CoA biosynthesis and GABA receptor signaling. Finally, we also analyzed the molecular networks in which the host genes containing the miRNAs of the SNP panel interact; the highest score was for a gene network involving 33 miRNA host genes related to gene expression, neurological disease, and skeletal and muscular system development and function (Figure 2).

Figure 2

Association of miRNA host genes with biological processes or molecular networks according to the Ingenuity Pathway Analysis software. (a) The five most significant associations of host genes with different categories of biological functions are shown. (b) Diagram of the molecular network showing the highest score; it is associated with functions in gene expression, neurological disease, and skeletal and muscular system development and function (miRNA host genes included in the network are represented as gray-filled shapes).

Selection of SNPs in miRNA regions

For the selection of the panel of SNPs, we considered 1 Mb of genomic DNA corresponding to 131 isolated miRNAs and 48 miRNA clusters; the selected region also includes a flanking region of 5 kb upstream and downstream of the specific miRNA or miRNA cluster. Before the selection of SNPs, we studied the SNP coverage of miRNA sequences according to the dbSNP database (dbSNP 125). We could only map 24 SNPs within miRNA sequences (Table 2). This represents a density of 0.86 SNPs per kilobase (24 SNPs per 27.7 kb) at miRNA regions compared with the observed SNP density of 3.99 SNPs per kilobase for the rest of the genome (11.96 × 10⁶ SNPs per 3 × 10⁶ kb). Overall, 93.3% of human pre-miRNAs had no reported SNPs and only 2 of the observed SNPs were located in the mature miRNA region: rs34059726 in hsa-mir-124a-3 and rs12975333 in the seed region of hsa-mir-125a. Owing to this low SNP density at miRNA regions, and for an optimal selection of informative SNPs, we combined a classical tagged-SNP approach (r²=0.8, MAF>0.05) with the selection of other SNPs according to their putative functional relevance, using information from the European population panel of HapMap (release 20, Phase II). Finally, the panel included a total of 768 SNPs (Supplementary Table 1), of which 576 were SNPs tagging miRNA gene regions, 19 were SNPs located in miRNA sequences (5 of the 24 SNPs within miRNAs were not included because of technical incompatibilities), 39 were SNPs at a nearby miRNA location (independently of their MAF or validation status) and 134 were SNPs tagging the promoter regions of miRNA host genes. The latter were included to map more precisely the genomic regions involved in possible future associations, to take into account regions that may be involved in miRNA biogenesis (genic miRNAs), and because of the interest of these genes per se, given the association found between them and gene networks related to neurological disease.
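
The density comparison and the panel composition given above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the helper name and the dictionary keys are hypothetical. Note that 24/27.7 evaluates to roughly 0.866, which the text reports as 0.86.

```python
def snp_density_per_kb(n_snps, length_kb):
    """SNPs per kilobase for a region of the given length."""
    return n_snps / length_kb


# Figures quoted in the text.
mirna_density = snp_density_per_kb(24, 27.7)        # miRNA sequences
genome_density = snp_density_per_kb(11.96e6, 3e6)   # rest of the genome
fold_difference = genome_density / mirna_density    # roughly 4.6-fold

# Composition of the 768-SNP panel.
panel = {
    "tagging miRNA gene regions": 576,
    "within miRNA sequences": 19,
    "near miRNA sequences": 39,
    "tagging host-gene promoters": 134,
}
panel_total = sum(panel.values())
```

The four categories add up to the 768 SNPs of the panel, and the tagging categories (576 + 134) match the 710 tagged-SNPs reported in the methods.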

Table 2 SNPs located in miRNAs and allele frequencies in a Spanish population

Genotyping and applicability of the miRNA SNP panel in a Spanish population

A control sample of 340 unrelated Spanish individuals was genotyped using a custom GoldenGate assay from Illumina. Three of the 19 genotyped SNPs located in miRNA sequences failed genotyping. Analysis of the allele frequencies of the other 16 miRNA SNPs showed that 9 of them (56.25%) are monomorphic (Table 2). Next, we studied the applicability of our miRNA SNP panel, constructed on the basis of information on the genetic variability of the European population (CEU) of the HapMap database, to the study of complex diseases in the Spanish population. Allele frequencies were calculated after confirmation of HWE, and MAFs were subjected to pair-wise comparisons between the Spanish sample and the HapMap CEU, Asian and Yoruba populations. Comparisons were carried out for the 702 of the 768 SNPs of the panel for which genotyping information was available both for the HapMap populations and for our population (Figure 3). When the MAFs of the Spanish sample were tested against the allele frequencies of the other three population samples, a high positive correlation between the Spanish and CEU samples was observed (R²=0.864, P<1 × 10⁻⁶). Conversely, we observed low correlations between the allele frequencies of the Spanish and the Asian (R²=0.247), and the Spanish and the Yoruba (R²=0.155) samples. In the case of 36 SNPs (5.12%), the less frequent allele in the CEU HapMap population was found to be the more frequent allele in the Spanish sample (points above the 0.5 horizontal dotted line in Figure 3). Further, we compared the allele frequencies between the CEU HapMap and the Spanish populations using a Pearson χ²-test.
According to this test, allele frequencies for 129 SNPs were significantly different between the two populations (P<0.05); however, when the comparisons were corrected for multiple testing (702 independent loci, P<7.12 × 10⁻⁵), allele frequencies for only 4 of the 702 analyzed SNPs remained significantly different between the Spanish and CEU HapMap samples (Table 3). Furthermore, these four SNPs, three located in the same genomic region corresponding to hsa-mir-128-1 (within an intron of R3HDM1, R3H domain containing 1) and one located in the region corresponding to hsa-mir-26a-2 (within an intron of CTDSP2, nuclear LIM interactor-interacting factor 2), also showed strong geographical genetic variation among the Yoruba, the Asian and the CEU populations from HapMap (Table 3).

Figure 3

Pair-wise comparisons of the allele frequencies for 702 SNPs between the Spanish population (CAT) and the CEU, YRI and CHB+JPT populations from HapMap. The MAFs of the CAT sample were tested against the allele frequencies of the other three population samples.

Table 3 SNPs with significant allele frequency differences between the Spanish and the European HapMap populations

Discussion

Genome-wide association studies using SNP genotyping constitute the standard approach for identifying the genetic component underlying complex traits. The HapMap Project has generated a large body of genetic information that has become essential for genotyping purposes,20 providing the LD information required for the custom design of SNP panels that have maximal power to capture the genetic variation in a specific genomic region of interest. In this study, we designed a panel of SNPs for the evaluation of miRNA regions as candidate loci for disease susceptibility in association studies. In particular, the panel is aimed at the study of psychiatric disorders, for which the identification of susceptibility genes has been less successful than in other complex disorders.21 The approach proposed here is based on the study of genetic variation in these regulatory elements as possible contributors to psychiatric disease susceptibility. Investigation of how complex gene regulatory networks evolve, and how this results in phenotypic alterations, may represent a useful approach toward understanding human evolution and disease.

SNPs are the best-characterized source of genetic variation in the human genome and SNP density can be used to measure the conservation of DNA sequences. The miRNA regions studied here revealed a low SNP density, which could indicate that, as previously suggested,22, 23, 24 miRNA conservation is important and that changes in these regions may contribute to human disease susceptibility. This is further supported by the fact that only six of the SNPs located in miRNAs were found to be common SNPs with MAF>0.05 in the studied population. It would be interesting to analyze whether the lack of SNPs in miRNAs is indeed due to natural selection or whether other factors, such as mutation rate bias in these genomic regions or the fact that many miRNAs are located in still poorly studied regions, are the cause of the low SNP density observed. However, the low number of SNPs and the lack of population frequency information for many of them make this analysis technically difficult at present. The rapid acquisition of sequence data from different individuals in many ongoing high-throughput sequencing projects, together with the increase in the number of newly discovered miRNAs, could make it more feasible in the near future. In fact, very recently, 117 miRNAs have been extensively resequenced in four different human populations in an effort to assess the natural selection of small RNAs during recent human evolution. This analysis reported a lower SNP density in miRNAs than in other noncoding regions, which were shown to be twice as dense.23 This study has also shown that strong purifying selection affects the sequence corresponding to the mature miRNA, as well as the complementary miRNA sequence (miRNA*), stem region and loop, indicating that mutations in miRNA hairpins are likely to be deleterious and may have severe phenotypic consequences for human health. Unfortunately, only 117 of the 718 currently known miRNAs could be resequenced in that study.
Nevertheless, as happened in our case, the fast increase in the number of newly discovered miRNAs is one of the main obstacles that researchers face. Remarkably, between the start of the project and now, the number of identified miRNAs has doubled. However, when analyzing the localization of the 718 currently known miRNAs (miRBase, release 13.0), we observed that our SNP panel (designed on the 325 miRNAs of miRBase, release 7.1) also captures the variability of around 100 of the newly discovered miRNAs. This is likely because many of the new miRNAs are in close vicinity to already known miRNAs.

As a part of the comprehensive study of the genomic localization of miRNAs, we observed that approximately half of the miRNAs are located inside coding genes. As it has been suggested that miRNAs and their host genes are coexpressed and that their action must be coordinated,25 we also wanted to study whether there was enrichment for a particular kind of gene among those that host miRNAs. Intriguingly, we found an enrichment for genes involved in psychiatric disorders, such as the serotonin receptor gene, HTR2C;26 the acetylcholine receptor gene, CHRM2;27 the glutamate receptor ionotropic delta 1, GRID1; or two of the inhibitory neurotransmitter GABA receptor genes (GABRA3 and GABRE).28 This is of particular interest for the goal of our study, as the panel of SNPs is mainly designed for the study of psychiatric disorders. In fact, promoter regions of miRNA host genes were included among the studied regions because, besides their own interest, inclusion of these regions will also make it possible to dissect the contribution of these particularly conspicuous genes in association studies and hence to evaluate the potential involvement of miRNAs in putative associations. Moreover, although the biogenesis of genic miRNAs is still unclear, intragenic miRNAs seem to be transcribed as part of their hosting transcription units, with the exception of those miRNAs located in antisense orientation to the ‘host’ gene. Thus, transcription of the host gene itself, controlled partially by promoter regions, may be important for miRNA production, and variation in these regions could affect the expression of the hosted miRNA. The comprehensive study of the genomic localization of miRNAs that we performed shows that 93% of the miRNAs are located either inside previously described or predicted transcriptional units or in clusters (only 22 miRNAs could be considered purely isolated and intergenic).
It is known that one miRNA can target as many as several hundred genes, but it is also known that one gene can be targeted synergistically by more than one miRNA. Considering such intricate regulatory networks, it is tempting to speculate that it might be favorable for the cell to group within the same transcriptional unit those miRNAs and/or genes that act on the same developmental or metabolic pathways.

Finally, another aim of this study was to investigate how well the HapMap European data represent our specific Northeast Spanish (Catalan) population. Comparison of allele frequencies not only confirmed the applicability of our SNP panel but also pointed to two genomic regions that show geographical genetic variation among populations. The most likely cause of these marked geographical differences is natural selection. This is clearly the case for the three SNPs located in the genomic region corresponding to hsa-mir-128-1 within the R3HDM1 gene, which is in fact in the vicinity (within 1 Mb) of an LD region containing the lactase gene (LCT), for which selection-based evolutionary change in humans has already been established.29 The other SNP is located in a genomic region for which positive selection has not been shown. In fact, analyzing the extent of the regions showing geographical differences, as well as the ancestral alleles, is important to discern whether the real cause of these differences is natural selection. Apart from their evolutionary significance, the possible phenotypic consequences of genetic variation within these regions, such as differential expression of a particular miRNA, may be relevant to disease if associations with those regions are found. Nevertheless, caution must be exercised when analyzing these particular regions to avoid spurious associations, as was the case for the reported association between the lactase gene (LCT) and tall/short status in a European American sample.30

In conclusion, we performed a comprehensive analysis of the genomic organization of miRNAs and their SNP coverage to build a panel of SNPs for the analysis of complex disease. Aside from the limitations imposed by the fast discovery of miRNAs, which makes it difficult to cover the current number of these regulators, the use of the designed miRNA SNP panel in association studies should help to elucidate the molecular basis of several disorders by identifying still hidden links between miRNAs and human disease.