Host gene constraints and genomic context impact the expression and evolution of human microRNAs

Increasing evidence has shown that recent miRNAs tend to emerge within coding genes. Here we conjecture that human miRNA evolution is tightly influenced by the genomic context, especially by host genes. Our findings show a preferential emergence of intragenic miRNAs within old genes. We found that miRNAs within old host genes are significantly more broadly expressed than those within young ones. Young miRNAs within old genes are more broadly expressed than their intergenic counterparts, suggesting that young miRNAs have an initial advantage by residing in old genes, and benefit from their hosts' expression control and from the exposure to diverse cellular contexts and target genes. Our results demonstrate that host genes may provide stronger expression constraints to intragenic miRNAs in the long run. We also report associated functional implications, highlighting the genomic context and host genes as driving factors for the expression and evolution of human miRNAs.

Number of human inter-and intragenic miRNAs emerged in each branch across the vertebrate lineage.
Here we analyzed only expressed miRNAs and those distant from 10 kb to each other were merged into a single cluster. The representative miRNA of the cluster was randomly picked. Number of miRNAs per million years (mya) were calculated by the ratio of the number of inter-or intragenic miRNAs emerged in each branch (see Fig. 1A, main text) to the time elapsed from the previous branch. For example, the gain rate of intergenic miRNAs in branch 2 (chicken) is Ninter ∕ Db12-b1 -Db12-b2, where Ninter is the number of intergenic miRNAs emerged in branch 2; Db12-b1 is the divergence time between branch 12 (human) and branch 1 (fish) and Db12-b2 is the divergence time between branch 12 and branch 2 (chicken).
Significant differences (branches 5, 6, 7, 8, 12; P  0.05) between the quantities of inter-and intragenic miRNAs in each branch were assessed by Binomial tests assuming equal probabilities (0.5). Divergence times were obtained from timetree.org. (c) Number of human inter-and intragenic miRNAs emerged in each branch across the vertebrate lineage. Here we analyzed only the miRNAs annotated by Fromm et al. (2015) 1 . This dataset consists of a re-annotation of miRBase (v.21) entries based on stringent criteria to exclude non-bona fide miRNAs. For humans, the number of true-positives was reduced to one third of the miRBase entries. Even with this highly stringent annotation, the excess of intragenic miRNAs in branch 7 persists (P  0.05, binomial-test). The excess of inter-or intragenic miRNAs was not significant in all other branches. (d) Age relationships among intragenic miRNAs and their host genes. Horizontal line lengths are proportional to the frequency of miRNAs and host genes of each age. (e) Null distribution of the frequency of old genes. Equal-sized sets of genes just as or older than the sets of miRNAs owing a particular age were randomly sampled 10,000 times (SD = 0.015; 95% CI = 0.68 ± 0.0002). The observed greater proportion of old host genes is represented by the vertical line (P  0.0001 or P = 3.93 × 10 -11 , chi-squared test). (f) Length distribution of host genes. Gene length was given by the distance between annotated start and end genomic coordinates represented in log2 scale. We used the longest gene transcript as the reference and single exon genes were excluded. Distributions show that old genes are longer than young ones (P = 0.002, Mann-Whitney U test). (g) Length distribution of host genes controlling for young host length. We generated a random distribution centered on the approximate median of young host lengths to rule out a possible bias as old genes are longer than young ones.
Adopting this procedure, no significant difference between the two distributions was observed (P = 0.78, Mann-Whitney U test) but we still observed a similar proportion (80%) of old host genes as the original data (i.e. no difference compared to the whole dataset: P = 0.57, chi-squared test). (h) Expression breadth 2 of old and young host genes. Old host genes are more broadly expressed (lower ) than young ones (P = 1.4 × 10 -6 , Mann-Whitney U test). (i) Expression breadth distribution of host genes controlling for young host expression breadth. Similarly to Fig. S1G, we generated a random distribution centered on the approximate median of young host expression breadth to rule out a possible bias as old genes have higher expression breadth than young ones. Adopting this procedure, no significant difference between the two distributions was observed (P = 0.2, Mann-Whitney U test) but we still observed the similar proportion (81.8%) of old host genes as the original data (i.e. no difference compared to the whole dataset: P = 0.76, chi-squared test). (j) Young host and non-host genes' frequency according to their ages. Single exon genes were excluded to avoid new gene bias in non-host genes due to excess of retrogenes. Also for young genes (age ≥ 2) frequencies of each age are different between the two categories. Host genes are overrepresented for age 2 (the oldest one among young genes) (P  0.0001, chi-squared test).

Supplementary Figure 2. Age of host and non-host genes using alternative dating methods. (a)
Frequency of old and young host and non-host genes using Ensembl orthology (v.71). The 1:1 orthologs between human and additional 29 species were obtained from Ensembl (see Supplementary Fig. 11). Old genes were defined as those with orthologs in at least one fish species and young genes were those emerged after the fish divergence ( Supplementary Fig. 11). A chi-squared test revealed that host genes are overrepresented in the Old group relative to the Young (P = 1.88 × 10 -13 ). (b) Host and non-host gene frequency according to age classes defined by Chen et al. (2012) 3 . Their dataset provides ages for mouse genes defined as follows: Cellular organisms (C. org); Eukaryota (Euk); Fungi/Metazoa (F/Met); Metazoa (Met), Chordata (Cho) and Mammalia (Mam). We then used the human-mouse 1:1 orthologs obtained from Ensembl to assign these age classes to host and non-host genes. Using chi-squared tests, we observed that host genes are overrepresented (in respect to non-hosts) in older groups (Eukaryota: P Expression breadth using RNA-seq data obtained from the same study (Meunier et al. 2013) 5 , measured for five tissues (brain, cerebellum, heart, kidney and testis). It is noticeable that expression levels increase with miRNA age (i.e. older). Interestingly, by excluding expression from testis, young intragenic miRNAs (age 7-12) within old host genes are more highly expressed than the intergenic counterpart. (b) When we considered all 12 tissues (i.e., including testis) and MirGeneDB annotation, the difference in A was not observed, indicating testis-specific expression for young intergenic miRNAs.
(c) Boxplots of the median of miRNA expression levels based on the miRNAs annotated by miRBase.
By excluding expression from testis, no significant differences were observed between miRNAs of the same age classes. (d) When we considered all 12 tissues (i.e., including testis) and miRBase annotation we observed that young intergenic miRNAs (age 7-12) are more highly expressed than young intragenic miRNAs. The difference in C is therefore likely due to testis-specific expression. (** P  0.0001; * P  0.06; Mann-Whitney U test).

Supplementary Figure 6. Co-expression analysis of intragenic miRNAs and their host genes.
Histograms represent the null distributions of the average of co-expression ratios between miRNAs and their host genes. We used expression datasets from Meunier et al. To rule out the possibility that the lack of correlations in B and C were due to possible extended transcription of closest genes and proximal miRNAs in a tissue-specific manner, we sought for reads spanning the interval of gene 3'end and proximal miRNAs in 16 tissues (Illumina Human Body Map 2.0). No evidence of this mechanism was found. A similar pattern was observed for all 20 miRNAs downstream to the closest gene up to 10 kb.

Supplementary Figure 8. Sequence conservation of human miRNAs. (a) Sequence conservation is
represented by the average of phyloP scores along the precursor sequence. It is noticeable that older miRNAs have higher phyloP scores, reflecting our age class definitions. We detected a significant score difference between young (age 7-12) intragenic miRNAs within old host genes (blue) and young intergenic (red) miRNAs (**, P  0.001, Mann-Whitney U test). However, we noticed that this difference also appear when we compare the random backgrounds (see Methods in the main text). Thus, the mentioned difference is likely due to the surrounding genomic region, rather than a differential conservation of the precursor sequences. (b) Conservation of the seed sequences compared to the precursors. All comparisons in the same age classes (and also with the random background) revealed a significant higher conservation of the seed sequence (all P  0.01, Mann-Whitney U test), except for intragenic miRNAs of ages 5-6 and 7-12 within young hosts. Taking into account the relatively short evolutionary time since the divergence of primates, it is notable that the seed of young miRNAs exhibited higher scores in respect to both the precursor and random background. However, although this could suggest a prompt seed conservation, a careful interpretation is needed because in addition to the methodological difficulty to detect selection over such a brief time scale, their phyloP scores are actually on the range of the neutral expectation (close to 0.0).