The Evolution and Expression Pattern of Human Overlapping lncRNA and Protein-coding Gene Pairs

A long non-coding RNA overlapping with a protein-coding gene (lncRNA-coding pair) is a special type of overlapping gene pair. Protein-coding overlapping genes have been well studied, and increasing attention has been paid to lncRNAs. By studying lncRNA-coding pairs in the human genome, we showed that lncRNA-coding pairs were more likely to be generated by overprinting and that genes in lncRNA-coding pairs were retained preferentially over non-overlapping genes. In addition, the preference of overlapping configurations preserved during evolution depended on the origin of lncRNA-coding pairs. Further investigation showed that the promotion of splicing of embedded protein-coding partners by lncRNAs was a unilateral interaction, whereas the improvement of gene expression by the existence of an overlapping partner was bidirectional and decreased with the increasing evolutionary age of genes. Additionally, the expression of lncRNA-coding pairs showed an overall positive correlation, and this correlation was associated with the overlapping configuration, the local genomic environment and the evolutionary age of genes. Comparison of the expression correlation of lncRNA-coding pairs between normal and cancer samples indicated that lineage-specific pairs that include old protein-coding genes may play an important role in tumorigenesis. This work presents a systematic and comprehensive view of the evolution and expression pattern of human lncRNA-coding pairs.

Overlapping genes were first identified in viruses 1 and subsequently found in vertebrate genomes 2,3 . Aside from reducing genome size, overlaps have been hypothesized to be involved in regulating gene expression at diverse levels, including transcription, mRNA splicing, transport, processing, stability and translation [4][5][6] . The transcription of antisense genes affects both the splicing and the expression of sense genes in human 7 , and the expression of overlapping genes is highly correlated 8,9 . A mutation in an overlapping region may disrupt the function of the two genes simultaneously. Nevertheless, overlapping genes do not show higher sequence conservation than non-overlapping genes, and the overlap structure is poorly preserved during evolution 8,10,11 .
Several hypotheses have been proposed to explain the origin of overlapping genes [11][12][13] . Because overlapping genes are interdependent, overlapping regions are expected to be under strong selective pressure. In fact, both purifying selection and positive selection have been found in members of overlapping genes [14][15][16] , which supports the hypothesis that overlapping genes could originate via overprinting, a process that generates new genes from pre-existing sequences 14,16 . A distinctive characteristic of overlapping genes that originated from overprinting is that the new genes appear to be lineage-specific while the old partners are widespread across species 13 . Another study of the overlapping genes ACAT2 (acetyl-CoA acetyltransferase 2) and TCP1 (t-complex protein 1) showed that the overlap of two previously separate genes may arise during evolution in one of two ways. In one scenario, one of the genes loses functional signals through translocation; by chance, it adopts replacement signals from its new neighboring gene, which lets it continue to function normally and leaves the two genes overlapping. In the other, two fixed genes become neighbors through genomic rearrangement, and a subsequent change in gene structure results in overlap 12 .
Scientific Reports | 7:42775 | DOI: 10.1038/srep42775
According to the coding potential of genes, overlapping genes can be categorized as coding-coding, coding-noncoding and noncoding-noncoding pairs 17 . lncRNAs are known to regulate the expression of protein-coding genes through cis-acting or trans-acting regulatory mechanisms [18][19][20][21] . As expected for regulatory molecules, lncRNAs tend to be expressed at lower levels and to display higher tissue specificity than protein-coding genes 22,23 . Although a number of lncRNAs are conserved across vertebrates [22][23][24] , most lncRNAs are subject to rapid turnover during evolution in terms of sequence and transcription 22,25 . lncRNAs overlapping with protein-coding genes have received particular attention, and many studies have uncovered various mechanisms by which lncRNAs regulate the expression of their protein-coding overlapping partners 19,26,27 . The dysregulation of overlapping lncRNAs has also been observed in cancer [28][29][30] , and mutated lncRNAs co-localized with protein-coding genes may act as prognostic biomarkers and therapeutic targets for cancer [30][31][32] .
Herein, we present a systematic and comprehensive analysis of the evolution and expression pattern of lncRNA-coding pairs in the human genome. By testing the origin of lncRNA-coding pairs, we observed a preference for the retention of genes in lncRNA-coding pairs during evolution. The overlapping configuration and the evolutionary age of genes were taken into account when estimating the effect of overlap on the expression and co-expression of lncRNA-coding pairs. We further compared the behavior of lncRNA-coding pairs between carcinoma and normal samples, which points to a contribution of lncRNA-coding pairs to tumorigenesis.

Results
Overlap benefits the retention of genes. We based our study on the data originally produced by Necsulea et al. 22 , with a particular focus on human lncRNA genes overlapping with protein-coding genes. Of the 24,793 annotated human lncRNA genes, about 29% overlapped with protein-coding genes (Table 1), and 26% of protein-coding genes were in overlap (Supplementary Table 1).
It has been suggested that lncRNA genes evolve more rapidly than protein-coding genes 25 and that overlapping genes arise in a continuous evolutionary process 11 . We therefore asked whether the evolutionary age of genes influences the overlap of lncRNAs with protein-coding genes. lncRNAs were younger than their protein-coding overlapping partners in most (86.5%) lncRNA-coding pairs. Only around one-tenth of lncRNA-coding pairs shared the same time period of origin, and about 86% of pairs included old protein-coding genes that originated more than 300 million years (Myr) ago (Supplementary Table 2). There were 108 clusters in which lncRNAs with distinct times of origin overlapped with a single protein-coding gene. GO analysis of these protein-coding genes showed strong enrichment for terms related to neurogenesis and hippocampus development (q value = 0.02). These lncRNAs, through successive waves of origination, may have contributed to the evolution and functional refinement of human neurons.
To address the impediment imposed by the insufficient genome annotations of some species, we integrated the human lncRNA genes into three age groups and observed that the percentage of lncRNA genes overlapping with protein-coding genes increased significantly with their evolutionary age ( Table 1). The same trend was observed in protein-coding genes (Supplementary Table 1). These observations could be explained by two reasons. One is that there are selective pressures for the retention of genes in this genomic organization. The other one is that established genes are advantageous to the occurrence of overlap, indicating that lncRNA-coding pairs mainly originate from two fixed genes.
We further investigated the evolutionary pattern of human overlapping genes based on comparisons with chimpanzee and mouse genomes. The evolutionary scenarios revealed that the overlap of lncRNAs and protein-coding genes occurred more likely as a result of overprinting (pattern 4, 7, 8, Fig. 1), and 49% of lncRNA-coding pairs fit exactly the hypothesis. By contrast, orthologs of coding-coding pairs frequently existed but did not overlap in the chimp and mouse (patterns 9-11, Fig. 1), which is consistent with the hypotheses that overlapping genes could be generated by genomic rearrangement and adoption of signals from neighboring genes or by change in gene structure. There were only 150 (2%) lncRNA-coding pairs fitting this pattern, indicating that the higher percentage of overlapping genes in old group is mainly caused by the evolutionary advantage for the retention of genes in overlap. The preference of overlapping configurations based on the origin is preserved through evolution.
To test whether the overlapping configuration affects the evolution of lncRNA-coding pairs, we first classified the pairs into 5 groups depending on the orientation of the transcripts involved. Pairs overlapping on opposite strands were classified as head-to-head (H2H, 5′-regions overlap), tail-to-tail (T2T, 3′-regions overlap) or embedded (OEB) pairs. Pairs overlapping on the same strand were classified as head-to-tail (H2T, 5′-region overlaps with 3′-region) or embedded (SEB) pairs (Fig. 2a). Overall, overlaps on opposite strands amounted to almost 93% of pairs, and embedded pairs were far more numerous than partially overlapping pairs (Supplementary Table 3). Considering the evolutionary age of lncRNAs, old lncRNA genes were more likely to be embedded within protein-coding genes on the opposite strand and less likely to be embedded on the same strand. The opposite tendency was observed for young lncRNA genes (Fig. 2b,c). Additionally, old lncRNA genes showed a lower preference for H2H than young lncRNA genes (Fig. 2c). These observations suggest that lncRNA-coding pairs show a strong preference for embedded and different-strand overlaps. Theoretically, activating two overlapping transcriptional units at the same time is unlikely, because it would result in transcriptional interference 6 . Indeed, we found that protein-coding genes overlapping with lncRNAs on the same strand had significantly lower expression levels than those overlapping on the opposite strand (Wilcoxon signed-rank test, p value = 2.4 × 10⁻³), with a median RPKM of 10.7 on the same strand and 12.7 on the opposite strand. Therefore, overlaps of lncRNAs with protein-coding genes on the same strand are less favorable. Since few lncRNA-coding pairs originate from genomic rearrangement and change in gene structure, which are the main sources of partially overlapping genes, embedded pairs are common among lncRNA-coding pairs.
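The five configurations can be derived from gene coordinates and strand alone. The following is a minimal sketch, not the authors' implementation: the `Gene` type and the `classify_overlap` function are hypothetical names, coordinates are assumed to be 1-based and inclusive, and a pair is treated as embedded when one gene lies entirely within the span of the other.

```python
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    strand: str  # "+" or "-"
    start: int   # leftmost coordinate, 1-based inclusive
    end: int     # rightmost coordinate, 1-based inclusive


def classify_overlap(a: Gene, b: Gene) -> Optional[str]:
    """Return H2H, T2T, OEB, H2T or SEB, or None if the genes do not overlap."""
    if min(a.end, b.end) < max(a.start, b.start):
        return None  # no shared nucleotide
    same_strand = a.strand == b.strand
    embedded = (a.start >= b.start and a.end <= b.end) or (
        b.start >= a.start and b.end <= a.end
    )
    if embedded:
        return "SEB" if same_strand else "OEB"
    if same_strand:
        # Partial overlap on one strand: 5' region of one gene meets
        # the 3' region of the other.
        return "H2T"
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    # Opposite strands, partial overlap. If the plus-strand gene is leftmost,
    # the genes converge and their 3' ends overlap (tail-to-tail); otherwise
    # they diverge and their 5' ends overlap (head-to-head).
    return "T2T" if plus.start < minus.start else "H2H"
```

For example, `classify_overlap(Gene("+", 200, 400), Gene("-", 100, 300))` describes a divergent pair whose 5′ regions overlap, so it is labelled H2H.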
We then assessed the evolutionary conservation of lncRNA-coding pairs in terms of genomic structure and overlapping configuration. Of the 7,876 human lncRNA-coding pairs, the orthologs of only 487 pairs were involved in overlaps in both the chimpanzee and mouse genomes (Fig. 1, Supplementary File 1). However, the composition of overlapping configurations in the conserved pairs was not significantly different from that of all pairs (Supplementary Table 3), suggesting that the overlapping configuration is not related to the evolution of overlap. Together, the above observations indicate that the origin of overlapping genes determines the preference of overlapping configurations, which is then preserved during the evolution of overlapping genes.
The alternative splicing pattern of lncRNA-coding pairs is related to the overlapping configuration. It has been reported that a number of lncRNA genes possess the canonical splice site consensus motifs 33 and the antisense expression can affect mRNA splicing of sense genes 7 . To further explore whether the overlapping configuration would affect the alternative splicing pattern of lncRNA-coding pairs, we downloaded the alternative transcript annotations of human lncRNA and protein-coding genes from Ensembl 34 . Around 26% of the annotated human lncRNAs produced alternative transcripts, and 48% of them overlapped with protein-coding gene(s) (Supplementary Table 4). According to the number of alternative transcripts annotated for the lncRNA and protein-coding gene, the alternative splicing patterns of lncRNA-coding pairs were classified as single-to-single (SS), single-to-multiple (SM), multiple-to-single (MS) or multiple-to-multiple (MM) patterns (the first letter was representative for the lncRNA gene and the second for the protein-coding gene). There was a clear association between the alternative splicing pattern and the overlapping configuration of lncRNA-coding pairs (Supplementary Table 5). As shown in Fig. 3b, those lncRNA-coding pairs with SS pattern were more likely to be embedded on the same strand and more SM pairs were observed with embedded form and less with partially overlapping form. For lncRNA-coding pairs with MS pattern, the H2T configuration was preferred over other configurations and those MM pairs showed right opposite preference with SM pairs, more with partially overlapping form and less with embedded form. These observations imply that the alternative splicing pattern of lncRNA-coding pairs is related to the type of overlapping configuration.
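The SS/SM/MS/MM labelling reduces to whether each gene has one or several annotated transcripts. A minimal sketch, assuming transcript counts per gene are already available (the function name and the example counts are hypothetical):

```python
from collections import Counter


def splicing_pattern(n_lncrna_transcripts: int, n_coding_transcripts: int) -> str:
    """First letter describes the lncRNA gene, second the protein-coding gene."""
    first = "M" if n_lncrna_transcripts > 1 else "S"
    second = "M" if n_coding_transcripts > 1 else "S"
    return first + second


# Illustrative (made-up) transcript counts for four pairs: (lncRNA, coding).
example_counts = [(1, 1), (1, 4), (3, 1), (2, 5)]
pattern_tally = Counter(splicing_pattern(l, c) for l, c in example_counts)
```

Cross-tabulating these labels against the overlapping configuration of each pair would give a table of the kind summarized in Supplementary Table 5.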
Antisense lncRNAs can affect the alternative splicing of sense protein-coding genes 35 , and overlap regions are potential hotspots for splicing regulation 7 . Consistent with this, there were only a few lncRNA-coding pairs with the SS pattern, and the majority of pairs were overlaps with the SM and MM patterns (Fig. 3a). Furthermore, protein-coding genes generating multiple products were more likely to overlap with lncRNAs, whereas lncRNAs generating multiple products were not more likely to overlap with protein-coding genes (Supplementary Table 4). This suggests that the antisense transcription-mediated mechanism of splicing regulation is a unilateral interaction.

Overlapping genes have higher expression level and tissue specificity. Because antisense expression has been reported to affect the expression of sense genes 7 , we assessed the potential regulatory interactions mediated by the genomic organization. For young protein-coding genes (age < 90 Myr), the expression levels of overlapping genes were significantly higher than those of non-overlapping ones, and the gap narrowed with increasing evolutionary age (Fig. 4a). For lncRNAs, the expression levels of overlapping genes were higher than those of non-overlapping genes in all groups (Fig. 4c). The data suggest that the genomic structure may benefit the expression of lncRNA-coding pairs and that the effect on genes is age-specific. Additionally, in the old group, both lncRNAs and protein-coding genes in lncRNA-coding pairs had higher tissue specificity than non-overlapping genes (Fig. 4b,d), which indicates that overlap may diversify the function of genes by confining the expression spectrum of overlapping genes. Taken together, the genomic organization improves the expression level and is conducive to confining the expression breadth of genes. The effect of overlap on gene expression is more complex in chimp and mouse. As in human, for protein-coding genes the existence of overlapping partners increased the expression level of young genes and the tissue specificity of old genes. The effect of overlap on lncRNAs was somewhat different, however: their tissue specificity was lower than that of non-overlapping genes (Supplementary Figs 1 and 2).
To explore the effect of overlap on the expression conservation of genes, the conservation score of gene expression was calculated. The expression of protein-coding genes in lncRNA-coding pairs was more conserved than non-overlapping genes, whereas lncRNAs in overlap had lower expression conservation than non-overlapping genes (Fig. 5a), suggesting that the genomic structure promotes the expression conservation of protein-coding genes rather than lncRNAs. For the 487 conserved lncRNA-coding pairs, the expression conservation scores of protein-coding genes were skewed towards the highest value (Fig. 5b), while the score of lncRNA genes showed a broader distribution (Fig. 5c), which is consistent with the finding that lncRNAs have more rapid transcriptional turnover than protein-coding genes 22,25 . The conservation scores of the expression ratios of lncRNAs over their protein-coding overlapping partners were also calculated and the value was scattered as lncRNAs (Fig. 5d), suggestive of the barely conserved coordinated expression of the lncRNA-coding pairs. The conservation degree of the expression ratios was significantly correlated with lncRNAs (Fig. 5e), whereas no significant correlation was observed when considering protein-coding genes (Fig. 5f), which confirms the regulatory role of lncRNAs.
Genes in lncRNA-coding pairs are widely co-expressed. Overlapping genes are known to couple gene expression 9 . We thus tested the expression correlation of lncRNA-coding pairs and observed that the expression of lncRNA-coding pairs showed an overall positive correlation, with a median Spearman correlation coefficient  (Fig. 6b).
It has been well studied that the bidirectional-like promoters contribute to the coordinated expression of H2H pairs 9,36 . To assess the effect of bidirectional promoters, we roughly searched for identical transcription factor binding sites (TFBSs) within the 1-kb upstream genomic regions of the two transcriptional start sites. More H2H pairs contained identical TFBS(s) within the two independent upstream regions when compared with the other two overlapping configurations on the opposite strand (Supplementary Table 6) and only H2H pairs with identical TFBS(s) had higher expression correlation than pairs with no identical TFBS ( Supplementary Fig. 3), suggesting that the expression of H2H pairs is likely coordinated by similar regulatory sequences.
A previous study showed that lncRNAs and nearby protein-coding genes are co-expressed 37 . We therefore grouped lncRNA-coding pairs with neighboring pair(s) within a 40-kb genomic distance into blocks to estimate the effect of the local genomic environment. Around 54% of lncRNA-coding pairs fell into blocks with more than one pair (Supplementary File 2). The expression correlation coefficients of lncRNA-coding pairs were less dispersed among the block pairs (mean SD = 0.24) than among the corresponding individual pairs (SD = 0.44; Student's t-test for the mean difference, p value < 2.2 × 10⁻¹⁶).
Taking the evolutionary age of genes into account, the expression correlation of lncRNA-coding pairs was significantly weakened with the increased evolutionary age of protein-coding genes, but not with lncRNAs. Young protein-coding genes originated less than 90 Myr ago had a relatively stronger correlation (median R = 0.39) than old protein-coding genes (median R = 0.13) with their lncRNA overlapping partners ( Supplementary Fig. 4). It could partially be explained by the fact that old protein-coding genes are required for the maintenance of the cell fundamental functions and their expression should remain a relatively stable level. These results together suggest that the overlapping configuration, local genomic environment and evolutionary age of genes have an influence on the expression correlation of lncRNA-coding pairs.
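The co-expression measure used throughout this section is a Spearman correlation between the expression profiles of the two genes in a pair, summarized per group by its median. As a hedged illustration only (the functions below are a standard-library sketch and the expression values are made up, not data from the study), the coefficient can be computed as a Pearson correlation on average ranks:

```python
from statistics import mean, median


def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation coefficient of two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant profile
    return cov / (var_x * var_y) ** 0.5


# Hypothetical RPKM values of one lncRNA and its coding partner across tissues.
lncrna_rpkm = [0.5, 1.2, 3.4, 0.1, 2.2]
coding_rpkm = [10.0, 14.0, 30.0, 8.0, 12.0]
pair_r = spearman(lncrna_rpkm, coding_rpkm)

# Median over several pairs in one age group (values are placeholders).
group_median = median([pair_r, 0.2, -0.1])
```

Because the coefficient depends only on ranks, it is insensitive to the large difference in absolute expression level between lncRNAs and protein-coding genes.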

Signatures of co-expression of lncRNA-coding pairs for carcinoma.
Potential lncRNA-disease associations have been identified by computational models [38][39][40][41][42] and aberrant expression of antisense RNA may contribute to cancers [43][44][45] . As co-expression between overlapping partners has been frequently reported 46,47 , we investigated whether there existed any signature in dysregulated coordinated expression of lncRNA-coding pairs using an RNA sequencing dataset of 369 cancer samples 9 . Genes with low level of expression were excluded and 2,122 human lncRNA-coding pairs (Supplementary File 3) were left for the further analysis. The patterns of the expression correlation of lncRNA-coding pairs were distinct in normal and cancer (Fig. 7a) and the lncRNA-coding pairs displayed significantly higher correlation in cancer (Fig. 7b). Around 52% of lncRNA-coding pairs were only significantly correlated in cancer and only about two percent showed an opposite tendency. Six percent of lncRNA-coding pairs were correlated in both normal and cancer samples (Fig. 7d).
The expression of non-conserved or lineage-specific lncRNA-coding pairs had significantly higher correlation in cancer, while pairs conserved in human, chimp and mouse genomes did not (Fig. 7c). For the three age groups of protein-coding genes, only the expression of old genes showed significantly stronger correlation with their partners in cancer (median R = 0.32) than in normal (median R = 0.14, Supplementary Fig. 5a). The possible reasons may be that a small portion of pairs included protein-coding genes originated less than 300 Myr ago and those protein-coding genes showed no significant functional enrichment, as well as genes in conserved pairs ( Supplementary Figs 6 and 7). In contrast, old genes in non-conserved pairs are functional in various processes, like development and also cell-cell signal pathway (Supplementary File 5). The regulatory phenotypic profiles as a part of cancer hallmark network framework would lead to clinical phenotype 48 . Therefore, we could speculate that the altered expression correlation pattern of lineage-specific lncRNA-coding pairs, especially pairs containing protein-coding genes originated more than 300 Myr ago, may play an important role in tumorigenesis. But for coding-coding pairs, the expression correlation was stronger in cancer among all age groups and that of conserved pairs also showed significant increase in cancer ( Supplementary Fig. 5c,d).
Several lncRNA-coding pairs with exactly opposite types of correlation in cancer and normal samples were identified (Supplementary File 4). Interestingly, the expression of SAMSN1 and SAMSN1 antisense RNA 1 was negatively correlated in normal samples (R = −0.85, p = 0.03) but positively correlated in cancer (R = 0.70, p = 4.2 × 10⁻⁹), implying that the suppression of SAMSN1 by the lncRNA is absent in cancer. SAMSN1 is predominantly expressed in immune tissues and hematopoietic cells, with lower expression in heart, brain, placenta and lung 49 . Since the cancer expression data we used were mainly from lung, prostate, ovary and brain, it was reasonable that the lncRNA-coding pair was positively correlated in cancer. Previous studies have shown that SAMSN1 expression is low or absent in human myeloma cell lines 50 and that the absence of SAMSN1 contributes to multiple myeloma progression 51 . In contrast, SAMSN1 is over-expressed in glioma, and high expression of SAMSN1 is a significant risk factor for the progression of glioblastoma multiforme. Thus the altered correlation between SAMSN1 and SAMSN1 antisense RNA 1 may serve as a biomarker for the prognosis and therapy of cancer.

Discussion
Genes in lncRNA-coding pairs are more likely to be retained throughout evolution. Protein-coding overlapping genes that originated through overprinting are constrained to the 123:132 phase, which ensures the least mutual constraint on both protein sequences 15 . Since lncRNA genes have no reading frame, evolve rapidly and are less conserved than protein-coding genes in terms of sequence 25 , a lncRNA is more likely to be generated from a pre-existing coding sequence. Indeed, nearly half of lncRNA-coding pairs were found to be generated by overprinting, and few pairs arose from a change in the spatial relationship of two separate genes. By contrast, most human coding-coding pairs were the result of genomic rearrangement or elongation of two genes, similar to the findings of Fukuda et al. 52 . Considering the origin of overlaps, the trend that the percentage of genes in overlap increases with evolutionary age indicates that overlap is advantageous to the retention of genes throughout evolution. Furthermore, the overlap of single protein-coding genes with lncRNAs that originated in different time periods could play a role in establishing or maintaining cellular diversity and may contribute to species diversification.
Overlapping configurations are mainly affected by the origin of overlapping genes. Partially overlapping genes usually arise from genomic rearrangement or elongation of two fixed genes, whereas introns are a valuable evolutionary source for overprinting 13 , which hints that overlapping genes originating in this way may occur as embedded pairs. Since more lncRNA-coding pairs were generated by overprinting and few by a change in the spatial relationship of two fixed genes, a higher percentage of embedded pairs was observed. Also, different-strand overlaps accounted for the majority of lncRNA-coding pairs because of the transcriptional interference of overlaps on the same strand. However, there was no difference between the overlapping configuration compositions of the conserved and all lncRNA-coding pairs. These results suggest that the overlapping configuration depends only on the origin of overlapping genes and that subsequent evolution has no influence on it.
Overlap enhances the expression level and tissue specificity of genes in lncRNA-coding pairs and these effects are age-specific. The expression level of genes in lncRNA-coding pairs was higher than that of non-overlapping genes and the increase was more obvious in young group. By contrast, the tissue specificity of overlapping genes was remarkably improved in old group. These may give us a clue that the existence of overlapping partners adjusts the expression level and expression breadth of genes in lncRNA-coding pairs and these effects differ at different age groups. Considering the expression conservation of genes, the expression of protein-coding genes was much more conserved than lncRNA genes. However, the overlap structure only improved the expression conservation of protein-coding genes not lncRNA genes. Comparisons of the expression ratio conservation with the expression conservation of lncRNA and protein-coding genes in lncRNA-coding pairs confirmed the regulatory role of lncRNAs.
Expression correlation is a predominant characteristic of overlapping genes 8,9 and overlapping configurations, local genomic environment and the evolutionary age of genes are important factors influencing this correlation. SEB pairs, under common regulatory system, showed the highest expression correlation and H2H pairs had higher correlation than other different-strand overlaps. It has been reported that bidirectional promoter coordinates the expression of sense gene and antisense lncRNA 53 , which would be the reason for the stronger correlation of H2H pairs. The deviation of expression correlation coefficient of individual lncRNA-coding pairs was significantly larger than those in blocks, suggestive of the important role of the local genomic environment. Then, taking the evolutionary age of genes into account, newly evolved protein-coding genes had higher expression correlation with their lncRNA partners than old genes. It indicates that young protein-coding genes are more flexible than old genes whose expression should be maintained at a relatively stable level.
The expression correlation pattern of lncRNA-coding pairs was altered in cancer, which may contribute to tumorigenesis. Although the expression correlation of lncRNA-coding pairs was higher in cancer, the conserved pairs and the pairs including protein-coding genes that originated less than 300 Myr ago showed no significant difference between normal and cancer samples, which differs from coding-coding pairs. For lncRNA-coding pairs, old protein-coding genes in non-conserved pairs showed functional enrichment in terms of development and morphogenesis, which suggests that the aberrant regulatory phenotype of those pairs may play an important role in carcinogenesis. Additionally, pairs that are correlated in both normal and cancer tissues but with opposite types of correlation may promote the pathogenesis of cancer.
Through detecting the orthologs of human lncRNA-coding pairs in chimp and mouse genomes, initial attempts were made to investigate the origin and evolution of lncRNA-coding pairs. We are well aware that the study on few genomes may lead to biased conclusions, so further comparative studies about lncRNA-coding pairs based on more well-annotated genomes are necessary. However, our study did present a relatively comprehensive understanding of the evolution and expression pattern of lncRNA-coding pairs.

Data and Methods
Data. The annotations of lncRNA and protein-coding genes used to identify lncRNA-coding pairs, the orthology information of lncRNA genes, and the strand-specific and non-strand-specific expression data used for expression correlation, tissue specificity and expression conservation were obtained from Necsulea et al. 22 . The alternative transcript information of human lncRNA and protein-coding genes (Ensembl v85) was downloaded from the Ensembl Genome Browser database (http://www.ensembl.org/index.html) 34 . To match the genome annotation version used by Necsulea et al. to annotate lncRNA and protein-coding genes, the conserved transcription factor binding sites (TFBSs) based on GRCh37 were downloaded from the UCSC Genome Browser database (http://genome.ucsc.edu/) with the Table Browser 54 . In addition, the expression data of 369 carcinoma samples were obtained from Balbin et al. 9 .
Identification of lncRNA-coding pairs. The lncRNA-coding pairs were identified under the criteria that two transcripts shared at least one nucleotide and only the longest form of alternative splicing was considered.
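The one-nucleotide criterion amounts to an interval-overlap test on the longest transcript of each gene. A minimal sketch under stated assumptions: the `Transcript` type, the `find_pairs` helper and the example coordinates are hypothetical, coordinates are 1-based and inclusive, and strand is ignored at this step because both same-strand and opposite-strand overlaps count as pairs.

```python
from typing import List, NamedTuple, Tuple


class Transcript(NamedTuple):
    gene: str
    chrom: str
    start: int  # 1-based inclusive
    end: int    # 1-based inclusive


def shared_nucleotides(a: Transcript, b: Transcript) -> int:
    """Number of positions covered by both transcripts' genomic spans."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def find_pairs(
    lncrnas: List[Transcript], coding: List[Transcript]
) -> List[Tuple[str, str]]:
    """All (lncRNA, coding) gene pairs that share at least one nucleotide."""
    return [
        (l.gene, c.gene)
        for l in lncrnas
        for c in coding
        if shared_nucleotides(l, c) >= 1
    ]


# Made-up coordinates for illustration.
lncrnas = [Transcript("lnc1", "chr1", 100, 200)]
coding = [
    Transcript("geneA", "chr1", 200, 500),  # touches lnc1 at position 200
    Transcript("geneB", "chr1", 201, 300),  # adjacent, no shared nucleotide
    Transcript("geneC", "chr2", 100, 200),  # same coordinates, other chromosome
]
pairs = find_pairs(lncrnas, coding)
```

The nested loop is quadratic in the number of genes; for genome-scale annotation a sorted sweep or an interval tree would be the usual replacement, but the pairing criterion stays the same.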
To estimate the effect of the local genomic environment on the co-expression of lncRNA-coding pairs, all pairs were grouped into distinct blocks. If a lncRNA-coding pair had neighboring pair(s) within a 40-kb genomic distance, these pairs were considered a block. Each pair in the block was then checked in turn, and the block was extended until no pair in it had a further neighboring pair within a 40-kb genomic region.
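The block construction is a chaining procedure: a block keeps growing as long as the next pair lies within 40 kb of a pair already in it. The sketch below is one possible reading, not the authors' code. It assumes that the genomic distance is measured from the rightmost end reached by the current block to the start of the next pair on the same chromosome; the function name and tuple layout are hypothetical.

```python
def group_into_blocks(pairs, max_gap=40_000):
    """Chain neighbouring pairs into blocks.

    pairs: iterable of (pair_id, chrom, start, end), where start/end span
    both genes of the pair. Returns a list of blocks (lists of pair ids).
    """
    by_chrom = {}
    for pair in pairs:
        by_chrom.setdefault(pair[1], []).append(pair)

    blocks = []
    for chrom in sorted(by_chrom):
        current, reach = [], None  # reach = rightmost end of the current block
        for pair_id, _, start, end in sorted(by_chrom[chrom], key=lambda p: p[2]):
            if current and start - reach <= max_gap:
                current.append(pair_id)
                reach = max(reach, end)
            else:
                if current:
                    blocks.append(current)
                current, reach = [pair_id], end
        if current:
            blocks.append(current)
    return blocks


# Made-up coordinates for illustration.
example_pairs = [
    ("p1", "chr1", 1_000, 5_000),
    ("p2", "chr1", 30_000, 35_000),    # 25 kb after p1: same block
    ("p3", "chr1", 200_000, 210_000),  # 165 kb after p2: new block
    ("p4", "chr2", 1_000, 2_000),
]
blocks = group_into_blocks(example_pairs)
```

Blocks containing a single pair correspond to the individual pairs that the block pairs are compared against in the Results.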
Identification of orthologous genes and the evolutionary age of human protein-coding genes. Orthology was inferred from the pairwise homology information between genomes provided by InParanoid8 55 and Ensembl 34 . Orthologs of human protein-coding genes were first selected from InParanoid8 (http://inparanoid.sbc.su.se/download/) 55 by requiring an inparalog score equal to one in 9 species: chimpanzee, gorilla, orangutan, macaque, mouse, opossum, platypus, chicken and Xenopus. The threshold used to select orthologous genes from Ensembl was an identity greater than 55%, the value below which 5% of the identity scores between human genes and their InParanoid8 orthologs fell. The union of the orthologs from InParanoid8 and Ensembl in the given genomes was then used in subsequent analyses. Briefly, the minimum evolutionary age of protein-coding genes was inferred from the presence of orthologs without taking transcription evidence into account, as was done for lncRNA genes by Necsulea et al. 22 .
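The Ensembl identity cutoff is described as the 5th percentile of identity scores among InParanoid8 orthologs, and the final ortholog set as a union of the two sources. A rough sketch of that logic follows; the nearest-rank percentile definition, the function names and all example values are assumptions for illustration, not details taken from the study.

```python
import math


def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100) by the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def select_orthologs(inparanoid_orthologs, ensembl_identity, threshold):
    """Union of InParanoid8 orthologs and Ensembl homologs above the cutoff.

    inparanoid_orthologs: set of gene ids with inparalog score equal to one.
    ensembl_identity: dict mapping gene id to percent identity with human.
    """
    passing = {g for g, identity in ensembl_identity.items() if identity > threshold}
    return set(inparanoid_orthologs) | passing


# Placeholder identity scores of InParanoid8 orthologs (1..100 for clarity).
inparanoid_identities = list(range(1, 101))
cutoff = percentile_nearest_rank(inparanoid_identities, 5)

orthologs = select_orthologs(
    {"geneA", "geneB"},
    {"geneB": 80.0, "geneC": 60.0, "geneD": 40.0},
    threshold=55.0,
)
```

Taking the union rather than the intersection favours sensitivity: a gene is considered to have an ortholog in a species if either resource supports it.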
Construction of evolutionary scenarios of human lncRNA-coding and coding-coding pairs. For each lncRNA-coding or coding-coding pair in the human genome, the spatial relationships of its orthologs in the chimp and mouse genomes were checked based on the orthology inferred above. Since the relationship between human protein-coding genes and their orthologous genes in other species was not a simple one-to-one relationship, each of the orthologous genes was checked for overlaps in the corresponding genome. Based on the presence and spatial relationships of orthologs in the given genomes, the evolutionary scenarios of human overlapping genes were classified into sixteen patterns as in Fig. 1.
Calculation of tissue specificity. To detect the expression specificity of lncRNA and protein-coding genes across tissues, the mean expression levels of genes in each tissue were obtained. We used the following algorithm proposed by Landgraf et al. 56 to calculate the tissue specificity of the expression of lncRNA and protein-coding genes:
Then the conservation score of gene j among three species:
$$\text{Total Conservation Score} = W_{1,2}\,\bigl(C_{1,2}(j) + 1\bigr) + W_{1,3}\,\bigl(C_{1,3}(j) + 1\bigr) + W_{2,3}\,\bigl(C_{2,3}(j) + 1\bigr)$$
where $W_{1,2}$ was the phylogenetic distance between species 1 and 2 and $C_{1,2}(j)$ the corresponding pair-wise conservation score. Considering that the conservation score ranges from −1 to 1, we added 1 to make the conservation score non-negative before weighting it by the pair-wise phylogenetic distance. The conservation score of the expression ratio was calculated in the same way.
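The weighting described in the text, in which each pair-wise conservation score is shifted by 1 and multiplied by the phylogenetic distance between the two species, can be sketched as follows. This is an illustration under assumptions: the three pair-wise terms are taken to be summed, and the distances and scores below are placeholders rather than values from the study.

```python
def total_conservation_score(pairwise_scores, distances):
    """Combine pair-wise conservation scores across species pairs.

    pairwise_scores: dict mapping a species pair to a score in [-1, 1].
    distances: dict mapping the same species pairs to phylogenetic distances.
    Each score is shifted by +1 so that every weighted term is non-negative.
    """
    total = 0.0
    for species_pair, score in pairwise_scores.items():
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {species_pair}: {score}")
        total += distances[species_pair] * (score + 1.0)
    return total


# Placeholder inputs for one gene (hypothetical distances and scores).
distances = {
    ("human", "chimp"): 1.0,
    ("human", "mouse"): 2.0,
    ("chimp", "mouse"): 2.0,
}
scores = {
    ("human", "chimp"): 0.5,
    ("human", "mouse"): 0.0,
    ("chimp", "mouse"): -1.0,
}
example_total = total_conservation_score(scores, distances)
```

With this shift, a pair of species whose expression is perfectly anti-correlated contributes nothing, while agreement between distant species contributes more than agreement between close ones.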