Functional classification of long non-coding RNAs by k-mer content

Abstract

The functions of most long non-coding RNAs (lncRNAs) are unknown. In contrast to proteins, lncRNAs with similar functions often lack linear sequence homology; thus, the identification of function in one lncRNA rarely informs the identification of function in others. We developed a sequence comparison method to deconstruct linear sequence relationships in lncRNAs and evaluate similarity based on the abundance of short motifs called k-mers. We found that lncRNAs of related function often had similar k-mer profiles despite lacking linear homology, and that k-mer profiles correlated with protein binding to lncRNAs and with their subcellular localization. Using a novel assay to quantify Xist-like regulatory potential, we directly demonstrated that evolutionarily unrelated lncRNAs can encode similar function through different spatial arrangements of related sequence motifs. K-mer-based classification is a powerful approach to detect recurrent relationships between sequence and function in lncRNAs.
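
To make the idea of comparing sequences by k-mer abundance rather than by linear alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes length-normalized k-mer counts and a Pearson correlation between profiles; the function names `kmer_profile` and `pearson`, the choice of k = 3, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Return the frequency of every k-mer over ACGT, in a fixed order.

    Counts are divided by the number of windows so that transcripts of
    different lengths are comparable (an assumption of this sketch).
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    n_windows = max(len(seq) - k + 1, 1)
    return [counts[w] / n_windows for w in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


# Toy sequences (hypothetical): seq_b uses the same motifs as seq_a in a
# shifted arrangement, while seq_c has unrelated content.
seq_a = "ACGUACGUACGUACGU"
seq_b = "CGUACGUACGUACGUA"
seq_c = "AAAAAAAAAAAAAAAA"

sim_ab = pearson(kmer_profile(seq_a), kmer_profile(seq_b))
sim_ac = pearson(kmer_profile(seq_a), kmer_profile(seq_c))
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

In this toy case the two sequences built from the same motifs score as more similar to each other than either does to the unrelated sequence, regardless of where the motifs sit. A real analysis would likely also standardize each k-mer against a background set of transcripts before comparing profiles; that step is omitted here.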

Fig. 1: Overview and initial test of k-mer-based sequence comparison.
Fig. 2: LncRNAs of related function often have related k-mer contents.
Fig. 3: LncRNA localization and protein-binding correlate with k-mer content.
Fig. 4: K-mer content correlates with lncRNA repressive activity.
Fig. 5: Mapping of elements required for repression by Xist-2kb in TETRIS.

Data availability

The datasets generated during and/or analyzed during the current study are available within the article and its supplementary information files.

Acknowledgements

We thank UNC colleagues for discussions, and J. Cheng for help with TETRIS cloning. This work was supported by National Institutes of Health (NIH) Grants UL1TR002489, GM121806, and GM105785, Basil O’Connor Award no. 5100683 from the March of Dimes Foundation, and funds from the Eshelman Institute for Innovation, the Lineberger Comprehensive Cancer Center and the UNC Department of Pharmacology (J.M.C.), the James S. McDonnell Foundation 21st Century Science Initiative–Complex Systems Scholar Award Grant no. 220020315 (P.J.M.), and NIH MIRA award R35 GM122532 (K.M.W.). J.M.K. is an NSF Graduate Research Fellow (Grant DGE-1650116) and was supported in part by an NIH training grant in bioinformatics and computational biology (T32 GM067553). D.M.L. was supported in part by an NIH training grant in genetics and molecular biology (T32 GM007092). M.J.S. was an NSF Graduate Research Fellow (Grant DGE-1144081) and was supported in part by an NIH training grant in molecular and cellular biophysics (Grant T32 GM08570).

Author information

Contributions

J.M.K., P.J.M., and J.M.C. conceived the study. J.M.K., D.S., and J.M.C. performed the computational analysis. S.O.K., K.I., D.M.L., M.D.S., J.S.W., A.R.B., K.M.W., and J.M.C. designed and performed the TETRIS assays. D.W.C., C.R.H., S.W., Q.C., and J.M.K. built the website. J.M.K. and J.M.C. wrote the paper.

Corresponding author

Correspondence to J. Mauro Calabrese.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11 and Supplementary Tables 2–6, 9, 10, 13–17 and 21

Reporting Summary

Supplementary Table 1

List of curated cis-regulatory lncRNAs in human and mouse

Supplementary Table 7

Human lncRNA community assignments and descriptions

Supplementary Table 8

Mouse lncRNA community assignments and descriptions

Supplementary Table 11

Human community k-mer profiles

Supplementary Table 12

Mouse community k-mer profiles

Supplementary Table 18

k-mer abundance in nuclear and cytosolic lncRNAs

Supplementary Table 19

Protein log-likelihood results comparing the predictive power of null versus full logistic regression models

Supplementary Table 20

Protein logistic regression (LR) precision and recall results

Supplementary Table 22

TETRIS-lncRNA fragment information

Supplementary Table 23

Oligonucleotide primers for the TETRIS assay

Supplementary Software

A library for counting small k-mer frequencies in nucleotide sequences

Cite this article

Kirk, J.M., Kim, S.O., Inoue, K. et al. Functional classification of long non-coding RNAs by k-mer content. Nat Genet 50, 1474–1482 (2018). https://doi.org/10.1038/s41588-018-0207-8
