The functions of most long non-coding RNAs (lncRNAs) are unknown. In contrast to proteins, lncRNAs with similar functions often lack linear sequence homology; thus, the identification of function in one lncRNA rarely informs the identification of function in others. We developed a sequence comparison method to deconstruct linear sequence relationships in lncRNAs and evaluate similarity based on the abundance of short motifs called k-mers. We found that lncRNAs of related function often had similar k-mer profiles despite lacking linear homology, and that k-mer profiles correlated with protein binding to lncRNAs and with their subcellular localization. Using a novel assay to quantify Xist-like regulatory potential, we directly demonstrated that evolutionarily unrelated lncRNAs can encode similar function through different spatial arrangements of related sequence motifs. K-mer-based classification is a powerful approach to detect recurrent relationships between sequence and function in lncRNAs.
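The core idea of comparing sequences by k-mer abundance rather than by alignment can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of k = 3, the length normalization, and the use of Pearson correlation as the similarity measure are all assumptions made for illustration. The two toy sequences contain the same motifs in a different linear order, so they score as highly similar even though the arrangement differs.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Return length-normalized counts of every k-mer over ACGT (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    n_windows = max(len(seq) - k + 1, 1)
    return [counts[w] / n_windows for w in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def kmer_similarity(seq_a, seq_b, k=3):
    """Alignment-free similarity: correlation between two k-mer profiles."""
    return pearson(kmer_profile(seq_a, k), kmer_profile(seq_b, k))


# Toy sequences (made up for illustration, not real lncRNAs).
same_motifs_order_1 = "AAAGGG" * 5 + "CCCTTT" * 5
same_motifs_order_2 = "CCCTTT" * 5 + "AAAGGG" * 5
different_motifs = "ACGT" * 15

print(kmer_similarity(same_motifs_order_1, same_motifs_order_2))  # close to 1
print(kmer_similarity(same_motifs_order_1, different_motifs))     # below 0
```

A fuller analysis would likely standardize each k-mer's abundance against a background set of transcripts before correlating, so that globally common k-mers do not dominate the score; that step is omitted here to keep the sketch short.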


Data availability

The datasets generated during and/or analyzed during the current study are available within the article and its supplementary information files.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




Acknowledgements

We thank UNC colleagues for discussions, and J. Cheng for help with TETRIS cloning. This work was supported by National Institutes of Health (NIH) Grants UL1TR002489, GM121806, and GM105785, Basil O’Connor Award no. 5100683 from the March of Dimes Foundation, and funds from the Eshelman Institute for Innovation, the Lineberger Comprehensive Cancer Center and the UNC Department of Pharmacology (J.M.C.), the James S. McDonnell Foundation 21st Century Science Initiative–Complex Systems Scholar Award Grant no. 220020315 (P.J.M.), and NIH MIRA award R35 GM122532 (K.M.W.). J.M.K. is an NSF Graduate Research Fellow (Grant DGE-1650116) and was supported in part by an NIH training grant in bioinformatics and computational biology (T32 GM067553). D.M.L. was supported in part by an NIH training grant in genetics and molecular biology (T32 GM007092). M.J.S. was an NSF Graduate Research Fellow (Grant DGE-1144081) and was supported in part by an NIH training grant in molecular and cellular biophysics (Grant T32 GM08570).

Author information

Author notes

    • Susan O. Kim & Kaoru Inoue

    Present address: National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA

    • Matthew J. Smola

    Present address: Ribometrix, Durham, NC, USA

    • Allison R. Baker

    Present address: Harvard Medical School, Ph.D. Program in Biological and Biomedical Sciences, Boston, MA, USA


Affiliations

  1. Department of Pharmacology and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

    • Jessime M. Kirk, Susan O. Kim, Kaoru Inoue, David M. Lee, Megan D. Schertzer, Joshua S. Wooten, Allison R. Baker, Daniel Sprague & J. Mauro Calabrese

  2. Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

    • Jessime M. Kirk

  3. Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

    • Matthew J. Smola & Kevin M. Weeks

  4. Curriculum in Genetics and Molecular Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

    • David M. Lee, Megan D. Schertzer & Joshua S. Wooten

  5. Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

    • Daniel Sprague

  6. Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

    • David W. Collins, Christopher R. Horning, Shuo Wang & Qidi Chen

  7. Carolina Center for Interdisciplinary Applied Mathematics, Department of Mathematics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

    • Peter J. Mucha




Contributions

J.M.K., P.J.M., and J.M.C. conceived the study. J.M.K., D.S., and J.M.C. performed the computational analysis. S.O.K., K.I., D.M.L., M.D.S., J.S.W., A.R.B., K.M.W., and J.M.C. designed and performed the TETRIS assays. D.W.C., C.R.H., S.W., Q.C., and J.M.K. built the website. J.M.K. and J.M.C. wrote the paper.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to J. Mauro Calabrese.

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–11 and Supplementary Tables 2–6, 9, 10, 13–17 and 21

  2. Reporting Summary

  3. Supplementary Table 1

    List of curated cis-regulatory lncRNAs in human and mouse

  4. Supplementary Table 7

    Human lncRNA community assignments and descriptions

  5. Supplementary Table 8

    Mouse lncRNA community assignments and descriptions

  6. Supplementary Table 11

    Human community k-mer profiles

  7. Supplementary Table 12

    Mouse community k-mer profiles

  8. Supplementary Table 18

    k-mer abundance in nuclear and cytosolic lncRNAs

  9. Supplementary Table 19

    Protein log-likelihood results comparing the predictive power of null versus full logistic regression models

  10. Supplementary Table 20

    Protein logistic regression (LR) precision and recall results

  11. Supplementary Table 22

    TETRIS-lncRNA fragment information

  12. Supplementary Table 23

    Oligonucleotide primers for the TETRIS assay

  13. Supplementary Software

    A library for counting small k-mer frequencies in nucleotide sequences

About this article

Publication history