Protocol

Pipelines for cross-species and genome-wide prediction of long noncoding RNA binding

Nature Protocols (2019)

Abstract

Abundant long noncoding RNAs (lncRNAs) in mammals can bind to DNA sequences and recruit histone- and DNA-modifying enzymes to binding sites to epigenetically regulate target genes. However, the binding motifs and target sites of most lncRNAs are unknown. The large numbers of lncRNAs and target sites in the whole genome make it infeasible to examine lncRNA binding to DNA purely experimentally. Here, we report a protocol for lncRNA/DNA-binding analysis that is built upon a database containing the GENCODE-annotated human and mouse lncRNAs, the orthologs of these lncRNAs in 17 mammals, and the genome sequences of the 17 mammals. Cross-species and genome-wide lncRNA/DNA-binding analysis begins with and is driven by a database search. The predicted DNA-binding motifs and binding sites answer the general question of which lncRNAs may epigenetically regulate which genes, and can be used to identify potential sites for genome and epigenome editing. Using the protocol requires preliminary knowledge of the base-pairing rules that guide the binding of noncoding RNAs to DNA to form triplexes, as well as familiarity with the UCSC Genome Browser. A genome-wide prediction takes from 2 to 10 d, and the results are sent to users automatically by e-mail. The platform is updated continuously, making it possible to study more lncRNAs and larger genomic regions in less computational time.

Code availability

The source code is available at our website (http://lncRNA.smu.edu.cn) and on GitHub (https://github.com/LongTarget/) under the GNU Affero General Public License v3.0.

Data availability

Example datasets, including the TFOclass1 and TFOsorted files for all examples, are available at our website (http://lncRNA.smu.edu.cn) and on GitHub (https://github.com/LongTarget/).

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Key references using this protocol

He, S., Zhang, H., Liu, H. & Zhu, H. Bioinformatics 31, 178–186 (2015): https://doi.org/10.1093/bioinformatics/btu643

Liu, H., Shang, X. & Zhu, H. Bioinformatics 33, 1431–1436 (2017): https://doi.org/10.1093/bioinformatics/btw818

Wang, S. et al. Cell Death Dis. 9, 805 (2018): https://doi.org/10.1038/s41419-018-0869-2

Acknowledgements

This work received financial support (to H. Zhu) from the NSFC (31571348 and 31771456), the Special Program for Applied Research on SuperComputation of the NSFC-Guangdong Joint Fund, and the Guangzhou Science and Technology Innovation Committee (201607010067).

Author information

Author notes

  1. These authors contributed equally: Jie Lin, Yujian Wen and Sha He.

Affiliations

  1. Bioinformatics Section, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China

    • Jie Lin
    • Yujian Wen
    • Sha He
    • Xiaoxue Yang
    • Hao Zhu
  2. Network and Information Technology Center, Southern Medical University, Guangzhou, China

    • Hai Zhang

Contributions

S.H. and X.Y. performed the genome searches; H. Zhang and J.L. built the database; H. Zhang, Y.W., and H. Zhu revised the LongTarget code; H. Zhu designed the study, analyzed the data, and drafted the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Hai Zhang or Hao Zhu.

Integrated supplementary information

  1. Supplementary Figure 1 The coordinates of some TTSs of RP11-375H19.2 collected in the database.

    To open this window that shows TTSs of TFO1, click the blue TFO1 button in the webpage shown in Fig. 9.

  2. Supplementary Figure 2 The initial records in the Excel file that reports the TTS distribution of H19 at all transcripts in the human genome hg38.

    “bs” means binding site. bs_chr, bs_start, and bs_end indicate the chromosome number, start coordinate, and end coordinate of a TTS. TTS_area is the area of the peak of a TTS (as shown in custom tracks of TTS distributions) and indicates the strength of the TTS.

  3. Supplementary Figure 3 The expression level of CDKN2B-AS1 in 53 tissues.

    The picture is obtained from the GTEx Gene track in the UCSC Genome Browser by choosing the GTEx Transcript track in the Expression section to display GTEx genes graphically.

  4. Supplementary Figure 4 The TTS distributions of marmoset lncRNAs in the marmoset CDKN2A/2B locus.

    From top to bottom are 14 custom tracks of the TTS distribution of 14 lncRNAs, the track of Ensembl Genes, the custom track of CDKN2B-AS1_Marmoset, and the track of RepeatMasker. Three TTSs at transposable elements of Simple or Low Complexity are marked by three blue vertical lines. The results indicate that, as in humans, many lncRNAs bind to promoters of CDKN2A/2B.

  5. Supplementary Figure 5 The TTS distributions of mouse lncRNAs in the mouse CDKN2A/2B locus.

    Some TTSs at promoters and CpG islands (in green) are marked in yellow, and some TTSs at transposable elements (in the RepeatMasker track) and repetitive elements (in the SimpleRepeats track) are marked in blue. Some lncRNAs have TTSs only at transposable and/or repetitive elements.

  6. Supplementary Figure 6 The TTS distributions of H19 and other human lncRNAs in the human IGF2 region.

    From top to bottom are custom tracks of the TTS distribution of 16 lncRNAs, UCSC Genes, CpG Islands, ENCODE DNA Methylation (the colored lines indicate DNA methylation signals), and ENCODE Histone Modification (the colored areas indicate histone modification signals). This figure indicates that many lncRNAs may bind to the IGF2 region at the site H19 binds to.

  7. Supplementary Figure 7 Exons of the orthologue of human CDKN2B-AS1 in marmoset.

    This webpage shows the coordinates of all exons of the orthologue of human CDKN2B-AS1 in marmoset.

  8. Supplementary Figure 8 The format of the custom gene track file marmoset-CDKN2B-AS1.

    The gene name is CDKN2B-AS1_Marmoset as shown in Supplementary Figure 4. All custom gene track files should follow the same format, but can adopt any file name.

  9. Supplementary Figure 9 The results of an M:N case of genome-wide prediction.

    The figure shows some of the records in the Excel file that reports TTSs of lncRNA transcripts at the genomic regions of protein-coding transcripts. Here peak_area and TTS_area are defined as in Supplementary Figure 2.

  10. Supplementary Figure 10 Two ENST ID lists.

    The two ENST ID lists are the inputs of the M:N case of genome-wide prediction shown in Supplementary Figure 9. These Ensembl ENST IDs are differentially expressed lncRNA transcripts and differentially expressed protein-coding transcripts from an RNA-seq analysis we performed that compares gene expression in 12 human colorectal cancer tissues with that in 3 normal colorectal tissues (unpublished observations [Sha He, Yujian Wen, Hao Zhu]). These transcripts were assembled from RNA-seq reads with the StringTie program (Nat. Biotechnol. 33, 290–295; 2015), and differential expression was determined with the edgeR program (Genome Biol. 17, 75; 2016).
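
The TTS report columns described for Supplementary Figure 2 (bs_chr, bs_start, bs_end, TTS_area) and the two ENST ID lists used as input for the M:N case can be handled with a few lines of Python. What follows is only a minimal sketch, not part of the protocol's own code. It assumes the Excel report has been exported to CSV with exactly those four column headers, and it reads "M:N" as every lncRNA transcript being tested against every protein-coding transcript. The sample rows, the placeholder IDs, and the function names load_tts, strongest, and mn_pairs are all hypothetical.

```python
import csv
import io
from itertools import product

# Hypothetical sample rows. The column names follow Supplementary Figure 2;
# the coordinates and areas are invented purely for illustration.
SAMPLE = """bs_chr,bs_start,bs_end,TTS_area
chr11,2000100,2000180,35.0
chr11,2150000,2150060,120.5
chr9,21990000,21990090,60.0
"""


def load_tts(handle):
    """Read TTS records, converting coordinates to int and area to float."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append({
            "bs_chr": rec["bs_chr"],
            "bs_start": int(rec["bs_start"]),
            "bs_end": int(rec["bs_end"]),
            "TTS_area": float(rec["TTS_area"]),
        })
    return rows


def strongest(rows, n=2):
    """Return the n records with the largest TTS_area, strongest first."""
    return sorted(rows, key=lambda r: r["TTS_area"], reverse=True)[:n]


def mn_pairs(lncrna_ids, coding_ids):
    """Enumerate all (lncRNA, protein-coding) transcript pairs.

    Assumption: an M:N run pairs each of the M lncRNA transcripts with
    each of the N protein-coding transcripts.
    """
    return list(product(lncrna_ids, coding_ids))


rows = load_tts(io.StringIO(SAMPLE))
top = strongest(rows)

# Placeholder IDs, not real Ensembl transcript IDs.
pairs = mn_pairs(["ENST_LNC_A", "ENST_LNC_B"],
                 ["ENST_PC_X", "ENST_PC_Y", "ENST_PC_Z"])
```

Sorting by TTS_area reflects the statement that the peak area indicates the strength of a TTS. With two lncRNA IDs and three protein-coding IDs, the pairing step yields six combinations, which gives a rough sense of how the workload of an M:N prediction grows with the sizes of the two lists.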

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–10 and Supplementary Methods

  2. Reporting Summary

About this article

DOI

https://doi.org/10.1038/s41596-018-0115-5
