Prediction of the cell-type-specific transcription of non-coding RNAs from genome sequences via machine learning

Koido, Masaru; Hon, Chung-Chau; Koyama, Satoshi; Kawaji, Hideya; Murakawa, Yasuhiro; Ishigaki, Kazuyoshi; Ito, Kaoru; Sese, Jun; Parrish, Nicholas F.; Kamatani, Yoichiro; Carninci, Piero; Terao, Chikashi

doi:10.1038/s41551-022-00961-8

Article
Published: 21 November 2022

Nature Biomedical Engineering volume 7, pages 830–844 (2023)

Abstract

Gene transcription is regulated through complex mechanisms involving non-coding RNAs (ncRNAs). As the transcription of ncRNAs, especially of enhancer RNAs, is often low and cell type specific, how the levels of RNA transcription depend on genotype remains largely unexplored. Here we report the development and utility of a machine-learning model (MENTR) that reliably links genome sequence and ncRNA expression at the cell type level. Effects on ncRNA transcription predicted by the model were concordant with estimates from published studies in a cell-type-dependent manner, regardless of allele frequency and genetic linkage. Among 41,223 variants from genome-wide association studies, the model identified 7,775 enhancer RNAs and 3,548 long ncRNAs causally associated with complex traits across 348 major human primary cells and tissues, such as rare variants plausibly altering the transcription of enhancer RNAs to influence the risks of Crohn’s disease and asthma. The model may aid the discovery of causal variants and the generation of testable hypotheses for biological mechanisms driving complex traits.

**Fig. 1: Strategy to find the effects of variants on ncRNA transcription.**

**Fig. 2: Cell-type-specific prediction of promoter-level and enhancer-level expression.**

**Fig. 3: Comparison between MENTR and Basenji.**

**Fig. 4: Accurate predictions of mutation effects on ncRNA expression in a cell-type-dependent manner.**

**Fig. 5: Predictions to explain and prioritize GWAS findings.**

**Fig. 6: Finding a candidate of a causal variant in Crohn’s disease by linking variants with transcribed enhancers in relevant cell types.**

**Fig. 7: Diagram of the interpretation of rare variants associated with complex traits.**

Data availability

The GWAS trait-associated ncRNA database is available via a user-friendly graphical user interface application (https://doi.org/10.5281/zenodo.5638259); basic-usage information is provided in the figure legend of Supplementary Fig. 11. Supplementary files for training MENTR ML models and the pre-trained MENTR ML models are available at https://github.com/koido/MENTR (https://github.com/koido/MENTR/wiki provides information on how to use them; large files are available at https://doi.org/10.5281/zenodo.5348471). Publicly available datasets that we used in the study are as follows: LCL CAGE transcriptomes²², https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5835; GWAS catalogue⁵⁴ (r2019-07-12), https://www.ebi.ac.uk/gwas/downloads; 95% credible sets in 139 independent associated regions in IBD, supplementary data from ref. 31; representative TSS of CAGE peaks (promoters) in FANTOM5, https://fantom.gsc.riken.jp/5/datafiles/phase2.5/extra/CAGE_peaks/hg19.cage_peak_phase1and2combined_coord.bed.gz; inferred mid position of enhancers in FANTOM5, https://fantom.gsc.riken.jp/5/datafiles/phase2.5/extra/Enhancers/human_permissive_enhancers_phase_1_and_2.bed.gz; CAGE peak annotations in FANTOM5, https://fantom.gsc.riken.jp/5/datafiles/phase2.5/extra/CAGE_peaks_annotation; caQTL from LCL ATAC-seq, supplementary data from ref. 23; 5,376 puQTL and 110 eaQTL from LCL CAGE, datasets from ref. 22 via personal communication; eQTL GTEx v7 (ref. 55), https://www.gtexportal.org/home/; CAGE and NET-CAGE transcriptome of the five ENCODE cell lines (GM12878, HeLa-S3, HepG2, K562, and MCF-7) (ref. 14), https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118075; DeepSEA (Beluga)¹⁵,¹⁷, https://github.com/FunctionLab/ExPecto; Basenji¹⁸, https://github.com/calico/basenji/tree/0.4; IBD multi-ancestry meta GWAS results⁵⁶,⁵⁷, https://www.ibdgenetics.org; asthma multi-ancestry GWAS results³², http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST005001-GCST006000/GCST005212; results of MPRA, supplementary data in refs. 24,25,26; non-coding credible sets in UK Biobank, supplementary data from ref. 39 and https://www.finucanelab.org/data (release 1.1); dbSNP 151, https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz.
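Several of the FANTOM5 files listed above are gzip-compressed BED files. As a minimal sketch of how such a file could be read after downloading it, the following uses only the Python standard library. It assumes the generic BED layout (tab-separated; chromosome, 0-based start, exclusive end, then optional name, score and strand); the record type, the function names and the example line in the usage note are hypothetical and are not taken from the MENTR repository.

```python
import gzip
from typing import Iterator, NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (generic BED convention)
    end: int     # 0-based, exclusive
    name: str    # "." when the column is absent
    strand: str  # "." when the column is absent


def parse_bed_line(line: str) -> BedRecord:
    """Parse one tab-separated BED line into a BedRecord."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "."
    strand = fields[5] if len(fields) > 5 else "."
    return BedRecord(chrom, start, end, name, strand)


def read_bed(path: str) -> Iterator[BedRecord]:
    """Yield records from a plain or gzip-compressed BED file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and the optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            yield parse_bed_line(line)
```

For example, `parse_bed_line("chr1\t100\t200\tpeak1\t0\t+\n")` gives a record with `start == 100`, `end == 200` and `strand == "+"` for this made-up line. Which columns of the FANTOM5 files carry the representative TSS or the inferred enhancer mid position is not stated here and should be checked against the FANTOM5 documentation.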

Code availability

The pre-trained MENTR ML models (347 sample ontologies and LCL) and the source code for training MENTR ML models and for running in silico mutagenesis are available at https://github.com/koido/MENTR. A packaged Docker image is also available on Docker Hub (https://hub.docker.com/repository/docker/mkoido/mentr). Custom code for comparing MENTR with the ExPecto and Basenji methods is available at https://doi.org/10.5281/zenodo.7008214.
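The general idea of in silico mutagenesis can be illustrated with a toy sketch: predict once from the reference sequence, once from the sequence carrying the alternative allele, and take the difference. The code below is not the MENTR implementation. The predictor is a hypothetical stand-in (GC fraction) used only so that the example runs, and all function names are assumptions.

```python
from typing import Callable


def apply_snv(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return the sequence with a single-nucleotide substitution.

    `position` is 0-based. The reference allele is checked so that a
    coordinate or strand mistake fails loudly instead of silently.
    """
    if sequence[position] != ref:
        raise ValueError(
            f"expected {ref!r} at position {position}, found {sequence[position]!r}"
        )
    return sequence[:position] + alt + sequence[position + 1:]


def toy_predictor(sequence: str) -> float:
    """Hypothetical stand-in for a trained model: the GC fraction."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def mutation_effect(
    sequence: str,
    position: int,
    ref: str,
    alt: str,
    predict: Callable[[str], float] = toy_predictor,
) -> float:
    """Predicted effect = prediction(alt sequence) - prediction(ref sequence)."""
    return predict(apply_snv(sequence, position, ref, alt)) - predict(sequence)
```

With the toy predictor, `mutation_effect("ACGTACGT", 0, "A", "G")` returns `0.125`, because the GC fraction rises from 4/8 to 5/8. In practice the `predict` argument would be replaced by a trained, cell-type-specific model; the repository wiki describes the actual workflow.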

References

1. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
2. Forrest, A. R. R. et al. A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).
3. Hon, C. C. et al. An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204 (2017).
4. Kristjánsdóttir, K. et al. Population-scale study of eRNA transcription reveals bipartite functional enhancer architecture. Nat. Commun. 11, 5963 (2020).
5. Boyd, M. et al. Characterization of the enhancer and promoter landscape of inflammatory bowel disease from human colon biopsies. Nat. Commun. 9, 1661 (2018).
6. Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
7. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
8. Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 50, 621–629 (2018).
9. Lamparter, D., Marbach, D., Rueedi, R., Kutalik, Z. & Bergmann, S. Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics. PLoS Comput. Biol. 12, 1–20 (2016).
10. Iotchkova, V. et al. GARFIELD classifies disease-relevant genomic features through integration of functional annotations with association signals. Nat. Genet. 51, 343–353 (2019).
11. Terao, C. et al. Ustekinumab as a therapeutic option for Takayasu arteritis: from genetic findings to clinical application. Scand. J. Rheumatol. 45, 80–82 (2016).
12. Ardlie, K. G. et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).
13. Ishigaki, K. et al. Polygenic burdens on cell-specific pathways underlie the risk of rheumatoid arthritis. Nat. Genet. 49, 1120–1125 (2017).
14. Hirabayashi, S. et al. NET-CAGE characterizes the dynamics and topology of human transcribed cis-regulatory elements. Nat. Genet. 51, 1369–1379 (2019).
15. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
16. Hoffman, G. E., Bendl, J., Girdhar, K., Schadt, E. E. & Roussos, P. Functional interpretation of genetic variants using deep learning predicts impact on chromatin accessibility and histone modification. Nucleic Acids Res. 47, 10597–10611 (2019).
17. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. https://doi.org/10.1038/s41588-018-0160-6 (2018).
18. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
19. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (General Chairs: Krishnapuram, B. & Shah, M.; Program Chairs: Smola, A., Aggarwal, C., Shen, D., & Rastogi, R.) vols 13-17-August-2016 785–794 (Association for Computing Machinery, 2016).
20. Bühlmann, P. Boosting for high-dimensional linear models. Ann. Stat. 34, 559–583 (2006).
21. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).
22. Garieri, M. et al. The effect of genetic variation on promoter usage and enhancer activity. Nat. Commun. 8, 1358 (2017).
23. Kumasaka, N., Knights, A. J. & Gaffney, D. J. High-resolution genetic mapping of putative causal interactions between regions of open chromatin. Nat. Genet. 51, 128–137 (2019).
24. Mattioli, K. et al. High-throughput functional analysis of lncRNA core promoters elucidates rules governing tissue specificity. Genome Res. 29, 344–355 (2019).
25. van Arensbergen, J. et al. High-throughput identification of human SNPs affecting regulatory element activity. Nat. Genet. 51, 1160–1169 (2019).
26. Tewhey, R. et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 165, 1519–1529 (2016).
27. Akiyama, M. et al. Genome-wide association study identifies 112 new loci for body mass index in the Japanese population. Nat. Genet. 49, 1458–1467 (2017).
28. Kanai, M. et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat. Genet. 50, 390–400 (2018).
29. Ishigaki, K. et al. Large-scale genome-wide association study in a Japanese population identifies novel susceptibility loci across different diseases. Nat. Genet. https://doi.org/10.1038/s41588-020-0640-3 (2020).
30. Taft, R. J., Pang, K. C., Mercer, T. R., Dinger, M. & Mattick, J. S. Non-coding RNAs: regulators of disease. J. Pathol. 220, 126–139 (2010).
31. Huang, H. et al. Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature 547, 173–178 (2017).
32. Demenais, F. et al. Multiancestry association study identifies new asthma risk loci that colocalize with immune-cell enhancer marks. Nat. Genet. 50, 42–50 (2018).
33. Turner, A. W. et al. Functional analysis of a novel genome-wide association study signal in SMAD3 that confers protection from coronary artery disease. Arterioscler. Thromb. Vasc. Biol. 36, 972–983 (2016).
34. Wéra, O., Lancellotti, P. & Oury, C. The dual role of neutrophils in inflammatory bowel diseases. J. Clin. Med. 5, 118 (2016).
35. Fahy, J. V. Eosinophilic and neutrophilic inflammation in asthma insights from clinical studies. Proc. Am. Thorac. Soc. 6, 256–259 (2009).
36. Yadav, P. K., Chen, C. & Liu, Z. Potential role of NK cells in the pathogenesis of inflammatory bowel disease. J. Biomed. Biotechnol. 2011, 348530 (2011).
37. Farh, K. K. H. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015).
38. Miller, C. L. et al. Integrative functional genomics identifies regulatory mechanisms at coronary artery disease loci. Nat. Commun. 7, 12092 (2016).
39. Nasser, J. et al. Genome-wide enhancer maps link risk variants to disease genes. Nature 17, 1–6 (2021).
40. Pomerantz, M. M. et al. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat. Genet. 41, 882–884 (2009).
41. Tuupanen, S. et al. The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nat. Genet. 41, 885–890 (2009).
42. Morris, J. A. et al. An atlas of genetic influences on osteoporosis in humans and mice. Nat. Genet. 51, 258–266 (2019).
43. Hait, T. A., Amar, D., Shamir, R. & Elkon, R. FOCS: a novel method for analyzing enhancer and gene activity patterns infers an extensive enhancer–promoter map. Genome Biol. 19, 56 (2018).
44. Chen, J. et al. WNT7B promotes bone formation in part through mTORC1. PLoS Genet. 10, e1004145 (2014).
45. Song, D. et al. Inducible expression of Wnt7b promotes bone formation in aged mice and enhances fracture healing. Bone Res. 8, 4 (2020).
46. Busse, W. W. et al. Daclizumab improves asthma control in patients with moderate to severe persistent asthma: a randomized, controlled trial. Am. J. Respir. Crit. Care Med. 178, 1002–1008 (2008).
47. Tanaka, N. et al. Eight novel susceptibility loci and putative causal variants in atopic dermatitis. J. Allergy Clin. Immunol. 148, 1293–1306 (2021).
48. Hikino, K. et al. Susceptibility loci and polygenic architecture highlight population specific and common genetic features in inguinal hernias: genetics in inguinal hernias. eBioMedicine 70, 103532 (2021).
49. Kim, T.-K. et al. Widespread transcription at neuronal activity-regulated enhancers. Nature 465, 182–187 (2010).
50. Murakawa, Y. et al. Enhanced identification of transcriptional enhancers provides mechanistic insights into diseases. Trends Genet. 32, 76–88 (2016).
51. Saunders, G. et al. Leveraging European infrastructures to access 1 million human genomes by 2022. Nat. Rev. Genet. 20, 693–701 (2019).
52. Niculescu-Mizil, A. & Caruana, R. A. Obtaining calibrated probabilities from boosting. Preprint at arXiv:1207.1403 (2012).
53. Niculescu-Mizil, A. & Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning–ICML (General Chair: Dzeroski, S.; Program Chairs: Raedt, L. D. & Wrobeleds, S.) 625–632 (ACM Press, 2005).
54. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
55. Aguet, F. et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
56. Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015).
57. Franke, A. et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat. Genet. 42, 1118–1125 (2010).

Acknowledgements

We thank FANTOM consortium members for providing datasets and valuable discussions. Computational resources of AI Bridging Cloud Infrastructure (ABCI) provided by the National Institute of Advanced Industrial Science and Technology (AIST) were used for in silico mutagenesis. This work was supported in part by JSPS KAKENHI (grant number 20K15773, to M.K.), JP20H00462, the JCR Grant for Promoting Basic Rheumatology, and AMED (under grant numbers JP21kk0305013, JP21tm0424220 and JP21ck0106642, to C.T.).

Author information

Authors and Affiliations

Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Masaru Koido, Kazuyoshi Ishigaki, Yoichiro Kamatani & Chikashi Terao
Division of Molecular Pathology, Department of Cancer Biology, Institute of Medical Science, The University of Tokyo, Tokyo, Japan
Masaru Koido
Laboratory of Complex Trait Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
Masaru Koido & Yoichiro Kamatani
Laboratory for Genome Information Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Chung-Chau Hon
Laboratory for Cardiovascular Genomics and Informatics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Satoshi Koyama & Kaoru Ito
Preventive Medicine and Applied Genomics Unit, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Hideya Kawaji
Research Center for Genome & Medical Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
Hideya Kawaji
RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Yasuhiro Murakawa
IFOM ETS - The AIRC Institute of Molecular Oncology, Milan, Italy
Yasuhiro Murakawa
Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
Yasuhiro Murakawa
Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Kazuyoshi Ishigaki
Center for Data Sciences, Harvard Medical School, Boston, MA, USA
Kazuyoshi Ishigaki
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Kazuyoshi Ishigaki
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Aomi, Koto-ku, Tokyo, Japan
Jun Sese
Humanome Lab Inc., Tokyo, Japan
Jun Sese
Genome Immunobiology RIKEN Hakubi Research Team, RIKEN Cluster for Pioneering Research and RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Nicholas F. Parrish
Laboratory for Transcriptome Technology, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Piero Carninci
Laboratory for Single Cell Technologies, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Piero Carninci
Human Technopole, Milan, Italy
Piero Carninci
Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
Chikashi Terao
The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan
Chikashi Terao

Contributions

M.K., C.-C.H., Y.K. and C.T. conceived the study. M.K. conducted analysis with the help of C.-C.H., S.K., K. Ishigaki, K. Ito and P.C. C.-C.H. analysed CAGE transcriptome data. H.K. and Y.M. analysed NET-CAGE transcriptome data. M.K. and C.T. wrote the manuscript, and N.F.P. provided critical comments and valuable edits. J.S. contributed to providing graphics processing unit computational resources necessary for the study. P.C. and C.T. supervised the study.

Corresponding author

Correspondence to Chikashi Terao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Colin Campbell, William Ritchie and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Details about MENTR ML.

(a) Comparison of the datasets required for MENTR ML models and for a conventional eQTL study. MENTR ML models require only CAGE transcriptome data, and existing large-scale CAGE transcriptomes, such as the FANTOM5 datasets, can be used. An eQTL study requires transcriptome data for the tissue and genotypes from the same individuals to estimate the mutation effect on each transcript (β_eQTL) in that tissue. (b) Workflow of MENTR ML training and evaluation. See the Methods section for details.

Extended Data Fig. 2 Accurate prediction of ncRNA expression by combining MENTR ML models with CAGE transcriptome.

(a) Schematic of the comparison of accuracies among the MENTR, MENTR_linear and ExPecto methods. (b) Prediction accuracies of the ExPecto method¹⁷ on GTEx RNA-sequencing datasets (re-analysis of predictive accuracies across 218 tissue types) and FANTOM5 CAGE transcriptome datasets (347 sample ontologies). The box plots (N = 218 for GTEx and N = 347 for FANTOM5) show the first and third quartiles; the center line represents the median, the upper whisker extends from the hinge to the highest value within 1.5 × IQR (inter-quartile range) of the hinge, the lower whisker extends from the hinge to the lowest value within 1.5 × IQR of the hinge, and data beyond the ends of the whiskers are plotted as points. (c) Prediction accuracies of the indicated methods (x-axis) on lncRNAs and mRNAs in the FANTOM5 CAGE transcriptome datasets (n = 347). Spearman’s ρ and AUROC values are compared by violin plots, with mean values shown as dots. P-values were calculated by two-sided Wilcoxon signed-rank test. (d) Comparison of Spearman’s ρ for each model (n = 347) trained by the MENTR or ExPecto method.
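The box-plot convention described in this legend can be made concrete with a short sketch that computes the same five quantities from a list of values. The quartile interpolation rule is an assumption (the "inclusive" method of Python's `statistics.quantiles`); the plotting software used for the figure may compute hinges slightly differently. The function name and the example values are hypothetical.

```python
import statistics
from typing import Dict, List, Union


def box_plot_stats(values: List[float]) -> Dict[str, Union[float, List[float]]]:
    """Quartiles, 1.5 x IQR whiskers and outlying points for one box."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions shift the hinges.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }
```

For the made-up values `[1, 2, 3, 4, 5, 6, 7, 8, 100]`, the quartiles are 3, 5 and 7, the IQR is 4, the whiskers end at 1 and 8, and 100 is plotted as a separate point because it lies above 7 + 1.5 × 4 = 13.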

Extended Data Fig. 3 Prediction of lowly expressed, cell-type specific enhancers by MENTR.

(a) Mean expression levels for each FANTOM5 sample ontology (n = 347) in the testing dataset. (b) Distribution of the Shannon entropy-based cell-type specificity score¹⁴ in the testing dataset of the sample ontologies. A score of 0 indicates ubiquitous expression and a score of 1 indicates cell-type-specific expression. (c) Prediction accuracies of MENTR models for enhancer RNAs, lncRNAs and mRNAs, stratified by cell-type specificity score. (d) Prediction accuracies of MENTR models for enhancer RNAs, lncRNAs and mRNAs, stratified by quartile of expression level. The quartile bins were determined from all transcript types for each sample ontology. Mean values are shown as dots in the violin plots in (a), (c) and (d).
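One common way to build an entropy-based specificity score with the stated range (0 for ubiquitous, 1 for cell-type-specific expression) is to normalize the Shannon entropy of the expression profile by its maximum and subtract the result from 1. The sketch below follows that construction as an assumption; the exact definition used in ref. 14 is not given here and may differ in detail (for example, in pseudocounts or normalization).

```python
import math
from typing import Sequence


def specificity_score(expression: Sequence[float]) -> float:
    """1 - H(p) / log2(N), where p is the profile scaled to sum to 1.

    Returns 0 for a perfectly uniform profile and 1 when all expression
    is confined to a single sample.
    """
    n = len(expression)
    total = sum(expression)
    if n < 2 or total <= 0 or any(x < 0 for x in expression):
        raise ValueError("need at least two non-negative values with a positive sum")
    entropy = 0.0
    for x in expression:
        if x > 0:  # terms with p = 0 contribute nothing to the entropy
            p = x / total
            entropy -= p * math.log2(p)
    return 1.0 - entropy / math.log2(n)
```

A transcript expressed equally in four samples scores 0, whereas one detected in only one of four samples scores 1. Intermediate profiles fall in between, which is what allows accuracies to be stratified by score as in panel (c).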

Extended Data Fig. 4 Differences of prediction accuracies between MENTR and MENTR_reg.

Comparison of Spearman’s ρ values between MENTR and MENTR_reg, shown as a scatter plot (upper) and a histogram of the differences (lower) for each transcript type. P-values were calculated by two-sided Wilcoxon signed-rank test.

Extended Data Fig. 5 Evaluation of prediction accuracies for CAGE and NET-CAGE transcript.

X-axis, transcript type; y-axis, AUROC. ‘Train’ indicates which datasets were used for training, shown by the plot color. Each ‘Evaluation’ and ‘Train’ pair indicates which type of CAGE dataset was evaluated by the indicated model. (a) Evaluation of accuracy for CAGE transcripts predicted by MENTR ML models trained on CAGE transcripts, and for NET-CAGE transcripts predicted by MENTR ML models trained on NET-CAGE transcripts. (b) Evaluation of accuracy for CAGE transcripts predicted by MENTR ML models trained on CAGE transcripts and on NET-CAGE transcripts. (c) Evaluation of accuracy for NET-CAGE transcripts predicted by MENTR ML models trained on CAGE transcripts and on NET-CAGE transcripts.
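AUROC, the y-axis of this figure, can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. It is a quadratic-time illustration rather than an optimized routine, and how transcripts are labelled as positive (for example, expressed versus not expressed) is an assumption that depends on the Methods.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 (positive) or 0 (negative); `scores` holds the
    model outputs in the same order.
    """
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))
```

For labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.5, 0.1]`, three of the four positive-negative pairs are ordered correctly, so the AUROC is 0.75. A value of 0.5 corresponds to chance-level ranking.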

Extended Data Fig. 6 Feature importance of MENTR.

(a) Comparison of quantile-normalized SHAP values of input features from lncRNA or enhancer RNA (y-axis) and mRNA (x-axis). We compared them by assay type (upper boxes; DNase, Histone and TF) and by aggregation weight in MENTR (right boxes; 0.01, 0.02, 0.05, 0.1 and 0.2; see Methods). (b) Top five distinct TF features that were more important in enhancer predictions than in mRNA predictions (weight = 0.01).

Extended Data Fig. 7 Schematic comparison of accuracies between MENTR and Basenji.

Basenji pre-trained models were obtained from ref. 18.

Extended Data Fig. 8 Cell-type dependencies for Spearman’s ρ between MENTR mutation effects and β_QTL.

Y-axis, Spearman’s ρ between MENTR mutation effects and effect sizes from the QTL studies (β_QTL; see Fig. 4) for each model (each dot represents one of the 347 FANTOM5 sample ontologies or LCL, and the dashed line indicates the ρ value for LCL); x-axis and color key, transcriptome correlation between the LCL CAGE transcriptome and each of the 347 FANTOM5 CAGE transcriptomes. Variants with a mutation effect of 0 were excluded from this analysis. P-values were calculated by two-sided Spearman’s rank correlation test. p < 2.2×10⁻¹⁶ indicates that the p-value was lower than the default machine epsilon (2.2×10⁻¹⁶).
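The correlation in this figure can be sketched as follows: drop variants whose predicted mutation effect is exactly 0, rank the remaining predicted effects and QTL effect sizes (ties receive their average rank), and take the Pearson correlation of the ranks. The code uses only the standard library and does not compute the p-value; all function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_nonzero(predicted: Sequence[float], beta_qtl: Sequence[float]) -> float:
    """Spearman's rho after excluding variants with a predicted effect of 0."""
    pairs = [(p, b) for p, b in zip(predicted, beta_qtl) if p != 0]
    if len(pairs) < 3:
        raise ValueError("too few variants with a non-zero predicted effect")
    kept_pred, kept_beta = zip(*pairs)
    return pearson(average_ranks(kept_pred), average_ranks(kept_beta))
```

Excluding zero-effect variants before ranking matters because a large block of tied zeros would otherwise dominate the ranks. Repeating the calculation once per cell-type model gives one ρ value per dot in the figure.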

Extended Data Fig. 9 Gene-level verification of MENTR in silico mutation effects for various types of tissues.

(a) Workflow for calculating gene-level mutation effects (Δy). Δy values were calculated from promoter-level mutation effects (Δy_p) after filtering by the baseline, permissive and robust thresholds. (b, c) Concordance rate (y-axis) between the directions of Δy and the eQTL effect size from GTEx v7 at the indicated threshold of absolute Δy. Variants within ±1 kb on autosomes and chromosome X were tested in (b), and variants within ±100 kb on chromosome 8 were tested in (c). The concordance rates of 26 tissues (Supplementary Table 8) are shown by violin plots, with mean values shown as dots. P-values were calculated by two-sided Wilcoxon signed-rank test.
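The concordance rate in panels (b) and (c) can be sketched as the fraction of variants whose predicted Δy has the same sign as the eQTL effect size, among variants whose absolute Δy passes a threshold. The sketch below makes two assumptions that the legend does not settle: the threshold comparison is inclusive (|Δy| ≥ threshold), and variants with a Δy or eQTL effect of exactly 0 are dropped because they have no direction. The function name and example numbers are hypothetical.

```python
from typing import Optional, Sequence


def concordance_rate(
    delta_y: Sequence[float],
    beta_eqtl: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Fraction of variants with sign(delta_y) == sign(beta_eqtl).

    Only variants with |delta_y| >= threshold are considered. Returns
    None when no variant passes the filter.
    """
    kept = [
        (d, b)
        for d, b in zip(delta_y, beta_eqtl)
        if abs(d) >= threshold and d != 0 and b != 0
    ]
    if not kept:
        return None
    agree = sum(1 for d, b in kept if (d > 0) == (b > 0))
    return agree / len(kept)
```

With Δy values `[0.5, -0.3, 0.05, -0.8]` and eQTL effects `[1.2, 0.4, -0.1, -0.6]`, a threshold of 0.1 keeps three variants, two of which agree in direction, giving 2/3. Raising the threshold to 0.6 keeps only the last variant and gives 1.0, which illustrates why the rate is reported as a function of the threshold.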

Extended Data Fig. 10 PIP distribution of ten UK biobank complex diseases, stratified by MENTR predictions.

Distribution of PIP (posterior inclusion probability) for ten complex diseases, stratified by MENTR robust prediction, permissive prediction (but not robust prediction; shown as “only permissive”) or others. Robust and permissive predictions are defined in the Results section. Fibroblastic_Disorders, Fibroblastic disorders; CRC, Colorectal cancer; Glaucoma_Combined, Glaucoma (Phecode + Self-reported); PrC, Prostate cancer; BrC, Breast cancer; AID_Combined, Autoimmune disease (Phecode + Self-reported). These abbreviations and non-coding credible sets (see Methods) were obtained from Supplementary Table 7 in Nasser et al.³⁹, and their PIP values were obtained from https://www.finucanelab.org/data (release 1.1). The box plots show the first and third quartiles; the center line represents the median, the upper whisker extends from the hinge to the highest value within 1.5 × IQR (inter-quartile range) of the hinge, the lower whisker extends from the hinge to the lowest value within 1.5 × IQR of the hinge, and data beyond the ends of the whiskers are plotted as points. See Supplementary Table 24 for the N of each group.

Supplementary information

Main Supplementary Information

Supplementary figures, discussion and references.

Reporting Summary

Peer Review File

Supplementary Datasets

Supplementary Tables 1–24.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Koido, M., Hon, CC., Koyama, S. et al. Prediction of the cell-type-specific transcription of non-coding RNAs from genome sequences via machine learning. Nat. Biomed. Eng 7, 830–844 (2023). https://doi.org/10.1038/s41551-022-00961-8

Received: 25 March 2021
Accepted: 12 October 2022
Published: 21 November 2022
Issue Date: June 2023
DOI: https://doi.org/10.1038/s41551-022-00961-8
