  • Brief Communication

Integrated transcriptomic–genomic tool Texomer profiles cancer tissues

Abstract

Profiling of both the genome and the transcriptome promises a comprehensive, functional readout of a tissue sample, yet analytical approaches are required to translate the increased data dimensionality, heterogeneity and complexity into patient benefits. We developed a statistical approach called Texomer (https://github.com/KChen-lab/Texomer) that performs allele-specific, tumor-deconvoluted transcriptome–exome integration of autologous bulk whole-exome and transcriptome sequencing data. Texomer results in substantially improved accuracy in sample categorization and functional variant prioritization.


Fig. 1: Texomer improved DNA–RNA joint TCGA BRCA sample categorization.
Fig. 2: Application of Texomer for functional variant characterization.

Data availability

We downloaded the bulk WES and WTS data of BRCA (n = 833) and SKCM samples (n = 465) from TCGA (dbGAP accession phs000178.v9.p8). We downloaded the single-cell RNA-seq data and matched bulk WES and WTS data of 11 breast cancer samples from NCBI under accessions GSE75688 and SRP067248. We downloaded the WES and WTS data of the breast cancer cell line HCC1143 and the matched normal cell line HCC1143BL from the Cancer Cell Line Encyclopedia (CCLE) project of the Genomic Data Commons (GDC) Data Portal.

Code availability

Texomer is available on GitHub at https://github.com/KChen-lab/Texomer.

References

  1. Yohe, S. & Thyagarajan, B. Review of clinical next-generation sequencing. Arch. Pathol. Lab. Med. 141, 1544–1557 (2017).

  2. Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).

  3. McGranahan, N. & Swanton, C. Clonal heterogeneity and tumor evolution: past, present, and the future. Cell 168, 613–628 (2017).

  4. Huang, S., Chaudhary, K. & Garmire, L. X. More is better: recent progress in multi-omics data integration methods. Front. Genet. 8, 84 (2017).

  5. Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biol. 18, 83 (2017).

  6. Yadav, V. K. & De, S. An assessment of computational methods for estimating purity and clonality using genomic data derived from heterogeneous tumor tissue samples. Brief. Bioinform. 16, 232–241 (2015).

  7. Mohanty, V., Akmamedova, O. & Komurov, K. Selective DNA methylation in cancers controls collateral damage induced by large structural variations. Oncotarget 8, 71385–71392 (2017).

  8. Weischenfeldt, J. et al. Pan-cancer analysis of somatic copy-number alterations implicates IRS4 and IGF2 in enhancer hijacking. Nat. Genet. 49, 65–74 (2017).

  9. Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl Acad. Sci. USA 107, 16910–16915 (2010).

  10. Ha, G. et al. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Res. 24, 1881–1893 (2014).

  11. Favero, F. et al. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann. Oncol. 26, 64–70 (2015).

  12. Shen, R. & Seshan, V. E. FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic Acids Res. 44, e131 (2016).

  13. Hutter, C. & Zenklusen, J. C. The Cancer Genome Atlas: creating lasting value beyond its data. Cell 173, 283–285 (2018).

  14. Searle, S. R., Casella, G. & McCulloch, C. E. Variance Components. (Wiley, New York, 1992).

  15. Dogruluk, T. et al. Identification of variant-specific functions of PIK3CA by rapid phenotyping of rare mutations. Cancer Res. 75, 5341–5354 (2015).

  16. Tang, H. & Thomas, P. D. Tools for predicting the functional impact of nonsynonymous genetic variation. Genetics 203, 635–647 (2016).

  17. Chakravarty, D. et al. OncoKB: A precision oncology knowledge base. JCO Precis. Oncol. http://ascopubs.org/doi/full/10.1200/PO.17.00011 (2017).

  18. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  19. Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD-96 Proceedings 226–231 (AAAI Press, 1996).

  20. Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).

  21. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).

  22. Carithers, L. J. & Moore, H. M. The Genotype-Tissue Expression (GTEx) Project. Biopreserv. Biobank. 13, 307–308 (2015).

  23. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

Acknowledgements

This work was supported in part by the NIH (R01CA172652 to K.C., U01CA217842 to G.B.M., U24CA211006 to L.D., U24CA210950 to R.A.), the CPRIT (RP180248 to K.C.), the MD Anderson Cancer Center Sheikh Khalifa Ben Zayed Al Nahyan Institute of Personalized Cancer Therapy grant and an NCI Cancer Center Support Grant (P30 CA016672 to P.P.). We also thank Y. Chen, T. Hart, B. Lim, G. Lozano, S. Xiong, L. Wang and X. Song for insightful discussions, and X. Zheng for data curation.

Author information

Contributions

F.W. and K.C. designed the experiments. F.W. and S.Z. conceived the statistical model. F.W. developed the software. T.B.K., Z.W., V.M., M.C.W. and L.D. contributed to evaluation. Y.L. and R.I. contributed to the software release. K.S. and J.A.K. contributed to pathology review, and F.M., J.N.W. and G.B.M. contributed to biological interpretation. K.C. and F.W. wrote the manuscript, with input from all other authors. K.C. supervised the work.

Corresponding author

Correspondence to Ken Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Illustration of Texomer deconvolution steps.

Texomer iteratively estimates ASCN (gray and green boxes), tumor purity (αD), ITH scores from the bulk WES data (steps 1–4), and tumor purity (αR) and ASELs from the bulk tumor WTS data (step 5). It then probabilistically classifies variants into three categories (step 6): ASCN-concordant (ASEL ≈ ASCN), ASCN-discordant high (ASEL > ASCN) or low (ASEL < ASCN).
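
The three categories in step 6 can be illustrated with a minimal sketch. Texomer assigns them probabilistically; the fixed tolerance `tol` and the function name `classify_variant` below are assumptions made for illustration and are not the tool's model or API.

```python
def classify_variant(ascn: float, asel: float, tol: float = 0.5) -> str:
    """Label a variant allele by comparing its allele-specific expression
    level (ASEL) with its allele-specific copy number (ASCN).

    `tol` is a hypothetical tolerance standing in for Texomer's
    probabilistic classification, which is not reproduced here.
    """
    diff = asel - ascn
    if abs(diff) <= tol:
        return "ASCN-concordant"
    return "ASCN-discordant high" if diff > 0 else "ASCN-discordant low"


# Toy (ASCN, ASEL) pairs, not real data.
examples = [(2.0, 2.2), (1.0, 3.0), (2.0, 0.4)]
labels = [classify_variant(cn, el) for cn, el in examples]
```

A hard threshold makes the three outcomes easy to see, but it ignores the uncertainty in both estimates, which is presumably why the published method classifies probabilistically.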

Supplementary Figure 2 Quantification of ITH.

a, Definition of ITH scores: the difference (purple) between the probability density of the ASCNs estimated from somatic variants and that from germline variants. b, Example TCGA sample of a low ITH score. c, Estimated ASCNs of the germline SNP alleles (red and green boxes) and somatic SNV alleles (blue dots) of the sample in b. Plotted as boxes are integerized ASCN segments of the major (red) and minor (green) alleles. d, Example TCGA sample of a high ITH score. e, Estimated ASCNs of the sample in d.
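
Panel a defines the ITH score as a difference between two probability densities. The exact formula is not given in this caption, so the sketch below substitutes one common way to measure such a difference: the total variation distance between two binned empirical distributions. The bin width, the choice of distance and the function name `density_difference` are all assumptions.

```python
from collections import Counter


def density_difference(somatic, germline, bin_width=0.25):
    """Total variation distance between two binned empirical distributions.

    Returns a value in [0, 1]: 0 when the binned distributions coincide,
    1 when they share no bin. This is an illustrative stand-in, not the
    ITH score formula used by Texomer.
    """
    def binned(values):
        counts = Counter(int(v // bin_width) for v in values)
        total = len(values)
        return {b: c / total for b, c in counts.items()}

    p, q = binned(somatic), binned(germline)
    bins = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bins)


# Toy ASCN estimates, not real data.
germline_ascn = [1.0, 1.0, 2.0, 2.0]
clonal_somatic = [1.0, 1.0, 2.0, 2.0]      # follows the germline segments
subclonal_somatic = [0.3, 0.4, 0.6, 0.6]   # falls between integer states

low = density_difference(clonal_somatic, germline_ascn)
high = density_difference(subclonal_somatic, germline_ascn)
```

In this toy case the somatic ASCNs that track the germline segments give a score of 0, and the fractional ones give a score of 1, mirroring the low and high examples in panels b–e.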

Supplementary Figure 3 The association between ASCNs and ASELs.

a–d, The association between ASCNs and ASELs of germline SNPs (a) and somatic SNVs (b) in BRCA, germline SNPs (c) and somatic SNVs (d) in SKCM. N is the number of variant alleles, and r is the Spearman’s rank correlation coefficient. The black horizontal line in the box is the median; the top and bottom horizontal lines denote 95% confidence interval of the median; the top and bottom of box are the 25th and 75th percentiles for the data.
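
For readers unfamiliar with the statistic r reported in these panels, the following is a minimal standard-library sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks. The helper names and the toy values are illustrative assumptions and do not come from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values, not real data: ASEL rises monotonically with ASCN.
ascn = [0, 1, 1, 2, 3, 4]
asel = [0.1, 0.9, 1.4, 2.5, 2.6, 7.0]
r = spearman(ascn, asel)
```

Because integer copy numbers produce many ties, the average-rank treatment matters here; a rank-based measure also avoids assuming that expression scales linearly with copy number.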

Supplementary Figure 4 Illustration of DACRE.

Transcription factors bind to promoters of genes and start mRNA transcription. Variants in cis-regulatory regions or gene bodies can cause differential allelic regulation effects, which further lead to variation of cell phenotypes, including generation of malignant cells. Samples 1 and 2 illustrate different cis-regulatory mechanisms. a, In sample 1, a DNA variant (black triangle) in the promoter of gene A prevents binding of the transcription factor to the variant allele, causing the variant allele to have lower expression. b, In sample 2, allele-specific methylation (black lollipop) decreases the expression level of the wild-type allele. c, In sample 3, a DNA variant in the transcription factor (TF) reduces the binding affinity of the TF to the promoters of gene C, causing both alleles to have lower expression. d, In sample 4, a DNA variant in gene D directly leads to loss of expression of the variant allele.

Supplementary Figure 5 Accuracy of Texomer transformation.

a, Comparison of inferred tumor purity values versus simulated ground truth for five methods: Texomer, ASCAT, TITAN, sequenza and FACETS. b, Average absolute errors between simulated and inferred tumor purity by the five methods. c, Average absolute errors between simulated and inferred DNA copy number by the five methods, as well as average absolute errors between simulated and inferred RNA expression levels by Texomer. d, Pearson’s correlation coefficients between the Texomer transformed tumor ASELs and single-cell gene expression levels in 11 breast cancer tissues (BC01–BC11). e, Pearson’s correlation coefficients between the bulk tumor WTS read counts and single-cell gene expression levels, in contrast to d. P value determined by one-tailed t test (n = 11). The black horizontal line in the box of d and e is the median; the top and bottom horizontal lines denote 95% confidence interval of the median; the top and bottom of box are the 25th and 75th percentiles for the data.

Supplementary Figure 6 Evaluation of tumor purity and ITH scores on TCGA BRCA and SKCM data.

a, Pearson’s correlation coefficients between tumor purities estimated by Texomer, ASCAT, sequenza, TITAN and FACETS, and tumor purities estimated by ABSOLUTE, ESTIMATE, DNA methylation and pathologist review on TCGA BRCA (n = 832) and SKCM data (n = 465). b,c, Survival analysis of patients with BRCA luminal B subtype (b) and SKCM (c) based on heterogeneity score. ‘High,’ ITH score ≥ 0.5; ‘Low,’ ITH score < 0.5. P value determined by two-sided log-rank test. d,e, PD-L1 expression levels with respect to heterogeneity score in TCGA BRCA (d) and SKCM (e) patients. P value determined by two-tailed Mann–Whitney test (BRCA: n = 832; SKCM: n = 465). The black horizontal line in the box is the median; the top and bottom horizontal lines denote 95% confidence interval of the median; the top and bottom of box are the 25th and 75th percentiles for the data.

Supplementary Figure 7 Difference in the ARI values between Texomer-transformed clusters and bulk clusters.

a–d, The number of clusters is specified independently in the DNA (y-axis) and the RNA (x-axis) data for the (a) DBSCAN, (b) k-means, (c) hierarchical clustering, and (d) mclust algorithms. In almost all cases, the difference is positive (warm colors).
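
The quantity plotted in each cell can be sketched as follows, assuming that each ARI measures agreement between a DNA-based and an RNA-based clustering of the same samples and that the difference is taken as transformed minus bulk. The function below is a standard-library implementation of the adjusted Rand index; the cluster labels are toy values, not TCGA results.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same samples."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: both clusterings are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy cluster labels for six samples, not real data.
dna_clusters = [0, 0, 0, 1, 1, 1]
rna_transformed = [0, 0, 0, 1, 1, 1]  # hypothetical clusters after transformation
rna_bulk = [0, 0, 1, 1, 1, 0]         # hypothetical clusters from bulk data

ari_transformed = adjusted_rand_index(dna_clusters, rna_transformed)
ari_bulk = adjusted_rand_index(dna_clusters, rna_bulk)
difference = ari_transformed - ari_bulk  # positive when the transformed data agree better
```

The ARI corrects the raw pair-counting agreement for chance, so values can be compared across cells of the heat map even though the numbers of DNA and RNA clusters differ.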

Supplementary Figure 8 Characteristics of clusters from TT-RNA profiles.

a, Most clusters express cluster-specific KEGG pathways based on GSEA (P < 0.01, n = 570 genes). b,c, The association between clusters from TT-RNA profile and PAM50 subtype (b) and the association between clusters from bulk RNA profile and PAM50 subtypes (c). Log(ratio) on the y-axis represents the log2 transformation of the subtype enrichment ratios. Dashed line corresponds to ratio = 1.5. d–f, Progression-free survival (PFS) analysis based on eight Texomer-derived clusters, PAM50 subtypes and four clusters derived from bulk RNA profile, respectively. P value determined by two-sided log-rank test.

Supplementary Figure 9 Association between the DNA and the RNA profiles.

a–d, Plotted in three rows are the Pearson’s correlation coefficients between bulk DNA count and bulk RNA count, between Texomer-transformed tumor DNA (TT-DNA) profile and bulk RNA count, and between TT-DNA and TT-RNA profile at the germline SNP and somatic SNV sites, for allele-agnostic analysis of the BRCA samples (a) and the SKCM samples (c) or allele-specific analysis in the BRCA samples (b) and the SKCM samples (d). The black horizontal line in the box is the median; the top and bottom horizontal lines denote 95% confidence interval of the median; the top and bottom of box are the 25th and 75th percentiles for the data. e,f, The proportions of variance for the same three pairs of sources are plotted for allele-agnostic (e) and allele-specific analysis (f) in the SKCM samples. P value determined by one-tailed t test (BRCA: n = 832; SKCM: n = 465).

Supplementary Figure 10 Variance component analysis of RNA profile explained by DNA profile.

a,b, The variance of the bulk WTS read counts explained by the bulk WES read counts (Bulk RNA: Bulk DNA), variance of the bulk WTS read counts explained by the TT-DNA copy-number profile (Bulk RNA: TT-DNA), and variance of the TT-RNA expression profile explained by the TT-DNA copy-number profile (TT-RNA: TT-DNA) derived from the total and the allele-specific level using breast cancer cell line HCC1143 (a) and in silico simulation data (b). ‘Expected’ represents the proportion of variance estimated from the isogenic tumor cell line (HCC1143) data or specified in simulation.

Supplementary Figure 11 The correlation between experimentally assessed activity levels and the scores computed by a set of predictors.

a, Pearson’s correlation coefficients (PCC). b, Spearman’s rank correlation coefficients (SRCC; n = 13).

Supplementary Figure 12 Characterization of variants with negative DACRE scores.

a,b, Average DNA methylation levels of genes with negative (black arrows) and positive (red arrows) DACRE scores in 832 BRCA (a) and 465 SKCM (b) samples, compared with a background distribution (gray histogram) constructed from randomly selected gene sets. c, The DACRE, ASCNs and ASELs of the six TCGA SKCM samples containing heterozygous TERT exonic variants. d, DNA methylation levels in the TERT promoter of the TCGA samples. e, RNA expression levels of TCGA SKCM samples (n = 465). TCGA-EE-A2M5 and TCGA-FR-A726 are labeled with blue and red asterisks, respectively. The black horizontal line in the box is the median; the top and bottom horizontal lines denote 95% confidence interval of the median; the top and bottom of box are the 25th and 75th percentiles for the data.

Supplementary Figure 13 Somatic SNVs with positive DACRE scores in TSGs.

a, Heat map characterizing the DACRE, ASELs of the variant alleles (ASEL_variant) and the wild-type alleles (ASEL_wildtype) for each somatic mutation detected in TP53 in the BRCA samples. Values plotted are summed over carrier samples. High-frequency mutations are more likely to have more extreme scores. Mutations labeled starting with ‘chr’ are noncoding mutations. GATA3:p.M294K gain of function predicted by DACRE-scan. b, Three samples harboring GATA3:p.M294K. The ASCNs (left panel) and the ASELs (right panel) estimated for the wild-type and variant alleles in the BRCA samples carrying the mutation. c, Deconvoluted total tumor expression levels (TELs) and the bulk expression levels (FPKMs) among the BRCA samples harboring germline or somatic variants in GATA3 (n = 291). Asterisks (*) indicate the three samples carrying p.M294K. The black horizontal line in the box is the median; the top and bottom horizontal lines denote 95% confidence interval of the median; the top and bottom of box are the 25th and 75th percentiles for the data. d, Breakdown of DACRE scores of CDH1 somatic SNVs in the BRCA samples. e,f, The ASCNs (left panel) and the ASELs (right panel) estimated for the wild-type and the variant alleles in the BRCA samples containing CDH1:p.E243K (e) and CDH1:p.R335* (f).

Supplementary Figure 14 Precisions of predicted missense functional mutations.

a,b, Precisions of predicted missense functional mutations for individual BRCA (a) and SKCM (b) patients in TCGA. The x- and y-axes are the prediction precisions before and after application of DACRE score filtering (>0). Above diagonal lines are samples with improved precision after filtering.
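
The before-and-after comparison on the two axes can be sketched for a single patient as below. The variant identifiers, the DACRE scores and the set of known functional variants are hypothetical placeholders, and the helper name `precision` is an assumption; only the filter `DACRE score > 0` comes from the caption.

```python
def precision(predictions, known_functional):
    """Fraction of predicted variants found in a set of known functional variants."""
    if not predictions:
        return None  # precision is undefined when nothing is predicted
    hits = sum(1 for v in predictions if v in known_functional)
    return hits / len(predictions)


# Hypothetical data for one patient: variant id -> DACRE score.
predicted_functional = {"v1": 0.8, "v2": -0.3, "v3": 0.2, "v4": -1.1}
known_functional = {"v1", "v3"}  # placeholder for a curated knowledge base

before = precision(list(predicted_functional), known_functional)

# Keep only predictions with a positive DACRE score, as in the caption.
filtered = [v for v, score in predicted_functional.items() if score > 0]
after = precision(filtered, known_functional)
```

A patient plotted at (`before`, `after`) lies above the diagonal whenever `after > before`, which is the improvement the figure highlights.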

Supplementary Figure 15 WES/WTS joint characterization of functional somatic mutations in TCGA data.

a,b, Somatic SNVs were first prioritized on the basis of functional impact scores calculated by a set of widely used functional annotators, and then further filtered by the bulk WTS count (>20) or the DACRE score (>0). The pink and blue bars show the fractions of patient samples in 832 BRCA (a) and 465 SKCM (b) patient samples with increased proportions of known functional variants in the OncoKB database after WES/WTS profiling. c–f, Level of accuracy improvement in the BRCA (c,d) and the SKCM samples (e,f) after applying DACRE score > 0 (c,d) and WTS read count > 20 (e,f) filtering. ‘None’ represents no or negative improvement, 0% represents a level of improvement between 0% and 1%, 1% represents a level of improvement between 1% and 10%, etc.

Supplementary information

Supplementary Information

Supplementary Figures 1–15, Supplementary Tables 1–3 and Supplementary Notes 1–5

Reporting Summary

About this article

Cite this article

Wang, F., Zhang, S., Kim, TB. et al. Integrated transcriptomic–genomic tool Texomer profiles cancer tissues. Nat Methods 16, 401–404 (2019). https://doi.org/10.1038/s41592-019-0388-9

  • DOI: https://doi.org/10.1038/s41592-019-0388-9
