Inferring gene expression from cell-free DNA fragmentation profiles

Abstract

Profiling of circulating tumor DNA (ctDNA) in the bloodstream shows promise for noninvasive cancer detection. Chromatin fragmentation features have previously been explored to infer gene expression profiles from cell-free DNA (cfDNA), but current fragmentomic methods require high concentrations of tumor-derived DNA and provide limited resolution. Here we describe promoter fragmentation entropy as an epigenomic cfDNA feature that predicts RNA expression levels at individual genes. We developed ‘epigenetic expression inference from cell-free DNA-sequencing’ (EPIC-seq), a method that uses targeted sequencing of promoters of genes of interest. Profiling 329 blood samples from 201 patients with cancer and 87 healthy adults, we demonstrate classification of subtypes of lung carcinoma and diffuse large B cell lymphoma. Applying EPIC-seq to serial blood samples from patients treated with PD-(L)1 immune-checkpoint inhibitors, we show that gene expression profiles inferred by EPIC-seq are correlated with clinical response. Our results indicate that EPIC-seq could enable noninvasive, high-throughput tissue-of-origin characterization with diagnostic, prognostic and therapeutic potential.
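As a rough illustration of the entropy idea in the abstract, the sketch below computes a plain Shannon entropy over the fragment-length histogram of one promoter window. This is only one plausible reading: the function name, the 100–300 bp window and the toy data are assumptions made for illustration, and any normalization or background correction used by the actual EPIC-seq software is not reproduced here.

```python
import math
from collections import Counter


def fragment_size_entropy(lengths, min_len=100, max_len=300):
    """Shannon entropy (in bits) of the fragment-length histogram.

    Only fragments with min_len <= length <= max_len are counted.
    The window bounds are illustrative assumptions, not the paper's settings.
    """
    kept = [n for n in lengths if min_len <= n <= max_len]
    if not kept:
        raise ValueError("no fragments in the requested length window")
    counts = Counter(kept)
    total = len(kept)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data (hypothetical): lengths piled up near one peak vs. spread out evenly.
narrow = [167] * 90 + [166] * 5 + [168] * 5
broad = list(range(120, 220))  # 100 distinct lengths, one fragment each

print(fragment_size_entropy(narrow))  # low entropy: concentrated distribution
print(fragment_size_entropy(broad))   # high entropy: diverse distribution
```

In this toy setting the evenly spread length distribution yields the higher entropy, which matches the direction described in the extended data, where highly expressed genes show a wider fragment-size distribution and therefore a higher PFE.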

Fig. 1: Correlation of gene expression and cfDNA molecular features.
Fig. 2: Fragment size entropy in relation to gene structure informs expression inferences from whole-exome cfDNA profiling.
Fig. 3: EPIC-seq design and workflow.
Fig. 4: Application of EPIC-seq for lung cancer detection and histological classification.
Fig. 5: Application of EPIC-seq for DLBCL detection.
Fig. 6: Application of EPIC-seq for DLBCL COO classification.

Data availability

For each sample profiled in this study, we also provide anonymized fragmentomic data for fragments meeting minimal MAPQ and read FLAGs, which are available at https://epicseq.stanford.edu. These data are summarized across TSS regions by fragment size distributions (as in Fig. 1b). Moreover, the anonymized sequencing reads of samples profiled whole-genome (n = 3 deep whole-genome; n = 3 whole-genome and n = 24 shallow whole-genome) and whole-exome (n = 39 deep WES) are deposited at SRA PRJNA795275.

Code availability

The custom EPIC-seq software code for fragmentomic featurization and gene expression inference from cfDNA BAM files can be accessed at https://epicseq.stanford.edu/.

References

  1. Jahr, S. et al. DNA fragments in the blood plasma of cancer patients: quantitations and evidence for their origin from apoptotic and necrotic cells. Cancer Res. 61, 1659–1665 (2001).

  2. Lo, Y. M. et al. Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus. Sci. Transl. Med. 2, 61ra91 (2010).

  3. Heitzer, E., Auinger, L. & Speicher, M. R. Cell-free DNA and apoptosis: how dead cells inform about the living. Trends Mol. Med. 26, 519–528 (2020).

  4. Newman, A. M. et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 20, 548–554 (2014).

  5. Phallen, J. et al. Direct detection of early-stage cancers using circulating tumor DNA. Sci. Transl. Med. 9, eaan2415 (2017).

  6. Cohen, J. D. et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science 359, 926–930 (2018).

  7. Cristiano, S. et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385–389 (2019).

  8. Heitzer, E., Haque, I. S., Roberts, C. E. S. & Speicher, M. R. Current and future perspectives of liquid biopsies in genomics-driven oncology. Nat. Rev. Genet. 20, 71–88 (2019).

  9. Chabon, J. J. et al. Integrating genomic features for non-invasive early lung cancer detection. Nature 580, 245–251 (2020).

  10. Van Opstal, D. et al. Origin and clinical relevance of chromosomal aberrations other than the common trisomies detected by genome-wide NIPS: results of the TRIDENT study. Genet. Med. 20, 480–485 (2018).

  11. Fan, H. C. et al. Non-invasive prenatal measurement of the fetal genome. Nature 487, 320–324 (2012).

  12. Knight, S. R., Thorne, A. & Lo Faro, M. L. Donor-specific cell-free DNA as a biomarker in solid organ transplantation. a systematic review. Transplantation 103, 273–283 (2019).

  13. Chaudhuri, A. A. et al. Early detection of molecular residual disease in localized lung cancer by circulating tumor DNA profiling. Cancer Discov. 7, 1394–1403 (2017).

  14. Lennon, A. M. et al. Feasibility of blood testing combined with PET-CT to screen for cancer and guide intervention. Science 369, eabb9601 (2020).

  15. Zviran, A. et al. Genome-wide cell-free DNA mutational integration enables ultra-sensitive cancer monitoring. Nat. Med. 26, 1114–1124 (2020).

  16. Lo, Y. M. et al. Presence of donor-specific DNA in plasma of kidney and liver-transplant recipients. Lancet 351, 1329–1330 (1998).

  17. Snyder, T. M., Khush, K. K., Valantine, H. A. & Quake, S. R. Universal noninvasive detection of solid organ transplant rejection. Proc. Natl Acad. Sci. USA 108, 6229–6234 (2011).

  18. Lehmann-Werman, R. et al. Identification of tissue-specific cell death using methylation patterns of circulating DNA. Proc. Natl Acad. Sci. USA 113, E1826–1834 (2016).

  19. Jiang, P. et al. Preferred end coordinates and somatic variants as signatures of circulating tumor DNA associated with hepatocellular carcinoma. Proc. Natl Acad. Sci. USA 115, E10925–E10933 (2018).

  20. Sun, K. et al. Orientation-aware plasma cell-free DNA fragmentation analysis in open chromatin regions informs tissue of origin. Genome Res. 29, 418–427 (2019).

  21. Sadeh, R. et al. ChIP–seq of plasma cell-free nucleosomes identifies gene expression programs of the cells of origin. Nat. Biotechnol. 39, 586–598 (2021).

  22. Lui, Y. Y. et al. Predominant hematopoietic origin of cell-free DNA in plasma and serum after sex-mismatched bone marrow transplantation. Clin. Chem. 48, 421–427 (2002).

  23. Fleischhacker, M. & Schmidt, B. Circulating nucleic acids (CNAs) and cancer—a survey. Biochim. Biophys. Acta 1775, 181–232 (2007).

  24. Ramachandran, S., Ahmad, K. & Henikoff, S. Transcription and remodeling produce asymmetrically unwrapped nucleosomal intermediates. Mol. Cell 68, 1038–1053 e1034 (2017).

  25. Snyder, M. W., Kircher, M., Hill, A. J., Daza, R. M. & Shendure, J. Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell 164, 57–68 (2016).

  26. Ivanov, M., Baranova, A., Butler, T., Spellman, P. & Mileyko, V. Non-random fragmentation patterns in circulating cell-free DNA reflect epigenetic regulation. BMC Genomics 16, S1 (2015).

  27. Ulz, P. et al. Inferring expressed genes by whole-genome sequencing of plasma DNA. Nat. Genet. 48, 1273–1278 (2016).

  28. Wu, J. et al. Decoding genetic and epigenetic information embedded in cell free DNA with adapted SALP-seq. Int. J. Cancer 145, 2395–2406 (2019).

  29. Jiang, P. et al. Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients. Proc. Natl Acad. Sci. USA 112, E1317–1325 (2015).

  30. Underhill, H. R. et al. Fragment length of circulating tumor DNA. PLoS Genet. 12, e1006162 (2016).

  31. Mouliere, F. et al. Enhanced detection of circulating tumor DNA by fragment size analysis. Sci. Transl. Med. 10, eaat4921 (2018).

  32. Ulz, P. et al. Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nat. Commun. 10, 4666 (2019).

  33. Moss, J. et al. Comprehensive human cell-type methylation atlas reveals origins of circulating cell-free DNA in health and disease. Nat. Commun. 9, 5068 (2018).

  34. Weintraub, H. & Groudine, M. Chromosomal subunits in active genes have an altered conformation. Science 193, 848–856 (1976).

  35. Jiang, P. et al. Plasma DNA end-motif profiling as a fragmentomic marker in cancer, pregnancy, and transplantation. Cancer Discov. 10, 664–673 (2020).

  36. The Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014).

  37. The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012).

  38. Puglisi, F. et al. Prognostic value of thyroid transcription factor-1 in primary, resected, non-small cell lung carcinoma. Mod. Pathol. 12, 318–324 (1999).

  39. Maloney, D. G. et al. Phase I clinical trial using escalating single-dose infusion of chimeric anti-CD20 monoclonal antibody (IDEC-C2B8) in patients with recurrent B-cell lymphoma. Blood 84, 2457–2466 (1994).

  40. Ferlay, J. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136, E359–386 (2015).

  41. Torre, L. A., Siegel, R. L. & Jemal, A. Lung cancer statistics. Adv. Exp. Med. Biol. 893, 1–19 (2016).

  42. Newman, A. M. et al. Integrated digital error suppression for improved detection of circulating tumor DNA. Nat. Biotechnol. 34, 547–555 (2016).

  43. Travis, W. D. et al. The 2015 World Health Organization Classification of lung tumors: impact of genetic, clinical and radiologic advances since the 2004 classification. J. Thorac. Oncol. 10, 1243–1260 (2015).

  44. Reck, M. & Rabe, K. F. Precision diagnosis and treatment for advanced non-small-cell lung cancer. N. Engl. J. Med. 377, 849–861 (2017).

  45. Ettinger, D. S. et al. NCCN guidelines insights: non-small cell lung cancer, version 1.2020. J. Natl Compr. Cancer Netw. 17, 1464–1472 (2019).

  46. Wiener, R. S., Schwartz, L. M., Woloshin, S. & Welch, H. G. Population-based risk for complications after transthoracic needle lung biopsy of a pulmonary nodule: an analysis of discharge records. Ann. Intern. Med. 155, 137–144 (2011).

  47. Bubendorf, L., Lantuejoul, S., de Langen, A.J. & Thunnissen, E. Nonsmall cell lung carcinoma: diagnostic difficulties in small biopsies and cytological specimens: number 2 in the series ‘Pathology for the clinician’ edited by Peter Dorfmuller and Alberto Cavazza. Eur. Respir. Rev. 26, 170007 (2017).

  48. McLean, A. E. B., Barnes, D. J. & Troy, L. K. Diagnosing lung cancer: the complexities of obtaining a tissue diagnosis in the era of minimally invasive and personalised medicine. J. Clin. Med. 7, 163 (2018).

  49. Reck, M. et al. Pembrolizumab versus chemotherapy for PD-L1-positive non-small-cell lung cancer. N. Engl. J. Med. 375, 1823–1833 (2016).

  50. Socinski, M. A. et al. Atezolizumab for first-line treatment of metastatic nonsquamous NSCLC. N. Engl. J. Med. 378, 2288–2301 (2018).

  51. Gandhi, L. et al. Pembrolizumab plus chemotherapy in metastatic non-small-cell lung cancer. N. Engl. J. Med. 378, 2078–2092 (2018).

  52. Hellmann, M. D. et al. Nivolumab plus Ipilimumab in lung cancer with a high tumor mutational burden. N. Engl. J. Med. 378, 2093–2104 (2018).

  53. Camidge, D. R., Doebele, R. C. & Kerr, K. M. Comparing and contrasting predictive biomarkers for immunotherapy and targeted therapy of NSCLC. Nat. Rev. Clin. Oncol. 16, 341–355 (2019).

  54. Nabet, B. Y. et al. Noninvasive early identification of therapeutic benefit from immune checkpoint inhibition. Cell 183, 363–376 e313 (2020).

  55. Menon, M. P., Pittaluga, S. & Jaffe, E. S. The histological and biological spectrum of diffuse large B-cell lymphoma in the World Health Organization classification. Cancer J. 18, 411–420 (2012).

  56. Sehn, L. H. et al. The revised International Prognostic Index (R-IPI) is a better predictor of outcome than the standard IPI for patients with diffuse large B-cell lymphoma treated with R-CHOP. Blood 109, 1857–1861 (2007).

  57. Alizadeh, A. A. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000).

  58. Pasqualucci, L. et al. Analysis of the coding genome of diffuse large B-cell lymphoma. Nat. Genet. 43, 830–837 (2011).

  59. Cottereau, A. S. et al. Molecular profile and FDG-PET/CT total metabolic tumor volume improve risk classification at diagnosis for patients with diffuse large B-cell lymphoma. Clin. Cancer Res. 22, 3801–3809 (2016).

  60. Scherer, F. et al. Distinct biological subtypes and patterns of genome evolution in lymphoma revealed by circulating tumor DNA. Sci. Transl. Med. 8, 364ra155 (2016).

  61. Kurtz, D. M. et al. Circulating tumor DNA measurements as early outcome predictors in diffuse large B-cell lymphoma. J. Clin. Oncol. 36, 2845–2853 (2018).

  62. Rosenwald, A. et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med. 346, 1937–1947 (2002).

  63. Basso, K. & Dalla-Favera, R. Germinal centres and B cell lymphomagenesis. Nat. Rev. Immunol. 15, 172–184 (2015).

  64. Dunleavy, K. et al. Differential efficacy of bortezomib plus chemotherapy within molecular subtypes of diffuse large B-cell lymphoma. Blood 113, 6069–6076 (2009).

  65. Thieblemont, C. et al. The germinal center/activated B-cell subclassification has a prognostic impact for response to salvage therapy in relapsed/refractory diffuse large B-cell lymphoma: a bio-CORAL study. J. Clin. Oncol. 29, 4079–4087 (2011).

  66. Scott, D. W. et al. Determining cell-of-origin subtypes of diffuse large B-cell lymphoma using gene expression in formalin-fixed paraffin-embedded tissue. Blood 123, 1214–1217 (2014).

  67. Nowakowski, G. S. et al. Lenalidomide combined with R-CHOP overcomes negative prognostic impact of non-germinal center B-cell phenotype in newly diagnosed diffuse large B-cell lymphoma: a phase II study. J. Clin. Oncol. 33, 251–257 (2015).

  68. Wilson, W. H. et al. Targeting B cell receptor signaling with ibrutinib in diffuse large B cell lymphoma. Nat. Med. 21, 922–926 (2015).

  69. Young, R. M. & Staudt, L. M. Targeting pathological B cell receptor signalling in lymphoid malignancies. Nat. Rev. Drug Discov. 12, 229–243 (2013).

  70. Lenz, G. et al. Stromal gene signatures in large-B-cell lymphomas. N. Engl. J. Med. 359, 2313–2323 (2008).

  71. Zelenetz, A. D. et al. NCCN guidelines insights: B-cell lymphomas, version 3.2019. J. Natl Compr. Cancer Netw. 17, 650–661 (2019).

  72. Hans, C. P. et al. Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray. Blood 103, 275–282 (2004).

  73. Lossos, I. S. et al. Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N. Engl. J. Med. 350, 1828–1837 (2004).

  74. Malumbres, R. et al. Paraffin-based 6-gene model predicts outcome in diffuse large B-cell lymphoma patients treated with R-CHOP. Blood 111, 5509–5514 (2008).

  75. Alizadeh, A. A., Gentles, A. J., Lossos, I. S. & Levy, R. Molecular outcome prediction in diffuse large-B-cell lymphoma. N. Engl. J. Med. 360, 2794–2795 (2009).

  76. Alizadeh, A. A. et al. Prediction of survival in diffuse large B-cell lymphoma based on the expression of 2 genes reflecting tumor and microenvironment. Blood 118, 1350–1358 (2011).

  77. Chapuy, B. et al. Molecular subtypes of diffuse large B cell lymphoma are associated with distinct pathogenic mechanisms and outcomes. Nat. Med. 24, 679–690 (2018).

  78. Ennishi, D. et al. Double-Hit gene expression signature defines a distinct subgroup of germinal center B-cell-like diffuse large B-cell lymphoma. J. Clin. Oncol. 37, 190–201 (2019).

  79. Gentles, A. J. & Alizadeh, A. A. A few good genes: simple, biologically motivated signatures for cancer prognosis. Cell Cycle 10, 3615–3616 (2011).

  80. Chambers, J. & Rabbitts, T. H. LMO2 at 25 years: a paradigm of chromosomal translocation proteins. Open Biol. 5, 150062 (2015).

  81. Royer-Pokora, B. et al. The TTG-2/RBTN2 T cell oncogene encodes two alternative transcripts from two promoters: the distal promoter is removed by most 11p13 translocations in acute T cell leukaemia’s (T-ALL). Oncogene 10, 1353–1360 (1995).

  82. Oram, S. H. et al. A previously unrecognized promoter of LMO2 forms part of a transcriptional regulatory circuit mediating LMO2 expression in a subset of T-acute lymphoblastic leukaemia patients. Oncogene 29, 5796–5808 (2010).

  83. Boehm, T. et al. An unusual structure of a putative T cell oncogene which allows production of similar proteins from distinct mRNAs. EMBO J. 9, 857–868 (1990).

  84. Smale, S. T. & Kadonaga, J. T. The RNA polymerase II core promoter. Annu. Rev. Biochem. 72, 449–479 (2003).

  85. Bernstein, B. E. et al. Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 120, 169–181 (2005).

  86. Wong, I. H. et al. Detection of aberrant p16 methylation in the plasma and serum of liver cancer patients. Cancer Res. 59, 71–73 (1999).

  87. Chim, S. S. et al. Detection of the placental epigenetic signature of the maspin gene in maternal plasma. Proc. Natl Acad. Sci. USA 102, 14753–14758 (2005).

  88. Fernandez, A. F. et al. A DNA methylation fingerprint of 1628 human samples. Genome Res. 22, 407–419 (2012).

  89. Houseman, E. A. et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics 13, 86 (2012).

  90. Chan, K. C. et al. Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing. Proc. Natl Acad. Sci. USA 110, 18761–18768 (2013).

  91. Lun, F. M. et al. Noninvasive prenatal methylomic analysis by genomewide bisulfite sequencing of maternal plasma DNA. Clin. Chem. 59, 1583–1594 (2013).

  92. Ou, X. et al. Epigenome-wide DNA methylation assay reveals placental epigenetic markers for noninvasive fetal single-nucleotide polymorphism genotyping in maternal plasma. Transfusion 54, 2523–2533 (2014).

  93. Jensen, T. J. et al. Whole genome bisulfite sequencing of cell-free DNA and its cellular contributors uncovers placenta hypomethylated domains. Genome Biol. 16, 78 (2015).

  94. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

  95. Visel, A. et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854–858 (2009).

  96. Koh, W. et al. Noninvasive in vivo monitoring of tissue-specific global gene expression in humans. Proc. Natl Acad. Sci. USA 111, 7361–7366 (2014).

  97. Srinivasan, S. et al. Small RNA sequencing across diverse biofluids identifies optimal methods for exRNA isolation. Cell 177, 446–462 e416 (2019).

  98. Ibarra, A. et al. Non-invasive characterization of human bone marrow stimulation and reconstitution by cell-free messenger RNA sequencing. Nat. Commun. 11, 400 (2020).

  99. Zhou, Z. et al. Extracellular RNA in a single droplet of human serum reflects physiologic and disease states. Proc. Natl Acad. Sci. USA 116, 19200–19208 (2019).

  100. Verwilt, J. et al. When DNA gets in the way: a cautionary note for DNA contamination in extracellular RNA-seq studies. Proc. Natl Acad. Sci. USA 117, 18934–18936 (2020).

  101. Schmitz, R. et al. Genetics and pathogenesis of diffuse large B-cell lymphoma. N. Engl. J. Med. 378, 1396–1407 (2018).

  102. Gentles, A. J. et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat. Med. 21, 938–945 (2015).

  103. Binkley, M. S. et al. KEAP1/NFE2L2 mutations predict lung cancer radiation resistance that can be targeted by glutaminase inhibition. Cancer Discov. 10, 1826–1841 (2020).

  104. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).

  105. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).

  106. Alig, S. et al. Short diagnosis-to-treatment interval is associated with increased tumor burden measured by circulating tumor DNA and metabolic tumor volume in diffuse large B-cell lymphoma. J. Clin. Oncol. 39, 2605–2616 (2021).

  107. George, J. et al. Comprehensive genomic profiles of small cell lung cancer. Nature 524, 47–53 (2015).

  108. U, M., Talevich, E., Katiyar, S., Rasheed, K. & Kannan, N. Prediction and prioritization of rare oncogenic mutations in the cancer Kinome using novel features and multiple classifiers. PLoS Comput. Biol. 10, e1003545 (2014).

  109. Venkatraman, E. S. & Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23, 657–663 (2007).

  110. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).

  111. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).

Acknowledgements

We thank the patients and families who participated in this study, and we are grateful for all the insightful discussions with the members of the Alizadeh and Diehn laboratories. Funding sources: this work was supported by the National Cancer Institute (M.S.E., grant no. R25CA180993; M.S.E., grant no. T32CA009302; M.M., grant no. F99CA212457, M.D. and A.A.A., grant nos. R01CA188298, R01CA254179, R01CA257655, R01CA244526, R01CA233975, and R01CA229766), the US National Institutes of Health Director’s New Innovator Award Program (M.D.; grant no. 1-DP2-CA186569), the Virginia and D.K. Ludwig Fund for Cancer Research (M.D. and A.A.A.), and philanthropic support from the Bakewell Foundation (to M.D. and A.A.A.), the CRK Faculty Scholar Fund (M.D.), the Troper-Wojcicki Family Gift, The Shanahan Family Lymphoma Fund, Arzang Family Lymphoma Fund, The Skeff Family Lymphoma Fund, The Cane-Nowak Family Foundation, The Marc Benioff Fund, Jewish Communal Fund for Lymphoma Research, The Sara Schottenstein Memorial Fund, and The Moghadam Family Endowed Professorship (to A.A.A.). A.A.A. is a scholar of the Leukemia and Lymphoma Society. J.M.I., C.M.R. and B.T.L. are supported by the MSKCC Support grant no. P30 CA008748 from the NIH. B.Y.N. is a Stanford Cancer Systems Biology Scholar and supported by the NIH (grant no. 5R25CA180993) and by the Postdoctoral Research Fellowship (grant no. 134031-PF-19-164-01-TBG) from the American Cancer Society. Present address for M.M. is the Department of Biotechnology, College of Science, University of Tehran, Iran, and for D.A.K. is the Northwell Health Cancer Institute and Feinstein Institute of Research, Lake Success, NY, USA.

Author information

Contributions

M.S.E., M.M., M.D. and A.A.A. developed the concept of EPIC-seq. M.S.E., M.D. and A.A.A. developed the concept of PFE, designed the experiments, analyzed the data and wrote the manuscript. M.S.E. and M.M. designed the EPIC-seq selector with inputs from H.S., A.M.N., M.D. and A.A.A. M.S.E. performed all the statistical and machine learning analyses and the mechanistic modeling with assistance from M.M., B.C. and C.B.S. M.S.E., M.M., E.G.H. and A.S. performed the bioinformatics analyses with assistance from D.A.K., C.L.L., M.C.N. and J.J.C. E.G.H., C.W.M., J.S., B.Y.N., A.B.-Y.H., J.G.S-M. and E.J.M. performed molecular biology experiments related to profiling clinical specimens with assistance from J.J.C. S.K.A., B.J.S., B.Y.N., M.J.F., J.M.I., C.M.R., B.T.L., D.M.K., M.D. and A.A.A. provided patient specimens and clinical data. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Maximilian Diehn or Ash A. Alizadeh.

Ethics declarations

Competing interests

A.A.A. reports research funding from Celgene, Pfizer, ownership interests in FortySeven, CiberMed, ForeSight and paid consultancy from Roche, Genentech, Janssen, Pharmacyclics, Gilead, Celgene and Chugai. M.D. reports research funding from Varian Medical Systems and Illumina, ownership interest in CiberMed, ForeSight and paid consultancy from Roche, AstraZeneca, Illumina, RefleXion and BioNTech. J.J.C. reports paid consultancy from Lexent Bio Inc. and ownership interests in ForeSight. A.M.N. has patent filings related to expression deconvolution and cancer biomarkers and has served as a consultant for Roche, Merck and CiberMed. D.M.K. reports paid consultancy from Roche. B.T.L. has served as an uncompensated advisor and consultant to Amgen, Genentech, Boehringer Ingelheim, Lilly, AstraZeneca and Daiichi Sankyo. B.T.L. reports receiving research grants to his institution from Amgen, Genentech, AstraZeneca, Daiichi Sankyo, Lilly, Illumina, GRAIL, Guardant Health, Hengrui Therapeutics, MORE Health and Bolt Biotherapeutics. B.T.L. has received academic travel support from MORE Health and Jiangsu Hengrui Medicine. B.T.L. reports to be inventor on two institutional patents at MSKCC (US62/685,057, US62/514,661) and has intellectual property rights as a book author at Karger Publishers and Shanghai Jiao Tong University Press. J.M.I. reports serving as an unpaid consultant to Amgen and Roche-Genentech, institutional research support from Guardant Health and GRAIL, and ownership interest in LumaCyte. A.A.A., M.D., M.S.E., D.M.K., J.J.C., and B.Y.N. report patent filings related to cancer biomarkers. M.S.E., M.M., A.A.A. and M.D. have patent filing related to this paper. B.Y.N. is currently an employee and holds stock from Roche/Genentech. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Fragment length density at the transcription start sites varies with gene expression.

Fragment length density at the transcription start sites varies with gene expression. (a) A heatmap of fragment length densities across 1,748 groups of genes (similar to Fig. 1a). Three regions R1 (100–150 bps), R2 (151–210 bps), and R3 (211–300 bps) show enrichment in either high or low expression gene groups. (b) The percent of fragments within each region defined in panel (a) in the deep whole-genome sample across deciles of the reference PBMC gene expression vector, that is, 10 groups of genes when sorted by their expression values in PBMC. Highly expressed genes include fewer monosome fragments, indicating a wider distribution and thereby a higher PFE. (c) Fraction of fragments within the three regions, R1-R3, for exons vs introns vs TSS sites for the top (and bottom) 2000 genes as ranked by expression. The fraction of monosomal fragments within TSS regions is substantially lower than within intronic and exonic regions (63.5% at TSS vs ~71% at non-TSS). Pearson’s Chi-Squared goodness-of-fit tests resulted in the following test statistics (TSS vs Exon: G = 62,133 [P < 2.2E-16]; TSS vs Intron: G = 84,110 [P < 2.2E-16]). (d) Fraction of fragments falling within each region (R1, R2, and R3) for mutant cfDNA fragments and their wildtype counterparts. Each dot represents one tuple (variant-patient) and the connecting lines indicate the paired mutant-wildtype status. These results show that the mutant cfDNA fragments are enriched for R1 and R3 while wildtype fragments are enriched in R2. (e) A contour plot capturing the relationship between expression level (depicted by heat) as a function of two cfDNA fragmentomic features used in the gene inference model: PFE and NDR. (f) ROC analysis of a ‘NSCLC Score’ for noninvasively distinguishing patients with NSCLC from healthy controls (AUC = 0.76). 
The genes comprising this score were first defined from external RNA-Seq profiling data of primary NSCLC tumor tissues and blood samples, allowing subsequent calculation of their corresponding PFE in cfDNA samples profiled by WGS for independent NSCLC cases and healthy controls. (g) A schematic for the analyses performed for Fig. 2d–h. (h) Sample-level ‘SCLC Score’ from deep whole exome analysis of cfDNA and associated diagnostic performance. As in the exercise for NSCLC depicted in panel f, the genes comprising this SCLC score were first defined from external RNA-Seq profiling data of primary SCLC tumor tissues and blood samples. The corresponding PFEs (as the difference between the overall PFE level of top and bottom gene signatures) were subsequently calculated in cfDNA samples we profiled by deep WES for independent SCLC cases and healthy controls. Using these scores, an AUC of 0.9 was achieved in distinguishing cases from controls. (i) The Venn diagram of SCLC high genes identified in cfDNA (whole exome profiling) and tumor biopsy (RNA-Seq transcriptome profiling), with significance of overlap assessed by hypergeometric test.
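The region fractions in panels (b) to (d), and their link to entropy, can be illustrated with a minimal sketch. It assumes plain Shannon entropy over the fragment-length histogram as a simplified stand-in for PFE, which this legend does not define. The function names and toy fragment lengths are hypothetical; only the region boundaries come from panel (a).

```python
import math
from collections import Counter

# Region boundaries (bp, inclusive) as given in panel (a).
REGIONS = {"R1": (100, 150), "R2": (151, 210), "R3": (211, 300)}


def region_fractions(lengths):
    """Fraction of fragments whose length falls in R1, R2 and R3."""
    total = len(lengths)
    return {
        name: sum(lo <= x <= hi for x in lengths) / total
        for name, (lo, hi) in REGIONS.items()
    }


def shannon_entropy(lengths):
    """Shannon entropy (bits) of the empirical fragment-length distribution.

    Assumption: used here only as a simplified proxy for PFE.
    """
    counts = Counter(lengths)
    total = len(lengths)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data: a narrow, mostly mononucleosomal profile versus a
# broader profile with more short and long fragments.
narrow = [167] * 8 + [166, 168]
broad = [120, 135, 150, 160, 167, 170, 185, 230, 260, 290]

print(region_fractions(narrow))
print(region_fractions(broad))
print(shannon_entropy(narrow), shannon_entropy(broad))
```

In this toy case the broad profile has a smaller R2 fraction and a higher entropy, which matches the direction described in panel (b).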

Extended Data Fig. 2 Ensemble model accurately predicts gene expression in validation samples.

Ensemble model accurately predicts gene expression in validation samples. (a) The scatterplot of the predicted vs a population-averaged gene expression across 1,748 groups of genes. The underlying data are from a merged cfDNA ‘meta-sample’ (pooled from merger of 27 healthy subjects profiled by relatively shallow WGS), achieving a correlation of 0.9 in initial validation. (b) The meta-sample from panel (a) was used to assess model performance, when considering TSS-level expression values without gene grouping (n = 1), as well as scenarios with 2, 3, 5 and 10 genes per group. The Pearson correlation between observed expression in PBMC versus predicted expression from our model (combining PFE and NDR) is shown in green bars. This correlation substantially improves as the number of genes per group increases. The Pearson correlation values between observed gene expression and those predicted by NDR or PFE expression are shown in blue and green bars, respectively. (c) Scatterplot depicts predicted versus observed gene expression measurements across 1,748 groups of genes (dots), when comparing expression measurements by RNA-Seq on matched PBMC (x-axis) against plasma cfDNA inferences (y-axis), for a validation sample from a healthy adult that we also profiled by deep WGS (~200x). This achieved a Pearson correlation of 0.86. (d) Similar to panel c, but for a second healthy adult control subject also profiled for validation, by deep WGS of cfDNA and matched RNA-Seq of PBMC (Pearson r = 0.91). (e-f) The same analysis as in panels (a-b) for a meta whole-genome sample generated from healthy subjects from Zviran et al. (g) The whole genome samples (depth ~20-40x) from Zviran et al. were used with every ten genes grouped, and the concordance between model-predicted expression and PBMC expression was evaluated using Pearson correlation (that is, each dot is one subject).
The non-cancer samples show a significantly higher correlation with normal PBMC than lung cancer cases (Wilcoxon P = 0.018). (h) The ichorCNA tumor fraction estimates of the lung cancer cases in panel g are compared with the correlations in panel g. As shown in the scatterplot, the correlation decreases as tumor fraction increases (r = −0.69, P = 0.00052).
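The effect of gene grouping described in panel (b) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: genes are sorted by observed expression and averaged in consecutive groups, a trailing incomplete group is dropped, and the toy values are hypothetical. The helper names `pearson` and `group_means` are introduced here for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def group_means(observed, predicted, group_size):
    """Sort genes by observed expression, then average both vectors per group.

    Assumption: genes are grouped consecutively after sorting, and a
    trailing incomplete group is dropped.
    """
    order = sorted(range(len(observed)), key=lambda i: observed[i])
    obs_means, pred_means = [], []
    for start in range(0, len(order) - group_size + 1, group_size):
        idx = order[start:start + group_size]
        obs_means.append(sum(observed[i] for i in idx) / group_size)
        pred_means.append(sum(predicted[i] for i in idx) / group_size)
    return obs_means, pred_means


# Hypothetical toy data: predictions equal the truth plus alternating noise.
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

r_single = pearson(observed, predicted)
r_grouped = pearson(*group_means(observed, predicted, group_size=2))
print(r_single, r_grouped)
```

Averaging within groups cancels gene-level noise in this toy case, so the grouped correlation is higher than the per-gene correlation.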

Extended Data Fig. 3 Case-level information of samples profiled by EPIC-Seq.

Cohorts and cell-free DNA samples profiled by EPIC-seq in this study, including Cancer Cases and Control Subjects. (a) Schema depicts the full set of specimens profiled by EPIC-Seq (n = 373), including those meeting Quality Control (QC) criteria (n = 352, 95%). A subset of samples was used for the initial gene expression model tuning (n = 2) and TSS filtering (n = 21). The remaining 329 samples were profiled by EPIC-Seq to address disease-specific questions, including utility for cancer detection, classification of histology and cell-of-origin, and response monitoring. These included 252 samples (76.6%) from 226 subjects that comprised our Discovery/Training cohort (large light purple rectangle), as well as subsequent profiling of a Validation Cohort of 77 samples (23.4%) from 69 subjects, after models were ‘locked down’ (large light green rectangle). For a subset of 22 NSCLC patients, a pair of serial blood samples was monitored for ICI response (to allow comparisons of both EPIC-Seq and CAPP-Seq and assess biological plausibility), but this exercise was not subject to any model training. No samples were shared between Training and Validation exercises, with all models locked down before independent validations. Four healthy subjects (4.5%) provided more than one cfDNA specimen, with one used for Training and the second for Validation. (b) Distribution of demographic, clinical, anatomic, and pathological variables for subjects profiled by EPIC-Seq. Tabulated are the relevant indices for cancer cases (235 blood samples from 201 patients), including NSCLC patients (light blue; 109 blood samples from 87 patients), DLBCL patients (light orange; 126 blood samples from 104 patients), and non-cancer control subjects (gray; 94 blood samples from 87 adults).

Extended Data Fig. 4 Correlation between EPIC-lung score and clinical factors.

Concordance between EPIC-Seq measurements and established NSCLC risk factors including metabolic tumor burden, ctDNA level, and ctDNA response. (a) Concordance between EPIC-lung score and metabolic tumor volume (MTV), as measured by Spearman correlation (ρ = 0.67; P = 0.04). (b) Concordance between EPIC-lung score and the ctDNA mean allele fractions as measured by CAPP-Seq, evaluated using Spearman correlation (ρ = 0.5; P = 3E-5). (c) Relationships between genetic versus epigenetic molecular responses to Immune Checkpoint Inhibitor (ICI) therapy in advanced NSCLC. Scatterplot compares molecular responses measured noninvasively by CAPP-Seq (x-axis; fold change, Log10) and EPIC-Seq (lung dynamics score; y-axis) using serial plasma profiling before and after ICI therapy. The two orthogonal measures show moderate but significant correlation (r = 0.53, P = 0.012).
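Panels (a) and (b) report Spearman correlations. As a minimal sketch of how that statistic is computed, the code below ranks both variables (tied values share their average rank) and takes the Pearson correlation of the ranks. The function names and the toy data are hypothetical and are not taken from the study.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: a score that rises monotonically, but not linearly,
# with tumor volume (mL).
volume = [12.0, 40.0, 95.0, 220.0, 510.0]
score = [0.10, 0.15, 0.40, 0.45, 0.90]
print(spearman(volume, score))
```

Because only ranks enter the calculation, the statistic is insensitive to the non-linear scale of either measurement.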

Extended Data Fig. 5 Correlation between EPIC-lymphoma score and clinical factors, results of the validation set and prognostic value of the LMO2 distal promoter.

Concordance between EPIC-Seq measurements and established DLBCL risk factors impacting outcome, including metabolic tumor volume, ctDNA level, and Cell-of-Origin. (a) The boxplots illustrate the two groups of patients stratified by their metabolic tumor volumes (>220 vs <220 mL; Wilcoxon P = 0.015). (b) Similar to panel a, but for the DLBCL Validation Cohort. (c) Concordance between EPIC-DLBCL scores and ctDNA mean allele fractions (from CAPP-Seq), evaluated using Spearman correlation (ρ = 0.66; P < 2E-16). (d) The EPIC-DLBCL model is applied to the cfDNA profiles of 13 samples from two DLBCL patients (DLBCL002 [ABC] and DLBCL007 [GCB]). The concordance between the resulting scores and the ctDNA mean allele fractions is evaluated by Spearman correlation (ρ = 0.79; P = 0.004). (e) Relationship between DLBCL cell-of-origin EPIC-Seq GCB scores and mutation-based GCB scores as measured by CAPP-Seq in the validation set (Spearman ρ = 0.64, P = 0.01). Each dot represents one sample (related to Fig. 6a). (f) Relationship between EPIC-Seq GCB scores from cfDNA and matched tumor tissue classification by routine Hans immunohistochemical algorithm in the validation set (Wilcoxon P = 0.001; related to Fig. 6b). (g) Relationship between EPIC-Seq GCB scores from cfDNA and tumor classification by RNA-seq of paired tumor tissue (Jonckheere’s trend test, P = 0.015). Box-and-whisker plots depict the EPIC-Seq GCB score in individual samples profiled by EPIC-Seq (dots), with boxes spanning the inter-quartile range; the median is horizontally marked with a line in each box, and whiskers span the 1.5 IQRs. (h) The Kaplan-Meier curves of EFS of the patients when labeled by the Hans algorithm. The non-GCB group contains both Non-GCB and Unknown. (i) The violin plot shows the distributions of Cox Proportional Hazard Model Z-scores when genes are grouped according to their effects on outcome (measured as EFS) in three prior tumor studies.
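Panel (h) shows Kaplan-Meier curves of EFS. The following is a minimal sketch of the product-limit estimator behind such curves, written with the standard library only. The function name, the follow-up times and the event indicators are hypothetical toy values, not data from this study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time per patient
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t.
        at_risk -= n_removed
    return curve


# Hypothetical toy cohort: follow-up in months, with one censored patient.
months = [1, 2, 3, 4]
observed = [1, 0, 1, 1]
print(kaplan_meier(months, observed))
```

Censored patients lower the number at risk without producing a step in the curve, which is why the estimate changes only at observed event times.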

Extended Data Fig. 6 Pre-analytical factors and TSS GC-content correction effect on PFE.

Effect of preanalytical factors on fragment size entropy and effect of GC-content correction on expression model performance. (a) The concordance between PFE values for three healthy controls profiled by EPIC-Seq using paired Streck BCT and K2EDTA tubes. A Pearson correlation of 0.94 was observed between tube types. (b) Effect of time on the bench (that is, in days) on the PFEs in a cohort of plasma cfDNA samples. (c) Effect of additional PCR cycles on PFE. Here we profiled 4 healthy control cfDNA samples by the CAPP-Seq lung cancer selector, with 3 additional PCR cycles included to study their effect. A Pearson correlation of 0.95 was observed between standard conditions versus those incorporating additional PCR cycles. (d) Effect of correction for GC-content of TSS regions on gene expression model accuracy. Four scenarios were studied when correcting features using the GC values for NDR and PFE: PFE alone corrected, NDR alone corrected, both corrected, and neither corrected. The correction was performed using a LOESS function with a span of 0.5. Two healthy control cfDNA samples were profiled by deep whole genome sequencing. For these two subjects, we also profiled the matched PBMC by RNA-Sequencing. We then compared the predicted values from cfDNA against observed values from RNA-Seq for each of the different GC-correction scenarios and tested concordance. The concordance was evaluated using three metrics: Pearson correlation, Spearman correlation, and root-mean-square error (RMSE). When considering both cfDNA samples, none of the four GC-correction approaches seemed to consistently improve correlations or reduce associated error profiles. (e) Whole exome profiling of small-cell lung cancer samples in Fig. 2 is used to investigate association between PFEs and copy number aberrations. We first determined genes with PFE significantly higher in SCLC cfDNA samples (n = 11) compared with healthy control cfDNA samples (n = 28) (‘High’ PFE).
Similarly, we determined genes with significantly lower PFEs in SCLC cfDNA samples (‘Low’ PFE). Then, the copy number states (CNS) corresponding to all genes were identified by overlapping copy number profiles from CNVkit with the genomic coordinates of the first exons. The CNS values were then dichotomized into (i) amplification vs no-amplification and (ii) deletion vs no-deletion. Next, we summarized these by contingency tables for (i) vs PFE levels (top table) and (ii) vs PFE levels (bottom table). Finally, the association between the two was examined via Fisher’s exact test, which showed no significant association in either test (P = 0.97 and P = 0.17; for amplifications and deletions, respectively).
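The contingency-table step in panel (e) can be sketched as below. This is an illustration only: the helper names, the gene labels and the amplification calls are hypothetical, and the two-sided P value is computed by summing hypergeometric probabilities of all tables no more probable than the observed one, which is one common convention for Fisher's exact test.

```python
from math import comb


def contingency(pfe_labels, altered):
    """Cross-tabulate PFE level ('High' or 'Low') against a binary copy-number call.

    Rows are High and Low PFE; columns are altered and not altered.
    """
    table = [[0, 0], [0, 0]]
    for label, alt in zip(pfe_labels, altered):
        table[0 if label == "High" else 1][0 if alt else 1] += 1
    return table


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test P value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hypothetical toy data: eight genes with PFE labels and amplification calls.
labels = ["High"] * 4 + ["Low"] * 4
amplified = [True, True, True, False, True, False, False, False]

table = contingency(labels, amplified)
print(table, fisher_exact_two_sided(table))
```

The same two functions would be applied a second time with deletion calls in place of amplification calls to obtain the second table.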

Extended Data Fig. 7 Mechanistic model and gene detection sensitivity with various parameters.

Mechanistic model and gene detection sensitivity with various parameters. (a) The cartoon shows four scenarios considered in our simulations: (i) protected, meaning that nucleosomes are well-positioned and are all present, (ii) one nucleosome-free position is present, (iii) two nucleosome-free positions are present and (iv) three nucleosome-free positions are present. (b) The density plots show the results of generating fragment lengths via the model described in panel a. Three panels correspond to scenarios (ii-iv) vs (i) in a. (c) A varying mixture parameter is considered and its effect on the entropy is shown for three different coverages: 500x, 2500x and 5000x. (d) A summary of panel c for active gene detection sensitivity while achieving a specificity of 85%. The error bars are from the sensitivities calculated using the ‘ci.se’ function in the R pROC package. The colors correspond to the three different coverages in panel c.
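The summary in panel (d), sensitivity at a fixed specificity of 85%, can be sketched as follows. The function name and the toy entropy scores are hypothetical, and the sketch assumes that a gene is called active when its score is strictly above a threshold. It does not reproduce the confidence intervals that the legend attributes to pROC.

```python
def sensitivity_at_specificity(neg_scores, pos_scores, min_specificity=0.85):
    """Best sensitivity among thresholds whose specificity meets the target.

    neg_scores : scores of inactive genes (negatives)
    pos_scores : scores of active genes (positives)
    A gene is called active when its score is strictly above the threshold.
    """
    best = 0.0
    for thr in sorted(set(neg_scores) | set(pos_scores)):
        specificity = sum(s <= thr for s in neg_scores) / len(neg_scores)
        if specificity >= min_specificity:
            sensitivity = sum(s > thr for s in pos_scores) / len(pos_scores)
            best = max(best, sensitivity)
    return best


# Hypothetical toy entropy scores for inactive and active genes.
inactive = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
active = [8, 9.5, 10.5, 11, 12, 14, 15, 20]

print(sensitivity_at_specificity(inactive, active))
```

With ten negatives, the lowest threshold that meets the 85% target is the one that classifies nine of them correctly, so the reported sensitivity corresponds to a realized specificity of 90% in this toy case.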

Supplementary information

Supplementary Information

Reporting Summary

Supplementary Tables 1–4

Supplementary Table 1: List of samples analyzed or profiled in this study. Supplementary Table 2: Gene groups and their corresponding features in the deep whole-genome sample. Supplementary Table 3: Targeted TSSs in the EPIC-seq panel. Supplementary Table 4: Clinical characteristics of the EPIC-seq cohort and scores from different classifiers.

About this article

Cite this article

Esfahani, M.S., Hamilton, E.G., Mehrmohamadi, M. et al. Inferring gene expression from cell-free DNA fragmentation profiles. Nat Biotechnol 40, 585–597 (2022). https://doi.org/10.1038/s41587-022-01222-4
