Inferring expressed genes by whole-genome sequencing of plasma DNA

Journal name: Nature Genetics
Volume: 48
Pages: 1273–1278
DOI: 10.1038/ng.3648

The analysis of cell-free DNA (cfDNA) in plasma represents a rapidly advancing field in medicine. cfDNA consists predominantly of nucleosome-protected DNA shed into the bloodstream by cells undergoing apoptosis. We performed whole-genome sequencing of plasma DNA and identified two discrete regions at transcription start sites (TSSs) where nucleosome occupancy results in different read depth coverage patterns for expressed and silent genes. By employing machine learning for gene classification, we found that the plasma DNA read depth patterns from healthy donors reflected the expression signature of hematopoietic cells. In patients with cancer having metastatic disease, we were able to classify expressed cancer driver genes in regions with somatic copy number gains with high accuracy. We were able to determine the expressed isoform of genes with several TSSs, as confirmed by RNA-seq analysis of the matching primary tumor. Our analyses provide functional information about cells releasing their DNA into the circulation.
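
The two promoter regions mentioned above appear in the figure legends as the 2K-TSS window (the legend to Fig. 3d marks TSS − 1,000 bp to TSS + 1,000 bp) and the NDR window (−150 bp to +50 bp relative to the TSS, per Supplementary Fig. 1). The following is a minimal Python sketch of how two such coverage values could be derived from a per-base read depth profile. The function names, the `baseline` normalization, and the toy profile are illustrative assumptions rather than the authors' pipeline, and strand orientation is ignored for simplicity.

```python
from statistics import mean


def window_mean(coverage, tss_index, start_offset, end_offset):
    """Mean read depth over [tss + start_offset, tss + end_offset], inclusive."""
    lo = tss_index + start_offset
    hi = tss_index + end_offset
    if lo < 0 or hi >= len(coverage):
        raise ValueError("window extends beyond the coverage profile")
    return mean(coverage[lo:hi + 1])


def tss_features(coverage, tss_index, baseline):
    """Return (2K-TSS, NDR) coverage, each divided by a baseline depth.

    `baseline` is a hypothetical normalization constant (for example the
    sample-wide mean depth); the source does not specify the normalization.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    two_k = window_mean(coverage, tss_index, -1000, 1000) / baseline
    ndr = window_mean(coverage, tss_index, -150, 50) / baseline
    return two_k, ndr


# Toy profile (invented data): flat depth of 10 with a dip to 2 in the NDR window,
# mimicking the reduced coverage described for expressed genes.
profile = [10.0] * 3001
tss = 1500
for position in range(tss - 150, tss + 51):
    profile[position] = 2.0

two_k, ndr = tss_features(profile, tss, baseline=10.0)
print(f"2K-TSS coverage: {two_k:.3f}, NDR coverage: {ndr:.3f}")
```

In this toy case the NDR value drops far more than the 2K-TSS value, which is why the two windows are useful as separate features.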

Figures

  1. Figure 1: Plasma DNA fragment size and patterns of nucleosome positioning.

    (a) Size distribution calculated from paired-end sequencing data of nuclear (red; based on ~110,000 reads) and mitochondrial (blue; based on ~53,000 reads) plasma DNA fragments. (b) Schematic of the coverage patterns caused by nucleosomal protection of cfDNA. Nuclear chromatin digested by MNase is enriched for DNA protected by nucleosomes (dark blue) and depleted for connecting linker regions (light blue). In MNase assays, regions with perfectly positioned nucleosomes have strong coverage peaks reflecting the phasing of nucleosomes (left), whereas the pattern of read depth is different for regions with less preferentially positioned nucleosomes (right) (adapted from ref. 13). (c) Ideogram of chromosome 12 with enlargement of 12p11.1, which contains an extreme example of sequence-directed nucleosome positioning (ref. 10). Read depth coverage for plasma DNA fragments from female and male donors (n = 154; merged data) is shown in blue, and the MNase midpoint density map from cell line GM12878 is shown in red. To the right is a comparison between plasma DNA read depth and the MNase midpoint density map, demonstrating a strong correlation (Pearson: 0.709, P < 2.2 × 10^−16; Spearman: 0.708, P < 2.2 × 10^−16).

  2. Figure 2: Nucleosome positioning at transcription start sites.

    (a) Sequencing coverage at promoter sites for housekeeping (red) and unexpressed (blue) genes generated with plasma samples from 104 donors. The coverage pattern reflects nucleosome organization. At the start of transcription, nucleosomes are removed to create an NDR over the promoter, allowing transcription factors to bind (refs. 11–13). The reduction in nucleosome occupancy for expressed housekeeping genes corresponded to decreased coverage (x axis, distance from the TSS; y axis, relative coverage reflecting nucleosome dyads). (b) MNase midpoint density maps in the GM12878 cell line for the 1,000 (representing 1,334 TSSs) most highly expressed genes (red) and the 1,000 (representing 1,109 TSSs) least expressed genes (blue) in blood, identified on the basis of published plasma RNA-seq data (ref. 18). Shaded areas represent 95% confidence intervals. (c) Plasma DNA read depth maps for the promoter regions of the genes in b (red, 1,000 most highly expressed genes; blue, 1,000 least expressed genes). (d) Plasma DNA read depth patterns at the promoters of differently expressed genes (red, FPKM >8; orange, FPKM >1 and ≤8; light blue, FPKM >0.1 and ≤1; dark blue, FPKM ≤0.1).

  3. Figure 3: Classification of expressed and silent genes by plasma DNA read depth analyses.

    (a) Kernel density estimation identified two separate gene clusters based on normalized coverage patterns at 2K-TSS and NDR regions. (b) SVM classification based on normalized 2K-TSS and NDR coverage for the 100 (left) and 1,000 (right) most highly and least expressed genes. Red and dark blue circles represent genes correctly predicted to be expressed and unexpressed, respectively, whereas light blue and orange circles represent incorrectly predicted genes. (c) Box plots showing that the difference in FPKM values between genes predicted to be expressed (n = 11,345) and unexpressed (n = 9,156) is statistically highly significant (expressed: median = 4.67, s.d. = 675.3; unexpressed: median = 0.13, s.d. = 97.0; Mann–Whitney U test, two-sided, P < 2.2 × 10^−16). Each box comprises data from the first to the third quartile (interquartile range, IQR) and the median. Whiskers extend to 1.5 × IQR from the box. (d) Example promoter coverage patterns for the NCL (red) and GABRR3 (blue) genes, which are expressed with mean FPKM values of 2,000 and <0.5, respectively. The vertical bars delimit the regions from TSS – 1,000 bp to TSS + 1,000 bp. (e) Averaging FPKM percentiles within integer bins of 2K-TSS and NDR coverage showed a quantitative relationship between these two coverage parameters and gene expression (represented by percentile of FPKM values; for details, see the Supplementary Note).

  4. Figure 4: Procedure for predicting expressed genes in cancer from blood.

    (a) Simulation of resolution limits with in silico dilution employing 2K-TSS and NDR coverage mixing the 1,000 most highly expressed genes with random parameters from the distribution of the 1,000 least expressed genes in plasma (blue, accuracy of >70%; orange, accuracy of 50–70%; red, accuracy of <50%). (b) Workflow for the identification of expressed cancer driver genes in peripheral blood. Matching primary tumor tissue was synchronously obtained with blood samples. CNAs from both the primary tumor and plasma DNA were established for comparison. Expression patterns in the primary tumor were analyzed by RNA-seq and correlated with plasma DNA promoter coverage in relation to the respective copy number status. (c) Copy number profiles of two patients with breast cancer (B7 and B13) from plasma DNA. The x axis shows the chromosomes; the y axis shows log2-transformed copy number ratios. (d) Estimation of tumor purity and ploidy by the quantitative ABSOLUTE method, which estimates tumor purity and ploidy directly from observed relative copy number profiles (ref. 24), for B7 (left) and B13 (right). (e) Heat map illustrating how regional ctDNA allele frequencies are established in relation to overall ctDNA allele frequency (y axis) and copy number (log2-transformed ratio) (x axis). The black line represents an allele frequency of 75%, which was deemed suitable for gene expression prediction. Regions colored in different shades of red and yellow show the varying levels of ctDNA allele frequency.

  5. Figure 5: Identification of expressed genes in cancer from the peripheral blood.

    (a) Box plots showing FPKM values for genes predicted to be expressed or not expressed in focal amplifications of 11q13.3 (15 TSSs in 15 genes including CCND1; n_expressed = 8 and n_unexpressed = 7) in B7 (left) and in both 8p11 (39 TSSs in 31 genes including FGFR1) and 17q12 (59 TSSs in 46 genes including ERBB2) (n_expressed = 87 and n_unexpressed = 11) in B13 (right). Blue dots represent genes located in amplicons. Outliers, including CCND1 (FPKM of 50 in B7) and ERBB2 (FPKM of 15 in B13), are not shown because of scaling. The differences were statistically highly significant (B7: expressed: mean = 9.7, s.d. = 17.0; unexpressed: mean = 0.7, s.d. = 0.8; Mann–Whitney U test, P = 0.003; B13: expressed: mean = 5.7, s.d. = 9.7; unexpressed: mean = 1.5, s.d. = 1.8; P = 0.001). (b) Classification accuracy for the 100 most highly expressed genes on chromosomes 1q in B7 and 8p11-qter in B13 as assessed by RNA-seq of the respective primary tumor tissue. (c) Different coverage of ERBB2 (mean for both isoforms) in B13 and control samples around the TSS. (d) ERBB2 has two isoforms (NM_001005862 and NM_004448). Calculation of the differences in Euclidean distance for 2K-TSS and NDR coverage between blood from patients with cancer and healthy controls established that isoform NM_004448 was highly expressed in the tumor from B13. (e) The distances for 2K-TSS (left) and NDR (right) coverage separately confirm that isoform NM_004448 was highly expressed in the tumor from B13.

  6. Supplementary Fig. 1: Mapping of the nucleosome-depleted region.

    Localization of the NDR, which was mapped by analyses of 100 (red) and 1,000 (orange) highly expressed genes in the 104 plasma samples from healthy donors and which was most often observed in a –150 bp to +50 bp window with respect to the TSS (blue, 1,000 most weakly expressed genes).

  7. Supplementary Fig. 2: Classification of the 5,000 most highly and least expressed genes.

    Support vector machine (SVM) classification based on normalized 2K-TSS and NDR coverage for the 5,000 most highly and least expressed genes. Red and dark blue circles represent genes correctly assigned to the expressed and unexpressed clusters, respectively, whereas light blue and orange circles represent incorrectly assigned genes (as in Fig. 3b).

  8. Supplementary Fig. 3: Quantitative relationship between nucleosome occupancy and gene expression.

    (a) Correlation between 2K-TSS (left) and NDR (right) coverage and FPKM percentiles. (b) Means and distribution of the 2K-TSS and NDR coverage parameters of genes grouped into deciles. (c) Average FPKM percentile of binned 2K-TSS and NDR coverage parameters.

  9. Supplementary Fig. 4: Comparison of copy number profiles of the matching primary tumor with plasma DNA.

    The copy number profiles of the matching primary tumors B7 (top) and B13 (bottom) were obtained by whole-genome sequencing with a shallow sequencing depth. Pairwise comparisons of genomic position–mapped profiles revealed high correlations between the copy number profiles (Pearson correlation coefficients = 0.74 (B7) and 0.88 (B13)).

  10. Supplementary Fig. 5: Reconstruction of the 12p11.1 nucleosome array with high-coverage sequenced plasma samples.

    Assembly of the 12p11.1 nucleosome arrays in plasma samples from B7, B13, controls, and GM12878 for comparison.

  11. Supplementary Fig. 6: TSS nucleosome occupancy of unexpressed and housekeeping genes in high-coverage sequenced plasma samples in B7 and B13.

    (a,b) Nucleosome occupancy at TSSs of unexpressed genes (fantom.gsc.riken.jp/5/) and housekeeping genes2 had the expected different pattern for B7 (a) and B13 (b).

  12. Supplementary Fig. 7: Distribution of the prediction consent.

    Histogram of prediction consent in merged control (n=104) data. For the majority of genes, the prediction consent was above 95%; there are only a few genes with a prediction consent below 75%.
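
The isoform comparison described for Fig. 5d,e rests on Euclidean distances between patient and control values of 2K-TSS and NDR coverage. A minimal Python sketch of that idea follows, assuming each isoform's TSS is summarized as a (2K-TSS, NDR) pair. The isoform labels and all numbers are invented for illustration, and the exact distance calculation used by the authors may differ.

```python
import math


def euclidean(point_a, point_b):
    """Euclidean distance between two equal-length coordinate tuples."""
    if len(point_a) != len(point_b):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))


def isoform_shifts(patient, controls):
    """Map each isoform to the distance between its patient and control features.

    Both arguments are dicts of isoform label -> (2K-TSS, NDR) coverage.
    """
    return {isoform: euclidean(patient[isoform], controls[isoform])
            for isoform in patient}


# Hypothetical normalized coverage values (not data from the paper).
patient = {"isoform_A": (0.70, 0.40), "isoform_B": (0.98, 0.95)}
controls = {"isoform_A": (1.00, 0.80), "isoform_B": (1.00, 0.97)}

shifts = isoform_shifts(patient, controls)
most_shifted = max(shifts, key=shifts.get)
print(most_shifted, round(shifts[most_shifted], 3))
```

The isoform whose promoter coverage departs most from the controls is the candidate for the isoform expressed in the tumor. Computing the distance on each coordinate separately, as in Fig. 5e, only requires passing one-element tuples.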

References

  1. Schwarzenbach, H., Hoon, D.S. & Pantel, K. Cell-free nucleic acids as biomarkers in cancer patients. Nat. Rev. Cancer 11, 426–437 (2011).
  2. Heitzer, E., Auer, M., Ulz, P., Geigl, J.B. & Speicher, M.R. Circulating tumor cells and DNA as liquid biopsies. Genome Med. 5, 73 (2013).
  3. Crowley, E., Di Nicolantonio, F., Loupakis, F. & Bardelli, A. Liquid biopsy: monitoring cancer-genetics in the blood. Nat. Rev. Clin. Oncol. 10, 472–484 (2013).
  4. Diaz, L.A. Jr. & Bardelli, A. Liquid biopsies: genotyping circulating tumor DNA. J. Clin. Oncol. 32, 579–586 (2014).
  5. Heitzer, E., Ulz, P. & Geigl, J.B. Circulating tumor DNA as a liquid biopsy for cancer. Clin. Chem. 61, 112–123 (2015).
  6. Diehl, F. et al. Detection and quantification of mutations in the plasma of patients with colorectal tumors. Proc. Natl. Acad. Sci. USA 102, 16368–16373 (2005).
  7. Lo, Y.M. et al. Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus. Sci. Transl. Med. 2, 61ra91 (2010).
  8. Ramachandran, S. & Henikoff, S. Replicating nucleosomes. Sci. Adv. 1, e1500587 (2015).
  9. Snyder, M.W., Kircher, M., Hill, A.J., Daza, R.M. & Shendure, J. Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell 164, 57–68 (2016).
  10. Gaffney, D.J. et al. Controls of nucleosome positioning in the human genome. PLoS Genet. 8, e1003036 (2012).
  11. Schones, D.E. et al. Dynamic regulation of nucleosome positioning in the human genome. Cell 132, 887–898 (2008).
  12. Venkatesh, S. & Workman, J.L. Histone exchange, chromatin structure and the regulation of transcription. Nat. Rev. Mol. Cell Biol. 16, 178–189 (2015).
  13. Valouev, A. et al. Determinants of nucleosome organization in primary human cells. Nature 474, 516–520 (2011).
  14. Chandrananda, D., Thorne, N.P. & Bahlo, M. High-resolution characterization of sequence signatures due to non-random cleavage of cell-free DNA. BMC Med. Genomics 8, 29 (2015).
  15. Eisenberg, E. & Levanon, E.Y. Human housekeeping genes, revisited. Trends Genet. 29, 569–574 (2013).
  16. Lui, Y.Y. et al. Predominant hematopoietic origin of cell-free DNA in plasma and serum after sex-mismatched bone marrow transplantation. Clin. Chem. 48, 421–427 (2002).
  17. Sun, K. et al. Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc. Natl. Acad. Sci. USA 112, E5503–E5512 (2015).
  18. Koh, W. et al. Noninvasive in vivo monitoring of tissue-specific global gene expression in humans. Proc. Natl. Acad. Sci. USA 111, 7361–7366 (2014).
  19. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  20. Heitzer, E., Ulz, P., Geigl, J.B. & Speicher, M.R. Non-invasive detection of genome-wide somatic copy number alterations by liquid biopsies. Mol. Oncol. 10, 494–502 (2016).
  21. Heidary, M. et al. The dynamic range of circulating tumor DNA in metastatic breast cancer. Breast Cancer Res. 16, 421 (2014).
  22. Heitzer, E. et al. Tumor-associated copy number changes in the circulation of patients with prostate cancer identified through whole-genome sequencing. Genome Med. 5, 30 (2013).
  23. Mohan, S. et al. Changes in colorectal carcinoma genomes under anti-EGFR therapy identified by whole-genome plasma DNA sequencing. PLoS Genet. 10, e1004271 (2014).
  24. Carter, S.L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).
  25. Murtaza, M. et al. Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA. Nature 497, 108–112 (2013).
  26. Ulz, P., Heitzer, E. & Speicher, M.R. Co-occurrence of MYC amplification and TP53 mutations in human cancer. Nat. Genet. 48, 104–106 (2016).
  27. Giordano, S.H. et al. Systemic therapy for patients with advanced human epidermal growth factor receptor 2–positive breast cancer: American Society of Clinical Oncology clinical practice guideline. J. Clin. Oncol. 32, 2078–2099 (2014).
  28. Helsten, T. et al. The FGFR landscape in cancer: analysis of 4,853 tumors by next-generation sequencing. Clin. Cancer Res. 22, 259–267 (2016).
  29. Bettegowda, C. et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci. Transl. Med. 6, 224ra24 (2014).
  30. Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010).
  31. Vogelstein, B. et al. Cancer genome landscapes. Science 339, 1546–1558 (2013).
  32. Ulz, P. et al. Whole-genome plasma sequencing reveals focal amplifications as a driving force in metastatic prostate cancer. Nat. Commun. 7, 12008 (2016).
  33. Adelman, K. & Lis, J.T. Promoter-proximal pausing of RNA polymerase II: emerging roles in metazoans. Nat. Rev. Genet. 13, 720–731 (2012).
  34. Ivanov, M., Baranova, A., Butler, T., Spellman, P. & Mileyko, V. Non-random fragmentation patterns in circulating cell-free DNA reflect epigenetic regulation. BMC Genomics 16 (Suppl. 13), S1 (2015).
  35. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  36. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  37. Lai, W., Choudhary, V. & Park, P.J. CGHweb: a tool for comparing DNA copy number segmentations from multiple algorithms. Bioinformatics 24, 1014–1015 (2008).
  38. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).

Author information

Affiliations

  1. Institute of Human Genetics, Medical University of Graz, Graz, Austria.

    • Peter Ulz,
    • Martina Auer,
    • Ricarda Graf,
    • Jochen B Geigl,
    • Ellen Heitzer &
    • Michael R Speicher
  2. Institute of Molecular Biotechnology, Graz University of Technology, Graz, Austria.

    • Gerhard G Thallinger
  3. BioTechMed OMICS Center Graz, Graz, Austria.

    • Gerhard G Thallinger
  4. Institute of Pathology, Medical University of Graz, Graz, Austria.

    • Karl Kashofer,
    • Stephan W Jahn &
    • Luca Abete
  5. Department of Obstetrics and Gynecology, Medical University of Graz, Graz, Austria.

    • Gunda Pristauz &
    • Edgar Petru

Contributions

P.U. and M.R.S. designed the study. M.A. and R.G. performed the experiments. P.U., G.G.T., J.B.G., E.H., and M.R.S. analyzed data. E.P. and G.P. provided clinical samples and clinical information. S.W.J. and L.A. performed pathology analyses. K.K. conducted RNA-seq. P.U., E.H., and M.R.S. supervised the study. P.U., J.B.G., E.H., and M.R.S. wrote the manuscript. All authors revised the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Mapping of the nucleosome-depleted region.
  2. Supplementary Figure 2: Classification of the 5,000 most highly and least expressed genes.
  3. Supplementary Figure 3: Quantitative relationship between nucleosome occupancy and gene expression.
  4. Supplementary Figure 4: Comparison of copy number profiles of the matching primary tumor with plasma DNA.
  5. Supplementary Figure 5: Reconstruction of the 12p11.1 nucleosome array with high-coverage sequenced plasma samples.
  6. Supplementary Figure 6: TSS nucleosome occupancy of unexpressed and housekeeping genes in high-coverage sequenced plasma samples in B7 and B13.
  7. Supplementary Figure 7: Distribution of the prediction consent.

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–7 and Supplementary Note.

Excel files

  1. Supplementary Table 1

    List of the 100 most highly expressed genes based on plasma RNA-seq data provided by Koh et al. but predicted to be unexpressed (n = 12).

  2. Supplementary Table 2

    List of the 1,000 most highly expressed genes based on plasma RNA-seq data provided by Koh et al. but predicted to be unexpressed (n = 245).

  3. Supplementary Table 3

    Subsampling of sequencing data to establish a lower boundary of necessary sequencing coverage.

  4. Supplementary Table 4

    Prediction and FPKM values for genes in focal amplifications having log2 ratio > 1 in breast cancer case B7.

  5. Supplementary Table 5

    Prediction and FPKM values for genes in focal amplifications having log2 ratio > 1 in breast cancer case B13.
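
Supplementary Tables 4 and 5 select genes in focal amplifications by a log2 ratio > 1. As a rough guide to what such a threshold implies, the Python sketch below computes the expected log2 copy number ratio and the share of plasma DNA at a locus that derives from the tumor. It assumes a simple two-component mixture of diploid normal DNA and tumor DNA, with the ratio taken against a diploid baseline. The function names and example values are hypothetical, and this is not necessarily the calculation behind Fig. 4e.

```python
import math


def _check_fraction(tumor_fraction):
    """Reject tumor fractions outside the closed interval [0, 1]."""
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")


def expected_log2_ratio(tumor_fraction, tumor_copies, normal_copies=2):
    """Expected log2 ratio at a locus for a tumor/normal DNA mixture."""
    _check_fraction(tumor_fraction)
    mixed = tumor_fraction * tumor_copies + (1 - tumor_fraction) * normal_copies
    if mixed <= 0:
        raise ValueError("mixture has no copies at this locus")
    return math.log2(mixed / normal_copies)


def regional_tumor_fraction(tumor_fraction, tumor_copies, normal_copies=2):
    """Share of DNA fragments at a locus that originate from tumor cells."""
    _check_fraction(tumor_fraction)
    tumor_part = tumor_fraction * tumor_copies
    total = tumor_part + (1 - tumor_fraction) * normal_copies
    if total <= 0:
        raise ValueError("mixture has no copies at this locus")
    return tumor_part / total


# Hypothetical example: 50% overall tumor fraction, 6 tumor copies at the locus.
ratio = expected_log2_ratio(0.5, 6)         # (3 + 1) / 2 = 2, so log2 = 1.0
regional = regional_tumor_fraction(0.5, 6)  # 3 / (3 + 1) = 0.75
print(ratio, regional)
```

Under these assumptions, a locus exceeds a log2 ratio of 1 at a 50% tumor fraction only when the tumor carries more than six copies, and at that point about three quarters of the fragments covering the locus come from the tumor.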
