G-quadruplexes (G4s) are nucleic acid secondary structures that form within guanine-rich DNA or RNA sequences. G4 formation can affect chromatin architecture and gene regulation and has been associated with genomic instability, genetic diseases and cancer progression1,2,3,4. Here we present a high-resolution sequencing–based method to detect G4s in the human genome. We identified 716,310 distinct G4 structures, 451,646 of which were not predicted by computational methods5,6,7. These included previously uncharacterized noncanonical long loop and bulged structures8,9. We observed a high G4 density in functional regions, such as 5′ untranslated regions and splicing sites, as well as in genes previously not predicted to contain these structures (such as BRCA2). G4 formation was significantly associated with oncogenes, tumor suppressors and somatic copy number alterations related to cancer development10. The G4s identified in this study may therefore represent promising targets for cancer intervention.
Gene Expression Omnibus
Rodriguez, R. & Miller, K.M. Unravelling the genomic targets of small-molecules using high-throughput sequencing. Nat. Rev. Genet. 15, 783–796 (2014).
Wolfe, A.L. et al. RNA G-quadruplexes cause eIF4A-dependent oncogene translation in cancer. Nature 513, 65–70 (2014).
Maizels, N. Genomic stability: FANCJ-dependent G4 DNA repair. Curr. Biol. 18, R613–R614 (2008).
Haeusler, A.R. et al. C9orf72 nucleotide repeat structures initiate molecular cascades of disease. Nature 507, 195–200 (2014).
Huppert, J.L. & Balasubramanian, S. Prevalence of quadruplexes in the human genome. Nucleic Acids Res. 33, 2908–2916 (2005).
Eddy, J. & Maizels, N. Gene function correlates with potential for G4 DNA formation in the human genome. Nucleic Acids Res. 34, 3887–3896 (2006).
Kikin, O., D'Antonio, L. & Bagga, P.S. QGRS Mapper: a web-based server for predicting G-quadruplexes in nucleotide sequences. Nucleic Acids Res. 34, W676–W682 (2006).
Mukundan, V.T. & Phan, A.T. Bulges in G-quadruplexes: broadening the definition of G-quadruplex-forming sequences. J. Am. Chem. Soc. 135, 5017–5028 (2013).
Guédin, A., Gros, J., Alberti, P. & Mergny, J.L. How long is too long? Effects of loop size on G-quadruplex stability. Nucleic Acids Res. 38, 7858–7868 (2010).
Zack, T.I. et al. Pan-cancer patterns of somatic copy number alteration. Nat. Genet. 45, 1134–1140 (2013).
Bochman, M.L., Paeschke, K. & Zakian, V.A. DNA secondary structures: stability and function of G-quadruplex structures. Nat. Rev. Genet. 13, 770–780 (2012).
Cruz, J.A. & Westhof, E. The dynamic landscapes of RNA architecture. Cell 136, 604–609 (2009).
Davis, J.T. G-quartets 40 years later: from 5′-GMP to molecular biology and supramolecular chemistry. Angew. Chem. Int. Ed. Engl. 43, 668–698 (2004).
Biffi, G., Tannahill, D., McCafferty, J. & Balasubramanian, S. Quantitative visualization of DNA G-quadruplex structures in human cells. Nat. Chem. 5, 182–186 (2013).
Henderson, A. et al. Detection of G-quadruplex DNA in mammalian cells. Nucleic Acids Res. 42, 860–869 (2014).
Biffi, G., Tannahill, D., Miller, J., Howat, W.J. & Balasubramanian, S. Elevated levels of G-quadruplex formation in human stomach and liver cancer tissues. PLoS ONE 9, e102711 (2014).
Weitzmann, M.N., Woodford, K.J. & Usdin, K. The development and use of a DNA polymerase arrest assay for the evaluation of parameters affecting intrastrand tetraplex formation. J. Biol. Chem. 271, 20958–20964 (1996).
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Rodriguez, R. et al. A novel small molecule that alters shelterin integrity and triggers a DNA-damage response at telomeres. J. Am. Chem. Soc. 130, 15758–15759 (2008).
Hud, N.V., Smith, F.W., Anet, F.A.L. & Feigon, J. The selectivity for K+ versus Na+ in DNA quadruplexes is dominated by relative free energies of hydration: A thermodynamic analysis by H-1 NMR. Biochemistry 35, 15383–15390 (1996).
Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencer traces using Phred. 1. Accuracy Assessment. Genome Res. 8, 175–185 (1998).
Rodriguez, R. et al. Small-molecule-induced DNA damage identifies alternative DNA structures in human genes. Nat. Chem. Biol. 8, 301–310 (2012).
Fernando, H. et al. A conserved quadruplex motif located in a transcription activation site of the human c-kit oncogene. Biochemistry 45, 7854–7860 (2006).
Rankin, S. et al. Putative DNA quadruplex formation within the human c-kit oncogene. J. Am. Chem. Soc. 127, 10584–10589 (2005).
Marchand, A. et al. Ligand-induced conformational changes with cation ejection upon binding to human telomeric DNA G-quadruplexes. J. Am. Chem. Soc. 137, 750–756 (2015).
De Cian, A., DeLemos, E., Mergny, J.-L., Teulade-Fichou, M.-P. & Monchaud, D. Highly efficient G-quadruplex recognition by bisquinolinium compounds. J. Am. Chem. Soc. 129, 1856–1857 (2007).
Palumbo, S.L., Ebbinghaus, S.W. & Hurley, L.H. Formation of a unique end-to-end stacked pair of G-quadruplexes in the hTERT core promoter with implications for inhibition of telomerase by G-quadruplex-interactive ligands. J. Am. Chem. Soc. 131, 10878–10891 (2009).
Bugaut, A. & Balasubramanian, S. A sequence-independent study of the influence of short loop lengths on the stability and topology of intramolecular DNA G-quadruplexes. Biochemistry 47, 689–697 (2008).
Siddiqui-Jain, A., Grand, C.L., Bearss, D.J. & Hurley, L.H. Direct evidence for a G-quadruplex in a promoter region and its targeting with a small molecule to repress c-MYC transcription. Proc. Natl. Acad. Sci. USA 99, 11593–11598 (2002).
Paeschke, K. et al. Pif1 family helicases suppress genome instability at G-quadruplex motifs. Nature 497, 458–462 (2013).
Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L. & Rice, P.M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38, 1767–1771 (2010).
We thank C. Lowe and D. Tannahill for critical reading of the manuscript and D. Beraldi for technical support. We thank P. McCauley (Illumina) who prepared the custom sequencing buffers. We are grateful to the Biotechnology and Biological Sciences Research Council (BBSRC) and Illumina for the studentship supporting V.S.C. (BB/I015477/1). The S.B. research group is supported by program funding from Cancer Research UK and the European Research Council and project funding from BBSRC.
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Overall sequencing quality in sequencing experiments with the cations Li+, Na+ and K+.
Each plot visually shows base calling quality (Phred quality score, Q; y-axes) for the 250 sequenced bases (x-axes), in two independent experiments, with sequencing buffers containing Li+ (top), Na+ (middle) and K+ (bottom), as generated by the program FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Yellow bars and black whiskers are box plots for the respective base positions; red lines are median values; blue lines are mean values.
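The Phred quality score plotted in these figures follows the standard definition Q = -10 log10(P), where P is the estimated probability that a base call is wrong. A minimal sketch of the conversion is given below; the function names are illustrative and are not taken from the G4-Seq analysis scripts.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into a base-calling error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) into a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)
```

Under this definition, Q = 30 corresponds to an expected error rate of 1 in 1,000 base calls, and Q = 20 to 1 in 100.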
Supplementary Figure 2 Sequencing quality and sequencing errors (% mismatches) for control sequences.
Bar plots showing (a) the average Phred quality score (Q) and (b) % mismatches for the 4 control sequences when sequencing with the cations Li+ (left), Na+ (middle) and K+ (right). c-kit, c-myc: positive controls; c-myc-opp, c-myc-mut: negative controls (see Methods). Data are taken from independent sequencing experiments: 3 for Na+ and 2 each for Li+ and K+. The numbers of control sequences (i.e., independent sequencing clusters on the flow cell) in the combined experiments are, in the order c-kit, c-myc-opp, c-myc, c-myc-mut: 2741, 1139, 1040, 10945 for Li+; 8235, 3076, 2787, 26974 for Na+; 2935, 1315, 1, 12809 for K+. Error bars are standard deviations. No error bar is shown for c-myc in K+ (n=1).
% mismatches for the control sequences in the same sequencing experiment with Na+ sequencing buffers during the first read (Read-1; left) followed by the addition of the small-molecule PDS in Na+ throughout the second read (Read-2; right). Error bars are SEMs (respectively: 0.16, 0.02, 0.18 and 0.07 for left plot; 0.12, 0.08, 0.15 and 0.09 for right plot). N = 948, 367, 367 and 3990 for c-kit, c-myc-opp, c-myc, c-myc-mut.
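As a rough illustration of the quantities reported in these legends, the sketch below computes a % mismatch value for one read against its known control template and then summarizes a set of such values as a mean with its standard error (SEM = sample standard deviation / sqrt(n)). It assumes ungapped, position-by-position comparison; the function names are hypothetical and the authors' actual pipeline may compute mismatches differently.

```python
import math
import statistics
from typing import Sequence, Tuple


def percent_mismatch(read: str, reference: str) -> float:
    """Percentage of compared positions where the read differs from the reference.

    Assumes the read starts at reference position 0 and has no indels.
    """
    n = min(len(read), len(reference))
    if n == 0:
        raise ValueError("nothing to compare")
    mismatches = sum(1 for a, b in zip(read, reference) if a != b)
    return 100.0 * mismatches / n


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean; the SEM is NaN when n == 1."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, float("nan")
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem
```

The n = 1 branch mirrors the situation noted for c-myc in K+, where a single cluster gives a mean but no error bar.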
Supplementary Figure 4 Heat-map of c-kit control sequence representing the base called at each position along the template in read 1 Na+ conditions and read 2 Na+ + PDS G4-stabilizing conditions.
Each line represents an independent read. Each column is a position in the template. The different colors represent the different bases which are called: A= green, C= blue, T= red, G= orange. As depicted in the figure below, in Read-1 (top panel) the sequencing polymerase incorporates the expected bases with high fidelity until the end of the template, after which scrambled sequencing is observed as the polymerase no longer has a template to process. In Read-2 (bottom panel), under G4-stabilizing conditions, a similar effect on base-calling is observed from the G4 structure onwards, suggesting that the polymerase stalls at the G4 structure. Ordinarily, DNA fragments displaying such truncations are removed in the data analysis, whereas in our study they are intentionally retained.
Phred Quality Score (Q) along the c-kit control sequence in: Read-1 Na+ conditions (top panel) and Read-2 Na+ + PDS G4 stabilizing conditions (bottom panel). A clear drop in Q score is observed in Na+ at the truncation site (end of the fragment). A similar effect can be observed at the G4 start site exclusively under G4-stabilizing conditions.
a) Venn diagram showing the overlap between the two conditions. The numbers in brackets at the top indicate the total OQs detected in K+ (525,890) and Na+ + PDS (716,310); the percentages on the side (27% and 46%) indicate the OQs detected exclusively by either condition; the percentages in the middle overlapping both circles indicate OQs detected by both conditions, as a percentage of total K+ OQs (73%) or total PDS OQs (54%). The 27% of OQs detected solely in K+ displayed an average mismatch percentage of ~18.5% in PDS (panel b); the 46% of OQs detected solely in PDS displayed an average mismatch percentage of ~10% in K+ (panel c). The average mismatch percentage of randomly selected genomic locations is below 2%. b) Mismatches for OQs detected in K+ only (27% subset in the Venn diagram). c) Mismatches for OQs detected in PDS only (46% subset in the Venn diagram). In both b) and c), red lines indicate the thresholds set for the two data sets, 18% for K+ and 25% for PDS (Methods).
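To make the role of the thresholds concrete, the sketch below calls candidate intervals from a per-position % mismatch track by keeping contiguous runs at or above a cutoff. Only the cutoff values (18% for K+, 25% for PDS) come from the legend above; the run-based calling rule, the use of `>=`, and all names are assumptions for illustration, and the procedure described in Methods may differ.

```python
from typing import List, Sequence, Tuple

# Thresholds quoted in the legend; everything else here is hypothetical.
THRESHOLDS = {"K+": 18.0, "PDS": 25.0}


def call_intervals(mismatch_pct: Sequence[float],
                   threshold: float) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs where % mismatch >= threshold."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(mismatch_pct):
        if value >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            intervals.append((start, i))  # the run ended at the previous position
            start = None
    if start is not None:
        intervals.append((start, len(mismatch_pct)))  # run reaches the end of the track
    return intervals
```

With such a rule, a region averaging ~18.5% mismatches in PDS would fall below the 25% PDS cutoff even though it sits far above the <2% background, which is consistent with the legend's point that many condition-specific OQs still show elevated mismatches in the other condition.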
a) % mismatches for the control sequences in the same sequencing experiment with Na+ sequencing buffers for the first read (Read-1; left) followed by the addition of the small-molecule PhenDC3 in Na+ throughout the second read (Read-2; right). b) % OQs detected in both K+ and PDS G4-Seq experiments (Left), K+ and PhenDC3 (Middle), PDS and PhenDC3 (Right). 85% of all OQs detected in PDS were also observed with PhenDC3. c) % PQs detected in both K+ and PDS G4-Seq experiments (Left), K+ and PhenDC3 (Middle), PDS and PhenDC3 (Right). 93% of all PQs detected in PDS were also observed with PhenDC3. d) % predicted G4s (PQs) among all the detected G4s (OQs) in different G4-Seq experiments. PhenDC3 and PDS show similar capability to detect G4s genome-wide. e) Summary of the comparison between different G4-Seq experiments.
Left: canonical PQs predicted by Quadparser (L1-3=N1-7, with N=A|C|T|G). Middle: PQs with longer loops (L1-3=N8-12 or L2=N8-21). Right: PQs with a single bulge B1=H1-7 or multiple bulges B2=H1-5 (H=A|T|C).
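The canonical definition above (four runs of at least three guanines separated by three loops of 1 to 7 bases) maps naturally onto a regular expression. The sketch below is a minimal Python rendering of that pattern, not the Quadparser program itself; it scans one strand only and reports non-overlapping matches, and the function name is hypothetical. The long-loop and bulged families would need additional patterns that are not attempted here.

```python
import re
from typing import List, Tuple

# Four G-runs (>= 3 G each) separated by three loops of 1-7 bases (N = A|C|G|T).
CANONICAL_PQ = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_canonical_pqs(seq: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for non-overlapping canonical PQ matches."""
    seq = seq.upper()
    return [(m.start(), m.end(), m.group()) for m in CANONICAL_PQ.finditer(seq)]
```

As a usage example, `find_canonical_pqs("TTGGGAGGGAGGGAGGGTT")` returns one match spanning positions 2 to 17, whereas the bulged FOSB promoter sequence `GGCGCGGGCGGGGCGCGGG` from Supplementary Figure 12 returns no match, which illustrates why such structures are missed by the canonical pattern. To cover G4s on the opposite strand, the same search would be run on the reverse complement (or with a C-run version of the pattern).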
Supplementary Figure 9 Detection of OQs representing different G-quadruplex structural families in PDS versus K+ conditions.
Fold enrichment (ratio) between the numbers of OQs in PDS over K+ for each structural family. Values > 1 indicate higher numbers in PDS. G-quadruplex structural families: Loop 1-3; Loop 4-5; Loop 6-7: OQs with at least one loop of the indicated length; Long loops: OQs with any loop of length 8 to 12 for L1-3 or 8 to 21 for L2; Bulges: OQs with one bulge of 1 to 7 bases (A, T, C) or multiple bulges of 1 base.
a) Genome browser view of a genomic region within the MYC oncogene. Red and orange tracks: % mismatches in reads aligning to the reverse strand (-) for PDS and K+, respectively. OQ intervals are shown as red and orange bars below the corresponding peaks. b) Genome browser view of a genomic region within the MYL5-MFSD7 gene. Black and blue tracks: % mismatches in reads aligning to the forward strand (+) for PDS and K+, respectively. OQ intervals are shown as black and blue bars below the corresponding peaks. c) Genome browser view of a genomic region within the MYL9 gene. All colors and features as in a). See Supplementary Table 5 for sequence details. For all panels, OQs not predicted by Quadparser are indicated by * and Quadparser PQs are shown as black bars.
Supplementary Figure 11 Comparison of forward versus reverse strands in PDS sequencing conditions for the MYL9 gene.
Genomic region within the MYL9 gene. Red and black tracks: % mismatches in reads aligning to the reverse strand (-) and forward strand (+), respectively. OQ intervals are shown as red and black bars below the corresponding peaks. Quadparser PQs are shown below in black. OQs not predicted by Quadparser are indicated by asterisks (*). See Supplementary Table 5 for sequence details.
Supplementary Figure 12 Circular dichroism (CD) spectra of noncanonical G4 sequences identified in the FOSB and MUC20 genes as OQs by G4-seq.
a) A sequence from the FOSB gene promoter with a single bulge in one G-run: GGC GCG GGC GGG GCG CGG G; b) an additional sequence from the FOSB gene promoter with a long loop: GGG GCG GGT GAC GTA AGC AGG GGG GCG GG; c) a sequence from the FOSB gene 5' UTR containing both a long loop and two bulges: GAG GTA CAG CGG CAT CCT GTG GGG GCC TGG G; d) a sequence from the MUC20 gene promoter containing multiple bulges: GGA GTG TTT ACC GGG CAA GGA AGG CAG GG. The CD analysis is consistent with G4 formation for all of these sequences (λ max ~263 nm; λ min ~240 nm)S1.
Supplementary Figures 1–12 and legends (PDF 5196 kb)
Supplementary Tables 1–4 and legends; Supplementary Tables 7–8 and legends; Legends for Supplementary Tables 5 and 6 (PDF 457 kb)
Genomic coordinates and full sequences of selected OQ regions (XLSX 67 kb)
Number of OQs present in annotated genes (XLSX 3513 kb)
Sequencing script for G4-Seq MiSeq experiments for either R2=K+ or Na+ + PDS conditions (XML 29 kb)
Sequencing script for G4-Seq HiSeq experiments in which R1=Na and R2= K+ (XML 28 kb)
Sequencing script for G4-Seq MiSeq HiSeq experiments in which R1=Na and R2= Na+ + PDS (XML 33 kb)
Computational analysis scripts for G4-Seq (ZIP 14 kb)
About this article
Cite this article
Chambers, V., Marsico, G., Boutell, J. et al. High-throughput sequencing of DNA G-quadruplex structures in the human genome. Nat Biotechnol 33, 877–881 (2015). https://doi.org/10.1038/nbt.3295