A predictable conserved DNA base composition signature defines human core DNA replication origins

DNA replication initiates from multiple genomic locations called replication origins. In metazoa, DNA sequence elements involved in origin specification remain elusive. Here, we examine pluripotent, primary, differentiating, and immortalized human cells, and demonstrate that a class of origins, termed core origins, is shared by different cell types and hosts ~80% of all DNA replication initiation events in any cell population. We detect a shared G-rich DNA sequence signature that coincides with most core origins in both human and mouse genomes. Transcription and G-rich elements can independently associate with replication origin activity. Computational algorithms show that core origins can be predicted based solely on DNA sequence patterns, but not on consensus motifs. Our results demonstrate that, despite an attributed stochasticity, core origins are chosen from a limited pool of genomic regions. Immortalization through oncogenic gene expression, but not normal cellular differentiation, results in increased stochastic firing from heterochromatin and decreased origin density at TAD borders.

Higher activity origins display higher ubiquity across replicates and cell types.
(a) Euler diagrams showing the fraction of origins shared by three immortalized cell lines.
(b) Black dots show the percentage of origins in each quantile that overlap origins detected in a previous SNS-seq 1 study. Grey dots represent the overlap expected by chance for randomly shuffled control genomic regions equal in size and number to our origins. P-values were obtained by Chi-square Goodness-of-Fit test using observed and expected values for overlap.
(c) As in (b) for regions identified by INI-seq 2. Red dots depict the percentage of early-firing origins identified by INI-seq 2, which is an in vitro method that identifies the earliest firing origins.
(d) As in (b) for OK-seq 3 regions.
(e) Tightly clustered core origins are more likely to be identified by the alternative origin mapping method OK-seq 3. Bar plot showing the percentage of tightly clustered core origins (in black) that overlap with DNA replication initiation zones identified by OK-seq. Dotted bars represent the overlap expected by chance for randomly shuffled control genomic regions equal in size and number to OK-seq regions. P-values were obtained by Chi-square Goodness-of-Fit test using observed and expected values for overlap.
(f) Core origins overlap with binding sites of the pre-RC components ORC1 and ORC2. Graph shows the percentage of origins in each quantile that overlap with regions bound by ORC1 or ORC2 (red) or ORC2 (blue) within ± 2 kb. Paler coloured dots represent the overlap expected by chance for randomly shuffled control genomic regions equal in size and number to our origins.
(g) ORC2 binding sites that occupy larger genomic regions are more likely to be associated with DNA replication origins. Pie charts represent the percentage of ORC2-bound sites in the genome that intersect a core or a stochastic origin (within ± 2 kb). Left panel represents ORC2-bound regions longer than 1 kb, and the right panel represents ORC2-bound regions longer than 2 kb. P-values were obtained using the Chi-square Goodness-of-Fit test in R with observed and expected overlap values.

Table 2. The Y-axis is in arbitrary units representing the importance assigned to each variable by each algorithm.
(d) Schematic summary of the hematopoietic cell (HC) differentiation protocol. HC (CD34+) were isolated from three independent human cord blood donors and expanded in three independent cultures for 6-7 days. Then, erythropoietin (+EPO) was added to the culture medium (Day 0) for 6 days, and cells were harvested at day 0, day 3 and day 6 for SNS-seq and RNA-seq analysis.
(e) Origins with increased activity after erythrocyte differentiation (day 6) are in genomic regions that host genes related to erythrocyte differentiation. The genomic coordinates of origins that were significantly upregulated upon EPO addition (day 0 vs day 6) were analysed with GREAT. Origin regions were associated with genes using the single-gene (SG) rule of GREAT. Only one category was statistically significant (binomial p-value < 0.05), and it is plotted here.

(a) Pie charts representing the percentage of DNA replication initiation events (as assessed by normalized SNS-seq counts) at known origins that originate from Q1, Q2 (core origins) or Q3-10 (stochastic origins) in all cell types used in this study.
(b) Origin G-rich sequence-specificity is lost upon immortalization. In immortalized cells, origins that are down-regulated (black bars) in comparison to the parental cell line (HMEC) tend to overlap with CpGi (left panel) or G4 (right panel) elements. In contrast, origins upregulated upon immortalization (white bars) overlap with CpGi or G4 elements less than expected. For reference, the dotted line shows the percentage of all origins that overlap with a CpGi (left panels) or G4 (right panels).
(c) Same as in (b), but for core origins that are up- or down-regulated upon immortalization. For reference, the dotted line shows the percentage of core origins that overlap with a CpGi (left panels) or G4 (right panels).
(d) Mouse core (left panel) and stochastic (right panel) origin density across topologically associating domains (TADs) of mouse embryonic stem cells 6. Origin density along TADs (blue) or equal-size control regions (grey) was computed as follows. Each TAD was divided into 100 equal bins (slices), and the origin density in each bin was calculated as the number of origins per Mb. The p-value was calculated using the non-parametric Wilcoxon test in R.
(e) Density of core origins that are active in hESC H9 (left panel), HC (middle panel) or HMEC (right panel) across TADs (determined in hESC H1). Origin density along TADs was computed as in (d).
(f) Core origins coincide with putative regulatory elements. Plot shows the overlap of origins (Q1-Q10) with human genome regions that have putative regulatory functions (as defined by ReMap, >10 peaks).
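
The binning procedure used for the TAD density panels (each TAD divided into 100 equal bins, density reported as origins per Mb) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `tad_bin_density`, the use of origin midpoints, the restriction to a single chromosome, and the pooling of counts and bin lengths across TADs are all assumptions made here for illustration.

```python
from bisect import bisect_left


def tad_bin_density(tads, origins, n_bins=100):
    """Origin density (origins per Mb) in each relative TAD bin.

    tads    : list of (start, end) coordinates in bp, one chromosome (assumption)
    origins : list of origin midpoints in bp on the same chromosome (assumption)
    n_bins  : number of equal slices per TAD (100 in the legend)

    Counts and bin lengths are pooled over all TADs before dividing,
    which is one possible way to aggregate (assumption).
    """
    positions = sorted(origins)
    counts = [0] * n_bins
    bin_bp = [0.0] * n_bins

    for start, end in tads:
        width = (end - start) / n_bins
        for i in range(n_bins):
            lo = start + i * width
            hi = start + (i + 1) * width
            # Number of origins with lo <= position < hi.
            counts[i] += bisect_left(positions, hi) - bisect_left(positions, lo)
            bin_bp[i] += width

    # Convert pooled counts to origins per megabase.
    return [c * 1e6 / bp if bp else 0.0 for c, bp in zip(counts, bin_bp)]


if __name__ == "__main__":
    # Hypothetical toy data: one 1 Mb TAD and four origin midpoints.
    density = tad_bin_density([(0, 1_000_000)], [5_000, 15_000, 15_500, 995_000])
    print(density[:3], density[-1])
```

The same function applied to equal-size control regions would give the comparison profile; the statistical comparison between the two profiles is left out of this sketch.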