The genetic architecture of DNA replication timing in human pluripotent stem cells

DNA replication follows a strict spatiotemporal program that intersects with chromatin structure but has a poorly understood genetic basis. To systematically identify genetic regulators of replication timing, we exploited inter-individual variation in human pluripotent stem cells from 349 individuals. We show that the human genome’s replication program is broadly encoded in DNA and identify 1,617 cis-acting replication timing quantitative trait loci (rtQTLs) – sequence determinants of replication initiation. rtQTLs function individually, or in combinations of proximal and distal regulators, and are enriched at sites of histone H3 trimethylation of lysines 4, 9, and 36 together with histone hyperacetylation. H3 trimethylation marks are individually repressive yet synergistically associate with early replication. We identify pluripotency-related transcription factors and boundary elements as positive and negative regulators of replication timing, respectively. Taken together, human replication timing is controlled by a multi-layered mechanism with dozens of effectors working combinatorially and following principles analogous to transcription regulation.

(a-d) Validation of rtQTLs in 192 iPSC lines (Methods; two-sided binomial tests). The left panels are examples of rtQTLs in hESCs. The right panels show replication timing in the same regions in iPSCs, stratified by the genotype of the top rtQTL SNP discovered in the hESCs (vertical line). Association p-values in iPSCs are indicated. Excellent agreement between hESCs and iPSCs demonstrate that the rtQTLs discovered in hESCs are reproducible in an independent cohort. (e-g) SMARD (single-molecule analysis of replicated DNA 23 ) analysis of an rtQTL on chromosome 2 (Fig. 1i) in Mel1 and H9 cell lines confirms variation in initiation site activity consistent with rtQTL genotypes. (f) Replication timing flanking the rtQTL locus (gray region); green line: the region analyzed by SMARD. The initiation site on the left side of the green line is an rtQTL (panel e), at which Mel1 and H9 carry the early-replicating and heterozygous genotype, respectively. (g) SMARD results, where each line indicates one DNA molecule, and the shift from red to green reveals the location and direction of replication forks (yellow arrows). Significantly more forks are progressing from 5' to 3' in Mel1 when compared with H9 (p = 4.69´10 -4 , Fisher's exact test, two-sided), indicating that the upstream initiation site is much stronger in Mel1 than H9, consistent with the rtQTL analysis. (h, i) rtQTLs are highly reproducible between the ESCs and iPSCs. When directly testing ESC rtQTLs using iPSCs (h) or vice versa (i), the p-values show strong positive correlation.
Supplementary Figure 3. Primary and Secondary rtQTLs have Significantly More 3D Contacts than Expected by Chance.
The median Hi-C contact score for all primary-secondary rtQTL pairs was 2.037 (red vertical bar). This value is significantly higher than 100 random permutations (black distribution), in which the distances between primary and secondary rtQTLs were preserved but actual genomic locations were randomly shifted. We confirmed the normality assumption of the plotted distribution using the Wilks-Shapiro test (p > 0.05; two-sided). (a, b) Enrichment of chromHMM chromatin states at rtQTLs identified in hESCs (a) or iPSCs (b). Orange bars: 95% confidence intervals. NS: not significant at Bonferroni-corrected p = 0.05. (c, d) Enrichment of histone marks at hESC (c) and iPSC (d) rtQTLs. Similar to panels a and b. (e) Breakdown of gene types located within rtQTLassociated regions. The number of genes in rtQTL-associated regions was significantly lower than expected (p = 4.85´10 -17 , two-sided Z-test) without enrichment for gene ontology terms 44 . (f) Functional annotations of rtQTL genetic variants. (g) rtQTLs colocalize with active histone modifications. Bottom panels: hESC ChIP-seq tracks of active histone modifications. Imputed histone tracks 45,46 from the Roadmap Epigenomics Project were used. Red arrows: locations of rtQTL variants indicated in the top panels. (h) A multi-rtQTL region (same as Fig. 2c) at which both the primary and secondary rtQTLs overlap active histone marks. (i, j) No enrichment in histone marks was observed at non-peak (i) and non-rtQTL (j) loci that match the replication timing of peaks or rtQTLs, respectively. hESC replication timing profiles were used, with the same enrichment analysis procedures as used for rtQTLs (panel c). (k-n) Enrichment analysis of RepeatMasker repetitive elements (k, m) and non-B DNA motifs (l, n) at rtQTLs (k, l) and the histone modification sites (m, n). Orange bars: 95% confidence intervals. LINE: long interspersed nuclear elements. SINE: short interspersed nuclear elements. LTR: long terminal repeats. Significant enrichments were only observed for G4 and Z-DNA motifs at the histone modification sites (panel n). n = 592 genomic regions analyzed in panels a, c, k, l, m, n; n = 1,126 genomic regions analyzed in panels b and d; n = 2,663 genomic regions analyzed in panel i; n = 12,003 genomic regions analyzed in panel j. All error bars are 95% confidence intervals. For panels a, b, c, d, i, j, k, l, m, and n, the center of the error bars in median. These panels used the Binomial test; the displayed -log10(p-values) were uncorrected, but the significance threshold was corrected for multiple testing. (a, b) Correlation of histone acetylation marks at hESC rtQTL sites, using imputed (a) or observed (b) data. (c, d) me 3 ac hyper regions are closer to peaks than permutations. Bins containing <=3 regions were excluded. (e) Histone mark combinations correspond to replication initiation sites. (f) Probability of having an initiation site as a function of distance from histone marks (40 kb bins), for each individual histone combination. (g) Normalized cumulative probability of initiation sites within 200 kb of individual histone marks/combinations. Probabilities were normalized by subtracting the permutation mean. Error bars: standard deviation. n = 414 permutations (first bar), n = 27, 160, 129, 72, 13, 13 histone modifications or combinations (for 1/2/3/4/5-mark and me 3 ac hyper , respectively). (h, i) Number (h) and total size (i) of regions of histone marks or combinations. Error bars: standard deviation. (j) ROC curves for histone modifications predicting replication initiation sites. Diagonal lines: random guesses. (k, l) ROC analysis of histone marks and combinations. (k) Distribution of pAUC for histone marks and combinations. Partial area under the left half of the curve was calculated. Error bars: standard deviation. Horizontal bar: pAUC of random guess. In panels h, i, and k: n = 3, 29, 48, 34, 12, 13 histone modifications or combinations (for 1/2/3/4/5mark and me 3 ac hyper , respectively). In panels g, h, i, and k, the bars represent means. (l) Averaged ROC curves for histone marks or combinations. (m) Genomic regions with high density of the me 3 ac hyper histone modifications are closer to replication timing peaks than isolated histone modification regions. (n) Histone modifications are specific to replication timing peaks. Comparison between distance from histone modification regions to actual replication timing peaks and regions with similar replication timing (distant from peaks). (o) Cumulative replication timing surrounding histone modifications. Gray: ten permutations. (p) Cumulative replication timing in hESCs and LCLs surrounding histone modification locations found in both cell types (gray), LCLs only (orange), or hESCs only (blue). (q) Histone modification combinations are more specific at predicting replication timing peaks than DNase hypersensitivity sites.

Supplementary
Supplementary Figure 6. Validation of Histone Enrichments in Individual Cell Lines and Using "Gapped" or "Narrow" Peak.
(a, b) Enrichment of histone marks at hESC rtQTLs, using gappedPeak (panel a) or narrowPeak (panel b) data from individual hESC lines. Each column represents an hESC line and each row represents a histone mark. The E001, E002, E003, E008, E014, E015, E016, E024 cell lines are also known as I3, H7, H1, H9, HUES48, HUES6, HUES64, and UCSF4, respectively. Asterisk: also enriched using the union of histone mark data from all eight hESC lines (Fig. S4c). In panel a, all histone marks with an asterisk were enriched in at least one hESC line (p < 6.25×10 -3 , i.e., Bonferroni-corrected for the 8 hESC lines), while 20 of them were enriched in at least four hESC lines. In panel b, 22 of the 24 histone marks with an asterisk were enriched in at least one hESC line. Panels a and b used the Binominal test; the displayed -log10(p-values) were uncorrected for multiple testing, Bonferroni correction was used when determining significance threshold. (c, d) me 3 ac hyper regions identified using gappedPeak (panel c) or narrowPeak (panel d) data from individual hESC lines both show close proximity to replication initiation sites. There were a total of 5,120 and 2,252 me 3 ac hyper regions identified across eight hESC lines using gappedPeak or narrowPeak data, respectively. These results reproduce those in Fig. S4c and Fig. 3e using data from individual hESC lines as well as using narrowPeak data. For panels c and d, a One-sided Wilcoxon rank-sum test was used; correction for multiple testing was not applicable as there was only one test for each sub-plot. Figure 7. Examples of me 3 ac hyper Regions in Individual hESC Lines. Fig. 3d. Roadmap Epigenomics Project data was used for histone signal track plotting. Experimentally measured data (i.e., "observed") data was used whenever available, and computationally imputed data was used otherwise. For simplicity, only one of the variable acetylation marks is plotted. The specific hESC line from which the histone signal tracks were derived is indicated in each panel.