High-throughput analysis of single human cells reveals the complex nature of DNA replication timing control

DNA replication initiates from replication origins firing throughout S phase. Debate remains about whether origins are a fixed set of loci, or a loose agglomeration of potential sites used stochastically in individual cells, and about how consistent their firing time is. We develop an approach to profile DNA replication from whole-genome sequencing of thousands of single cells, which includes in silico flow cytometry, a method for discriminating replicating and non-replicating cells. Using two microfluidic platforms, we analyze up to 2437 replicating cells from a single sample. The resolution and scale of the data allow focused analysis of replication initiation sites, demonstrating that most occur in confined genomic regions. While initiation order is remarkably similar across cells, we unexpectedly identify several subtypes of initiation regions in late-replicating regions. Taken together, high throughput, high resolution sequencing of individual cells reveals previously underappreciated variability in replication initiation and progression.


Supplementary Figure 2. Distribution of single-cell read counts in 200kb windows depends on S-phase progression.
Forty single cells from the S-phase FACS fraction library are shown. Read counts are distributed around a single mode in cells in G1/G2 (top row). The bimodal distribution of read counts in replicating cells is most apparent in mid-S phase. For the left-most cell in each row, read counts and corresponding copy-number inferences (CN) are displayed for the whole genome. Representative examples were selected to reflect patterns across S-phase progression.
Supplementary Figure 3. Assessment of replication state inference by hidden Markov model (HMM) with simulated data. a 2,500 cells were simulated with known "true" replication states at each genomic window, distributed across S phase. Each row represents one simulated cell. b Raw read counts were drawn from a Poisson process for each "true" replication state, with rate coefficients chosen to match the coverage of the GM12878 unsorted library. These raw counts were then processed with the same pipeline as the observed data. Each row represents the corresponding cell in a. Incorrect copy-number inferences were most frequent in mid-S phase cells (c) and in lowercoverage cells (d), while the effects of replication timing were mild (e). f Copy-number inference allowed identification of 98.6% of "true" replicons. Replicons comprised of a single 20kb window were frequently missed (26.8% not successfully detected). Replicons comprised of at least two windows were detected in >90% of cases.
Supplementary Eight representative examples are shown, in which a single IR was called, but visual inspection suggests multiple discrete initiation sites that have not been disambiguated, supporting the notion that these IR widths are overestimated. One counterexample is shown in the bottom right (Chr1, ~70.8Mb) -this IR's width may be due to it encompassing a cluster of origins in close proximity or to replication fork asymmetry.
Supplementary Figure 6. Metrics of in silico cell sorting for all cell lines under study. a In silico sorting identifies two distinct populations of cells based on the relationship between mean reads per Mb and scaled MAPD, corresponding to G1/G2-and S-phase cells. GM18507 cells were sequenced after DLP+ library preparation 2 . b In silico sorting is robust in aneuploid cancer cell lines (e.g., MCF-7) and corrects for the effects of copy-number aberrations (CNAs). Top panel: MCF-7 contains numerous CNAs with a wide range of copy-number values. Each dot represents the sum of read counts across all in silico assigned S-phase cells in fixed 20kb windows. Bottom panel: Using in silico assigned G1/G2 cells to define genome windows reveals more subtle copy-number fluctuations corresponding to replication timing. Each dot represents the sum of read counts across all in silico assigned S-phase cells in G1-defined windows. Figure 2a for all additional cell lines. Top panels: aggregate S/G1 replication timing profiles from in silico sorting of single cells recapitulates the bulksequencing replication timing profile for the same cell line, in all cell lines. Bottom panels: single cell replication state in fixed 20kb windows. Each row represents a single cell.

Supplementary Figure 7. Replication state inference for thousands of single cells across human lymphoblastoid cell (green), embryonic stem cell (blue), and cancer-derived cell (pink) lines. As in
Supplementary Figure 8. Cell-type-and cell-line-specific differences in the aggregate replication timing profiles are reflected at the single cell level. The replication timing profiles shown are S/G1 bulk-sequencing profiles. a Celltype differences in replication timing between LCL and ESCs are observed at the single cell level. Left column: an ESC-specific peak in the bulk profile ~196.1Mb corresponds to a region of consistent replication initiation across all three ESCs, which is absent in GM1278 and GM12892 and fires sporadically in GM12891. Likewise, an LCL-specific peak in the bulk profile ~198.5Mb corresponded to a region of consistent replication initiation in LCL but not in ESC. Right: a second example of an LCL-specific region of replication initiation. b Differences in replication timing unique to individual cancer cell lines are consistent at the single-cell level. Left column: an MCF-7-specific peak ~81.5Mb is observed at the single cell level only in that cell line. To clarify that this initiation site is not observed in H1, a neighboring peak in ESC ~81.9Mb is also indicated with a vertical black line. Right column: an RKO-specific peak in the bulk profile is observed to be unique to RKO single cells as well.
Supplementary Figure 9. The number of initiation region (IR) calls increases non-linearly with increasing number of cells analyzed. GM12878 cells were down-sampled (n = 250; 500; 1,000; 2,000) and initiation regions were called from each sample (black circles). Down-sampling was performed five times for each sample size to account for stochastic effects of small sample sizes. The observed number of IRs in each cell line is shown for comparison (colored dots).
Supplementary Figure 10. Initiation regions (IRs) fire in a similar, but not fixed order across cell lines. a The distribution of IR widths is shown for all IRs genome-wide in each cell line under study. With the exception of HCT-116, where only 758 larger IRs were called, the distribution of IR widths was similar across cell lines and cell types, despite the noted cell-type specific difference in IR location and the ~10x reduction in sample size relative to GM12878. b IRs primarily fire within the expected order (light and dark gray), but sometimes fire earlier (red) or later (blue) than expected by a strictly determined ordering. There are no observed cell-type-specific differences. Compare to Figure 4c. c Variability in IR firing order is not uniform across S phase, but least pronounced at the beginning and end of S phase. This trend is not apparent in HCT-116 where almost all called IRs were early firing. Compare to Figure 4d. GM18507 cells were sequenced after DLP+ library preparation 2 .
Supplementary Figure 11. Initiation regions (IRs) with early replication timing in aggregate tend to complete replication in early S phase, whereas IRs with late replication timing in aggregate tend to fire across a wider range of S phase. a The range of IR firing was more constrained in IRs with early aggregate replication timing and less constrained in IRs with late aggregate replication timing. With a smaller sample size, we were able to call fewer IRs and we were able to corroborate the early S-phase firing of many IRs in a second cell. This produces a more constrained range in the additional cell lines (most pronounced in HCT-116). A random subsample of ¼ of the IRs is shown for GM18507, where the large number of IRs called obscures the trend. b Most IRs were observed to fire in early S phase in at least one cell. Each panel shows the cumulative percent of IRs that fire earliest before a given time in S phase. Across cell lines, 80-97% of IRs were observed to first fire in a cell <50% replicated. c Distribution of late IR classes across the additional cell lines. Although the smaller sample sizes resulted in calling about half as many IRs as in GM12878, a class of IRs replicating throughout S phase was observed in almost every cell line. The effects of sampling are most pronounced in the inflated proportion of IRs categorized as "early". GM18507 cells were sequenced after DLP+ library preparation 2 .