Order and stochasticity in the folding of individual Drosophila genomes

Mammalian and Drosophila genomes are partitioned into topologically associating domains (TADs). Although this partitioning has been reported to be functionally relevant, it is unclear whether TADs represent true physical units located at the same genomic positions in each cell nucleus or emerge as an average of numerous alternative chromatin folding patterns in a cell population. Here, we use a single-nucleus Hi-C technique to construct high-resolution Hi-C maps in individual Drosophila genomes. These maps demonstrate chromatin compartmentalization at the megabase scale and partitioning of the genome into non-hierarchical TADs at the scale of 100 kb, which closely resembles the TAD profile in the bulk in situ Hi-C data. Over 40% of TAD boundaries are conserved between individual nuclei and possess a high level of active epigenetic marks. Polymer simulations demonstrate that chromatin folding is best described by the random walk model within TADs and is most suitably approximated by a crumpled globule build of Gaussian blobs at longer distances. We observe prominent cell-to-cell variability in the long-range contacts between either active genome loci or between Polycomb-bound regions, suggesting an important contribution of stochastic processes to the formation of the Drosophila 3D genome.


Amplification duplicates removal
In the next step, we performed a correction for amplified duplicates of snHi-C contacts. Standard Hi-C uses amplification by the Illumina PCR protocol with primers that are ligated to the ends of sheared DNA. Thus, two independent Hi-C pairs can be PCR duplicates if their mapping positions coincide (e.g., see hiclib). However, the amplification in snHi-C is followed by sonication, resulting in random breaks of ligated DNA fragments. Hence, coinciding mapping positions cannot be used as a criterion of PCR duplication. Notably, we cannot distinguish the amplified pair contacting restriction fragments from the contacts of the same regions in the homologous chromosomes. Thus, we removed all multiple copies of restriction fragment pairs and retained unique contacts for each combinatorial pair of restriction fragments.

Fragment filtration
In the next step, we used restriction fragment filtration to reduce the possible contribution of copy number variation, read misalignment, and Phi29 DNA polymerase template switch that had not been removed by the ORBITA filter. In theory, each restriction fragment of DNA has two ends and is present twice in the diploid nucleus of ML-DmBG3-c2 Drosophila cells; thus, we expect the upper limit of four unique contacts per restriction fragment if no unannotated genomic rearrangements, mismappings, or template switches occurred. For each restriction fragment, we calculated the observed number of contacts and removed fragments that had more than four contacts. Before contact filtration by this rule, we compared the number of restriction fragments with more than four unique contacts according to ORBITA and one previous approach, hiclib for Flyamer et al. 2017. We obtained datasets for mouse nuclei from Flyamer et al. 2017 andNagano et al. 2017 and mapped with the hiclib and ORBITA pipelines. We found a significant reduction in the number of unique contacts per fragment for snHi-C from Phi29 DNA polymerase datasets (Flyamer et al. 2017, present work], but not for scHi-C without Phi29 DNA polymerase (Nagano et al. 2017) ( Supplementary Fig. 2, 3). Thus, we conclude that ORBITA is an effective approach to reduce the number of snHi-C artefactual contacts arising from random template switches of Phi29 DNA polymerase.

snHi-C interaction map construction
The resulting pair data were binned at 1 kb, 10 kb, 20-kb, 40-kb and 100-kb resolutions with cooler version 0.8.5 and stored in the COOL format. The HiGlass server was used for data visualization.
For bulk BG3 in situ Hi-C (two biological replicates), reads were mapped to Drosophila reference genome dm3 with Burrows-Wheeler Aligner (BWA-MEM, console version 0.7.17-r1188) with default parameters. For consistency with the snHi-C analysis, the resulting BAM files were parsed with pairtools v0.3.0, (https://github.com/mirnylab/pairtools) using default parameters. The resulting files were sorted by the pairtools module "sort"; replicates were merged by the pairtools module "merge" and duplicates were removed, allowing one mismatch between possible duplicates (pairtools dedup with --max-mismatch 1 and -mark-dups options). The resulting PAIRS file was binned with cooler at the same resolutions as the single-cell datasets. To remove the contribution of possible Hi-C technical artifacts, such as backward ligation, dangling ends, self-circles, and mirror reads, the first two diagonals of Hi-C maps were removed. As the last step of bulk Hi-C processing, the maps were iteratively corrected for the removal of coverage bias by the cooler balance tool with default parameters. For the reproducibility control, both replicates were converted to interaction maps independently by the above pipeline. The resulting maps demonstrated a correlation of 0.9-0.95 as estimated by the HiCRep stratum-adjusted correlation coefficient for intrachromosomal maps smoothed with one-bin offset and genomic distance up to 300 kb at 20 kb resolution. TAD calling in snHi-C and bulk BG3 in situ Hi-C data We used Hi-C map segmentation with lavaburst (v0.2.0) (https://github.com/nvictus/lavaburst) with the modularity scoring function for TAD calling in Hi-C maps at 10-kb resolution.
Compartment annotation in snHi-C and bulk BG3 in situ Hi-C For compartment annotation in bulk BG3 in situ Hi-C, we used eigenvector decomposition of cis-interactions maps for each chromosome, as implemented in cooltools call-compartments tool version 0.2.0 (https://github.com/mirnylab/cooltools).

Epigenetic analysis of TAD boundaries
We plotted the ChIP-chip signal around different types of boundaries with pybbi utility (https://github.com/nvictus/pybbi.git) based on UCSC tools.

Visualization of epigenetic states
The visualization was performed using the pymol software v. 2.3.2 (https://pymol.org/2/). 1D epigenetic data were added to the structure as a bead type and represented with a corresponding color. Analysis of different epigenetic states was performed via Python scripts (https:// github.com/polly-code/DPD_withRemovingBonds).

April 2020
Data Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability Field-specific reporting Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative. Reporting for specific materials, systems and methods We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.
Raw and processed snHi-C and bulk BG3 in situ Hi-C data are available in the GEO NCBI under accession number GSE131811.
List of publicly available GEO sources used in this study: GSE122603 (Hi-C for Kc167 and BG3 cell lines for comparison of stable TAD boundaries), GSE58821 (MSL; ChIP-seq), GSE69013 (RNA-Seq).
List of publicly available modENCODE data sources used in this study: total RNA of ML-DmBG3-c2 cell line assessed by RNA tiling array (modENCODE id 713) and the ChIP-chip for MOF (id 3041), BEAF-32 (id 921), Chriz (275) snHi-C libraries were prepared in three independent experiments to ensure the robustness of the snHi-C procedure and reproducibility of the data quality.
We obtained filtered contacts for 88 individual nuclei after the initial round of sequencing. Before the second round of sequencing, we assessed the robustness of the number of unique contacts by subsampling of raw datasets ( Supplementary Fig. 2a). For each library, we created a uniform grid of sequencing depth (from 0 to the resulting number of reads with the step of 100,000 reads). We then randomly selected X reads from the full library and calculated the number of unique contacts (as described above) for each number from the grid X. We repeated this procedure ten times and plotted the mean number of unique contacts for each sequencing depth from the grid. We proposed that there are a significant number of cells containing PCR duplicates and that the number of contacts increases slowly depending on the sequencing depth due to the poor efficiency of the snHi-C protocol. Further sequencing of these cells would result in a relatively small improvement of the detectable number of unique contacts. The number of contacts for other cells increases more rapidly with the number of reads but reaches a plateau once the maximum number of unique contacts is achieved. Thus, additional sequencing of these cells might result in reading duplicated contacts. For other cells, the number of contacts grew slowly with sequencing depth (Supplementary Fig. 2a). However, for all these cells, the number of unique contacts gradually increased with no plateau signature. We selected the cells displaying the best growth of the number of contacts, indicative of the good quality of the dataset. The top 20 cells by the number of unique contacts were subjected to an additional round of sequencing. The same mapping and parsing pipeline was used for these datasets. Technical replicates (initial and additional rounds of snHi-C libraries sequencing) were merged at the annotated PAIRS file stage.
All three batches of snHi-C experiments were performed independently The samples were not randomly allocated into experimental groups because every sample consisted of cells of the same cell type taken into experiment from the same biological and growth conditions.
The investigators were not blinded during data collection and analysis because the snHi-C experiments were performed on the untreated cells grown in the same biological and culture conditions.