Binless normalization of Hi-C data provides significant interaction and difference detection independent of resolution

Chromosome conformation capture techniques, such as Hi-C, are fundamental in characterizing genome organization. These methods have revealed several genomic features, such as chromatin loops, whose disruption can have dramatic effects on gene regulation. Unfortunately, their detection is difficult; current methods require that users choose the resolution of interaction maps based on dataset quality and sequencing depth. Here, we introduce Binless, a resolution-agnostic method that adapts to the quality and quantity of available data, to detect both interactions and differences. Binless relies on an alternative representation of Hi-C data, which leads to a more detailed classification of paired-end reads. Using a large-scale benchmark, we demonstrate that Binless is able to call interactions with higher reproducibility than other existing methods. Binless, which is freely available, can thus reliably be used to identify chromatin loops as well as for differential analysis of chromatin interaction maps.

Unfortunately, however, the manuscript is very poorly written, which makes it impossible to assess precisely how the algorithm works. The performance of the method is not benchmarked against any published alternative algorithm, the data presented are not convincing, and there are a number of unexplained (or incorrect) assumptions. The analysis needs to be fundamentally improved, and the manuscript entirely rewritten, to be understood (and to make a real impact in the field).
Major points: 1. It is unclear how robust (or useful) the first two diagnostics (related to sonication and DNA degradation), which are based on the proposed novel read classification, actually are. The authors only show examples of 'good' Hi-C libraries in Figure 2A-B; it would be useful to show the same graphs for poorer-quality libraries, in order to give a sense of the effects that can be observed/distinguished and how they relate to library quality.
2. I have similar doubts about the third diagnostic (relating to the "ligation ratio"). First, with reference to Figure 1B: what is the difference between 'dangling' and 'random' reads? I would have thought that what is called 'random' here is what is typically called 'dangling' in other algorithms used to process Hi-C reads (see for example Ref. 29); have I misunderstood? Second, if the ligation ratio correlates well with the percentage of cis interactions, one would be tempted to use the latter as a measure of ligation efficiency, as is currently done. I am not sure I understand the benefit of the new read annotation at this point.
3. It is unfortunately very difficult to understand how the simultaneous bias removal and signal detection work in the Binless method (and, to be honest, how the entire package works!). The authors should consider rewriting the "Binned normalization" section entirely, so that it contains factual elements that would help the reader understand what exactly their algorithm does. In the current wording, it is not even clear what "signal detection" means to a non-Hi-C specialist. In addition, the example provided in Figure 3 does not help in understanding what the algorithm does. Does this synthetic map behave like an actual Hi-C map in terms of polymer scaling? If not, how can it be used to benchmark the performance of an algorithm designed to analyze data that follow a power-law scaling? I had a hard time understanding what the results shown in Figure 3 mean, and why one should consider the 'Binless plus signal detection' results better than the others. I see what this points at, but if the aim is to show that this last variant of the algorithm is able to pick up the off-diagonal interaction without making any assumptions about equal coverage, this should definitely be better explained and substantiated with more realistic examples. 4. Figures 4-6 are not easy to interpret because of the choice of color code. Also, it would be beneficial to enlarge the plotted region to emphasize that we are looking at a single TAD. Finally, chromosome numbers are missing and it is not clear what 'TBX3 locus' means: should I interpret it as the locus of the TBX3 gene in human cells? 5. Figure 4D (referenced in the text) is missing! 6. What does 'statistical significance' mean in the context of a fused lasso regression? 7. More generally, and along the same lines as point 3, the 'Binless detection' section does not explain exactly what the algorithm does; it seems more concerned with extolling its merits. For example, one key (and very clever!)
idea here seems to be that binless matrices might be a much better way to plot Hi-C data and to assess robust features that stand out against the random polymer behavior. But how are these matrices built? Do they emerge as a consequence of the fused lasso regression, or are they an independent feature implemented by the code? This is incomprehensible in the text and deserves a much clearer explanation. 8. How are spurious peaks defined, and based on which criteria are they removed by Binless? 9. A measure of the "dramatic increase in detection sensitivity" of Binless is missing: please provide a benchmark of the code's performance.
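To make point 6 concrete: a fused lasso penalizes differences between adjacent estimates, which yields piecewise-constant fits rather than estimates with standard sampling distributions, so it is genuinely unclear what a p-value would mean here. A minimal 1D sketch of such a fit (generic total-variation denoising solved by projected gradient on the dual; this is an illustrative stand-in, not the Binless implementation):

```python
import numpy as np

def fused_lasso_1d(y, lam, n_iter=5000):
    """Minimize 0.5*||y - b||^2 + lam * sum_i |b[i+1] - b[i]|.

    Solved via projected gradient on the dual variable u (one entry
    per adjacent pair), with the primal recovered as b = y - D^T u.
    """
    y = np.asarray(y, dtype=float)
    u = np.zeros(len(y) - 1)   # dual variable, one per jump
    step = 0.25                # 1 / upper bound on ||D D^T||
    for _ in range(n_iter):
        b = y + np.diff(u, prepend=0, append=0)       # primal b = y - D^T u
        u = np.clip(u + step * np.diff(b), -lam, lam)  # project onto box
    return y + np.diff(u, prepend=0, append=0)

# A noiseless two-block signal: the fit stays piecewise constant,
# with the jump shrunk by the penalty.
b = fused_lasso_1d([0, 0, 0, 5, 5, 5], lam=1.0)
```

The point of the sketch is that the output is a shrunken, piecewise-constant estimate; declaring parts of it "statistically significant" requires extra machinery that the manuscript does not describe.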

10. How are "significantly enriched" loops defined? I could not find any description of the criteria used. Also, and importantly: Binless is claimed to be a much better alternative to current loop-calling algorithms, but it seems that some additional criterion (external to Binless) is needed to identify the loops in the binless matrices. If this is not the case, the authors should absolutely clarify this point. Otherwise, how can "six loops" be defined in Fig. 7? 11. Figures 6A and 6B are not referenced in the text! 12. How are "significant differences" between replicates defined? 13. How are "false discoveries" defined in Table 1? With respect to which criteria?
Minor points: 1. Introduction: Genes within TADs do not tend to be 'co-expressed' but to be co-regulated during cell differentiation or external stimulation. 2. It is not clear from the introduction why loop calling is problematic in general. 3. What are counter-diagonal biases in raw Hi-C matrices? 4. Figure 1B: arrows are too small and colors too similar to really understand what is plotted near the diagonal. What is the difference between 'other' and 'random' reads? 5. Figure 4A-B: what is the bin size there? Which threshold is chosen to call significant interactions?
Reviewer #3 (Remarks to the Author): Spill and collaborators introduce a new method to represent, normalize and detect chromatin conformation interactions that does not require binning of the genome. Thus, compared to available methods, the so-called 'binless' method has the advantage of being resolution-free. The authors claim that their method can better normalize the inherent biases of Hi-C data. The authors also provide a method to detect Hi-C features, such as loops or TADs, that is likewise resolution-free.
In my opinion, the paper explores a novel way to treat Hi-C data that I think will be useful for researchers working in the field, especially for cases in which the Hi-C reads have been enriched in some way. However, the manuscript requires further work to present the method and the results more clearly. Below are some general suggestions to improve the manuscript, and some questions.
1. The proposed normalization seems impractical for almost all real use cases, as it is limited to 100 restriction sites (even bacterial genomes have thousands of restriction sites). For which applications is this normalization useful? 2. The fast approximation aims to overcome the limitations of the exact method. But does this approximation also have limitations? Can you provide some information on the time and memory required to run Binless on some genomes (e.g. human, mouse, worm, fly or yeast)? I presume that the results presented in the paper used this approximation; you should make this clear from the beginning.
3. The first paragraph of the discussion offers a clearer summary and justification for the paper than the introduction. I would suggest moving it to the beginning. In contrast to the introduction and results sections, the discussion is clearer about what I think is the main message of the paper: a novel normalization method and a novel feature-detection method (fused lasso).
4. The first section of the results (Base-resolution view of Hi-C data) is distracting with respect to the main message. Although I like how the authors represent the Hi-C data using arrows, this representation is orthogonal to the main message and is not used afterwards. What is important is the classification of pairs, which is then used for the normalization. My suggestion is to reduce this part and focus on the next section. Put the figures in the supplement and use Figure 1 to explain the normalization method and the Hi-C pair classification strategy. Similarly, Figure 2 looks like standard quality control and does not merit being a main figure. 5. Since this is a methods paper, the 'Binless normalization' section should contain more details to justify the method; most of the current methods section should be part of it. Hopefully, the authors can add a visual description of their method to help the reader understand its merits. I found Supplementary Figure 3 quite relevant to the method, yet it is not a main figure. 6. The authors say that they build upon the HiCNorm negative binomial regression framework. However, since the paper relies heavily on negative binomials, I find it important to offer a clear justification for the use of this distribution. 7. Something I did not understand is why the authors have a full section called 'Binned detection' (with Figure 4), only to then argue that binned detection lacks sensitivity and should be replaced by their binless detection. I think this section needs to be modified or removed, and the binless detection highlighted and described in more detail instead.
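On point 6, the usual justification for a negative binomial over a Poisson is overdispersion: Hi-C count variance typically exceeds the mean, whereas a Poisson forces them to be equal. A quick method-of-moments check illustrates the idea (the counts below are made up for illustration):

```python
import numpy as np

def nb_dispersion(counts):
    """Method-of-moments overdispersion estimate.

    Under a negative binomial, Var = mu + alpha * mu^2 with alpha > 0;
    a Poisson model would force alpha = 0. A clearly positive alpha
    argues for the negative binomial.
    """
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean()
    var = counts.var(ddof=1)
    return (var - mu) / mu**2

# Hypothetical per-bin contact counts (illustrative only)
counts = [0, 3, 1, 12, 0, 7, 2, 25, 1, 4]
alpha = nb_dispersion(counts)
```

A short argument of this kind, applied to the authors' actual data, would answer the question directly.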
Other points 1. In the results section, the authors define the LR ratio; however, this definition seems different from the LR ratio definition in the methods. Furthermore, the authors use 'reads' in reference to their representation of Hi-C data pairs. But since 'reads' is normally used in the context of NGS sequencing data, phrases like 'reads close to the diagonal' are not obvious to understand. I suggest that the authors use some other name, such as 'Hi-C pairs'.
2. The reason for the classification of Hi-C pairs is not entirely clear to me. There is a more or less standard classification of reads that can be seen, for example, in Figure 4 of the HiC-Pro publication in Genome Biology, and that is also present in the quality control of many Hi-C publications; read classification is therefore nothing new. The class 'contact far', following the example in Figure 1D, contains inward-facing reads that are separated by more than one restriction site (I assume, as this is not clear). They could be separated by a few bp or by millions of bp, so the 'far' description is relative. Similarly, the class 'contact close' seems to contain outward-facing reads (Fig. 1D). If these are flanked by two neighboring restriction sites, they are usually classified as 'self-circles', but not all outward-facing reads are self-circles or linearly close. I presume that 'self-circles' have their own category, based on Fig. 1B. Finally, if both reads map to the forward strand they are classified as 'contact up', and if both map to the reverse strand they are classified as 'contact down'. In general, why not call these pairs 'inward', 'outward', 'both forward', 'both reverse', or something along these lines that makes it clear what they are? Also, in Figure 1, panel D should come first, as it explains the classification; then panel C, which shows how the arrows are placed; then panels A and B.
In the supplementary methods, negative binomials are part of the model for c_far, c_close, c_up and c_down. However, I don't understand why those classes need to be treated differently: Hi-C pairs separated by some kb can be in any orientation with respect to each other, given the stochasticity of the ligation events. Maybe the authors can justify their classification better.
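The orientation-based naming suggested above is simple enough to state as code (a hypothetical sketch of the standard convention, not taken from Binless):

```python
def classify_pair(pos1, strand1, pos2, strand2):
    """Classify a same-chromosome Hi-C pair by read orientation.

    Reads are first ordered by genomic coordinate; the class then
    depends only on the two strands ('+' forward, '-' reverse).
    """
    (p1, s1), (p2, s2) = sorted([(pos1, strand1), (pos2, strand2)])
    if s1 == "+" and s2 == "-":
        return "inward"        # reads face each other
    if s1 == "-" and s2 == "+":
        return "outward"       # reads face away (self-circle candidates)
    return "both forward" if s1 == "+" else "both reverse"
```

Under this naming, 'contact far'/'contact close'/'contact up'/'contact down' would presumably map onto inward/outward/both-forward/both-reverse, which is exactly the correspondence the manuscript should spell out.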
3. In the supplement, the description of the 'Exact Model' does not explain what RJ, DL and DR are. The authors should make an effort to explain their model, as this is the core of the manuscript. 4. In the binless normalization section, the authors write: "at constant digestion rate, the number of dangling reads would drop with increase efficiency of ligation". As far as I know, the digestion rate applies to restriction enzymes; I think what the authors mean is a constant ligation rate. Please revise.
5. In the same section: "First, normalization is performed prior to binning the data". Does this mean the normalization is applied per read (Hi-C pair)? Or is it done per restriction site or per restriction fragment? 6. The 'equal visibility assumption' is not used by the authors, but this choice is not well justified. I can see that in methods like HiChIP or Capture-C, where certain regions are enriched, the 'equal visibility assumption' does not hold; otherwise, this assumption is well justified. The use of fake data to demonstrate the invalidity of the equal visibility assumption does not seem fair, because the construction of the fake data can be adjusted to suit the argument. 7. In the 'Binned detection' section, the authors say "the polymer effect must be accounted for". For clarity, the authors should explain that they refer to the decay of contact frequency with genomic distance; 'polymer effect' could refer to a number of other unrelated things.
8. In the binless detection section, the authors say: "a binless matrix is a matrix whose bins adapt to the size of the features detected". I find it confusing that a binless matrix has bins. Also, the matrix representation used by the authors is not described: what exactly is the binless matrix? Is the fused lasso regression applied to this matrix? 9. The methods section says that the input for the Binless software is the reads intersection file of TADbit, but the 'Figure and table generation' section says that the input for Binless is mapped paired-end reads (presumably BAM files). Please revise which is correct.
10. The methods say: "Cis-trans ratio: Number of filtered reads arising within chr1, divided by total number of filtered reads with one end mapping to chr1". Should this be "with one end mapping to chr1 and the other end mapping to another chromosome"? In general, for clarity, I prefer the terms intra-chromosomal and inter-chromosomal contacts, to avoid confusion.
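To show why the definition is ambiguous, here is the ratio as I read the authors' wording, as a hypothetical helper (the chromosome name and the pair representation are assumptions for illustration):

```python
def cis_ratio(pairs, chrom="chr1"):
    """Fraction of intra-chromosomal (cis) pairs among all filtered
    pairs that have at least one end mapping to `chrom`.

    `pairs` is a list of (chrom_end1, chrom_end2) tuples.
    """
    touching = [(a, b) for a, b in pairs if chrom in (a, b)]
    if not touching:
        return float("nan")
    cis = sum(1 for a, b in touching if a == b == chrom)
    return cis / len(touching)
```

Whether the denominator should instead count only pairs whose other end maps to a different chromosome (the alternative reading above) changes the value, which is why the authors should state the formula explicitly.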