High-quality genome (re)assembly using chromosomal contact data

Closing gaps in draft genome assemblies can be costly and time-consuming, and published genomes are therefore often left ‘unfinished.’ Here we show that genome-wide chromosome conformation capture (3C) data can be used to overcome these limitations, and present a computational approach rooted in polymer physics that determines the most likely genome structure using chromosomal contact data. This algorithm—named GRAAL—generates high-quality assemblies of genomes in which repeated and duplicated regions are accurately represented and offers a direct probabilistic interpretation of the computed structures. We first validated GRAAL on the reference genome of Saccharomyces cerevisiae, as well as other yeast isolates, where GRAAL recovered both known and unknown complex chromosomal structural variations. We then applied GRAAL to the finishing of the assembly of Trichoderma reesei and obtained a number of contigs congruent with the know karyotype of this species. Finally, we showed that GRAAL can accurately reconstruct human chromosomes from either fragments generated in silico or contigs obtained from de novo assembly. In all these applications, GRAAL compared favourably to recently published programmes implementing related approaches.

The following provides a more detailed description of the algorithm implemented in GRAAL. 100

101
The description of GRAAL can be divided into two main components: (i) the probabilistic model 102 that assigns a likelihood to a given linear (one-dimensional) genome structure given a specific 103 contact/Hi-C data set, and (ii) the sampling algorithm used to explore the space of linear genome 104 structures (and nuisance parameters). We consider the genome assembly problem as a Bayesian inference problem, taking inspiration 110 from previous work in protein structure determination 3 . In its simplest form, the Bayes rule reads: 111 where G denotes the linear genome structure to be determined, D is the Hi-C data set (both will 112 be defined more precisely below), ( ) is the probability density of A conditioned on B, and 113 indicates proportionality. Our goal is to determine, or at least approximate, the posterior 114 probability ( ). The above formula provides a means to compute this probability density (up 115 to a normalizing factor) given a probabilistic data generation model, ( ) (called likelihood) 116 and data-independent assumptions about the structure, encapsulated by the prior probability 117 ( ). 118

119
In practice, our data generation model involves several parameters (called nuisance parameters) 120 that are not known a priori (see below). Therefore, we include these parameters, collectively 121 noted as , in the Bayesian formulation, yielding: 122 where for the latter identity we assumed statistical independence of the genome structures G and 123 the nuisance parameters . 124 We next assume that in absence of data, all possible genome structures and nuisance parameters 126 are equally probable, i.e. that ( ) and ( ) are constants (flat priors). With these assumptions 127 the Bayes rule reduces to: 128 To compute the likelihood ( ) we need a data generation model that relates the contact 130 frequencies measured by the Hi-C experiment to an assumed linear genome structure and the 131 nuisance parameters. First, we define G as an unordered set of N contigs : 139

{ }
If the genome is perfectly assembled, each contig corresponds exactly to a single chromosome. 140 Hi-C reads are mapped to restriction fragments defined by the restriction enzyme cutting sites. 141 We therefore consider the restriction fragment as the elementary units of a genome assembly. 142 However, many operations performed by GRAAL are not applied to individual restriction 143 fragments, but to ordered sets of p consecutive fragments, which we call 'bins' and note : 144 Whenever possible, we choose where is a single user-defined constant typically set to 145 3. However, if the number of restriction fragments in a contig is not a multiple of , then some 146 bins will consist of fragments. 147 We define a contig as an ordered sequence of bins, noted: 148 is the number of bins in contig and is 149 The subscript in is used to indicate that the bins rely on an initial assumed set of contigs . We also introduce the two functions ( ) and ( ) 151 Next, we define ( ) as the genomic distance (in units of base pairs) between two bins. This 153 distance is obviously only defined for bins belonging to the same contig, i.e. for ( ) ( ). 154 Hi-C/3C-seq contact data: 157 The chromosome contact data used by GRAAL are obtained after mapping the Hi-C/3C-seq 158 reads to an initial set of fragments . We define D as the matrix whose entries 159 are the number of Hi-C3C-seq reads pair mapped to each pair of fragments 160 ( ). Please note that, although the sampling algorithm of GRAAL (described below) 161 manipulates the genome structure at the level of super-contigs, the likelihood will always be 162 evaluated by considering the contact data at the resolution of individual fragments. We now need a means to relate the probability of the matrix D to the assumed linear genome 167 structure G. Our first step is to relate the probability of each entry to the contact probability 168 between fragments and , which we note . Since results from a counting process, its 169 probability can be modeled as a Poisson distribution: 170 where is the total number of independent counts. Although is strictly speaking also a 171 random number (which depends chiefly on sequencing depth and criteria used to validate the read 172 pairs), for simplicity we treat it as a constant and simply set: ∑ ∑ . We further 173 assume that contacts between distinct pairs of fragments are independent from each other, such The contact probabilities between loci on distinct chromosomes (trans contacts) are less 191 amenable to simple theoretical predictions and arguably more sensitive to biological specificities 192 such as organism and cell type. They are also on average much weaker than cis contact 193 probabilities. For simplicity and generality, we therefore simply assume that trans contacts have 194 the same probability as long-range cis contacts: 195 Our probabilistic model is therefore characterized by only 3 parameters, collectively noted as : ).
Because these parameters cannot reliably be predicted a priori, they will be sampled together 198 with G as will be detailed below. Different initializations can be considered for the initial set of contigs depending on the 229 availability of a preliminary assembly of the organism under study, or of a related genome. The 230 initial set of contigs does not need to be perfect, since GRAAL can split and rearrange incorrectly 231 assembled contigs. However, in our current implementation, the restriction fragments and the 232 bins , whose definition depends on , cannot be broken. In our paper, we considered the 233 following different types of initializations: 234  The reference budding yeast genome (16 chromosomes; GCF_000146045.1) was used as 235 validation data, since a high quality assembly of this genome is already available. In order 236 to simulate an incomplete assembly of this genome, we split the genome into 1,086 237 bins (of approximately 11Kb) to initialize GRAAL. presenting little or no read coverage. If reads appears sparse along a RF compared to 251 the distribution of read coverage over the entire population of RFs, the RF was discarded. 252 If the entire contig appeared undercovered, it was therefore discarded. A similar filtering 253 step is used by dnaTri2. We then split the remaining 2,917 contigs into bins of 3 RFs. 254 As a general strategy to complete the assembly of an imperfectly assembled genome, we 255 recommend starting from the existing contigs and splitting them into bins as illustrated here for 256 259

B.2. Initialization of the nuisance parameters, 260
The initial values of the parameters ( ) are obtained based on the Hi-C data D and 261 the initial genome structure as follows: 262 First, the initial value of is set to the contact probability averaged over all pairs of bins 263 belonging to different contigs, i.e.: 264 where if i=j and otherwise. 265 where H is the Heaviside function ( ( ) for and ( ) otherwise) . We then 270 estimate the initial values of and by least squares fitting of (Eq1) to , i.e.: 271 This minimization is performed using a quasi-Newton method 12 . We now introduce some notations that will be important for the following section. First, we call 359

{
} the set of all 9 mutations and use the generic notation 360 with { } for individual members of this set (for example, is the paste mutation. We 361 point out that each of the 9 mutations can be defined by either one (mutations ) or 362 two indices of bins (mutations , ) and, for some mutations, one or two auxiliary binary 363 parameters . To formally note the parameters of an arbitrary mutation in , we can 364 therefore use the notation: ( ), where it is understood that j is relevant only for mutations 365 . Note that this measure is quite sensitive to assembly errors, since any 453 displacement of a bin from its true position (irrespective of the magnitude of this 454 displacement) and any incorrect orientation will increase E. 455