Improving landscape inference by integrating heterogeneous data in the inverse Ising problem

The inverse Ising problem and its generalizations to Potts and continuous spin models have recently attracted much attention thanks to their successful applications in the statistical modeling of biological data. In the standard setting, the parameters of an Ising model (couplings and fields) are inferred using a sample of equilibrium configurations drawn from the Boltzmann distribution. However, in the context of biological applications, quantitative information for a limited number of microscopic spins configurations has recently become available. In this paper, we extend the usual setting of the inverse Ising model by developing an integrative approach combining the equilibrium sample with (possibly noisy) measurements of the energy performed for a number of arbitrary configurations. Using simulated data, we show that our integrative approach outperforms standard inference based only on the equilibrium sample or the energy measurements, including error correction of noisy energy measurements. As a biological proof-of-concept application, we show that mutational fitness landscapes in proteins can be better described when combining evolutionary sequence data with complementary structural information about mutant sequences.

• Multiple sequence alignements We used multiple sequence alignments (MSA) of sequences belonging to the Pfam Beta-lactamase2 family PF13354 for the inference of TEM-1 mutational landscape (resp. PDZ domain family PF00595 for PSD-95 mutational landscape) [30]. Raw data have been processed as described in [18]: we have used HMMer [29] to search against the Uniprot protein sequence database (version of March 2015); from the resulting alignments we have removed all sequences with more than 5 gapped positions; in order to take into account phylogenetic correlations and sampling biases we have associated to each sequence s m the following statistical weight: with d mm being the Manhattan distance (number of mismatches) between sequences s m and s m and Θ being the Heaviside step function whose value is zero for negative argument and one for positive argument. The reweighting threshold is set to ϑ = 0.8 as usually done in DCA inference [32]. The resulting MSA are N = 197 (resp. N = 81) sites long, and contain M = 2462 (resp. M = 3287) homologous sequences (average sequence identity ∼ 20% with the wild-type sequences).
• Computational predictions of thermodynamic stabilities We used ∆∆G-predictions for all single amino acid mutations from the wild-type sequences as estimated by PoPMuSic web-based tool [31] taking as reference structure the PDB: 1M40 for TEM-1 (resp PDB: 1BE9 for PSD-95).
• Experimental fitness measurements -TEM-1. Experimental measurements from [23] provide estimations of amoxicillin Minumum Inibitory Concentration MIC for n = 742 distinct single amino-acid mutations falling inside the part of the sequence covered by the corresponding Pfam model. Following [18] we have defined the experimental fitness as the logarithmic average over all MIC measurements of a specific mutant sequence (whenever multiple measurements for the same sequence were available).
-PSD-95. In [22] the effect of each single point mutations has been quantified as the log frequency of observing each amino acid β at each position i in the selected (sel) versus the unselected (unsel) population, relative to amino acid s i present in the wild type:

Potts Hamiltonian
The relevant description in the statistical modeling of protein sequences is a 21-state Potts Model, since each variable s i , i = 1, ..., N, can now assume 21 states (20 amino acids, one alignment gap).
The following Potts Hamiltonian provides the energy of any amino-acid sequence s = (s 1 , ..., s N ), where fields h i and couplings J ij are now respectively 21-dimensional vectors and 21×21-dimensional matrices. We fix the gauge freedom (cf.
[1]) using, for all i, j, the lattice-gas conditions h i (s ref is the reference wildtype sequence, for which mutations shall be predicted.

Fitness-energy mapping
Since model energies H(σ a ) can be in any non-linear relation with measured fitnesses φ a we replace Eq. (9) with: where p i,α (J, h) (resp. π emp i,α ) are the single marginals in the model (resp. in the empirical sample) specyfing the frequency of observing amino-acid α at position i, p ij,αβ (J, h) (resp. π emp ij,αβ ) are the pairwise marginals, and µ(x) is the robust monotonous non-linear mapping from energy to fitness introduced in [18]. The mapping is obtained by first sorting the model energies and experimental fitnesses and then associating the n th smallest energy H(n th ) with the n th highest experimental fitness value φ(n th ), µ H(n th ) = φ(n th ) .
Note that the second terms in Eq. (4) vanish when model energies are exactly in the inverse order of experimental fitnesses. We subsequently used the mapping to compute linear correlations between the mapped energies µ H(σ a ) and the experimental fitnesses φ a , resulting in non-linear rank correlations between experimental fitnesses and model energies.