Prediction of protein–ligand binding affinity from sequencing data with interpretable machine learning

Protein–ligand interactions are increasingly profiled at high throughput using affinity selection and massively parallel sequencing. However, these assays do not provide the biophysical parameters that most rigorously quantify molecular interactions. Here we describe a flexible machine learning method, called ProBound, that accurately defines sequence recognition in terms of equilibrium binding constants or kinetic rates. This is achieved using a multi-layered maximum-likelihood framework that models both the molecular interactions and the data generation process. We show that ProBound quantifies transcription factor (TF) behavior with models that predict binding affinity over a range exceeding that of previous resources; captures the impact of DNA modifications and conformational flexibility of multi-TF complexes; and infers specificity directly from in vivo data such as ChIP-seq without peak calling. When coupled with an assay called KD-seq, it determines the absolute affinity of protein–ligand interactions. We also apply ProBound to profile the kinetics of kinase–substrate interactions. ProBound opens new avenues for decoding biological networks and rationally engineering protein–ligand interactions.

'Count [ {"function": "optimizerSetting", "lambdaL2": "likelihoodThreshold": 0.0002 }, {"function": "addTable", "leftFlank": " ACACT "variableRegionLength":30," Supplemental Text -Pipeline.pdf Re-export This server, which can be accessed for academic use via probound.bussemakerlab.org, takes two types of files as input: a single JSON-formatted configuration file and one or more 'count tables' (see below). Upon completion, the server returns a webpage containing sequence logos and a JSON-formatted model file (see figure above). This model file is in the same format as those found on MotifCentral.org, our resource of high-quality pre-computed ProBound models used in this study. These JSON-formatted model files can be used by ProBoundTools.jar, a lightweight Java tool for computing affinities for arbitrary sequences. ProBoundTools.jar is available both on MotifCentral and at github.com/BussemakerLab. Count tables are tab-separated text files (TSVs) where every row contains the sequence and count of every observed probe in the library (i.e. the number of times each library sequence was observed in the data) as it progresses through the various selection rounds in a specific assay. Examples of count table layouts for some of the assay designs in this paper are shown in the table below. It should be pointed out that any integrative analyses -such as the creation of consensus binding models, co-factor interaction models or methylation-aware binding models -require multiple count tables corresponding to individual assays. For instance, constructing meCpG methylation-aware binding models requires two count tables: one for the methylated selection round and one for the unmethylated selection round. To make it easier to generate count tables from raw sequencing data, we have provided an easy-to-use Python script called buildCountTable.py. This script, which is included in the GitHub repository, takes sequence lists as input and returns correctly formatted count tables that can be used by the compute server. The JSON-formatted configuration file contains commands, called functions, that tell ProBound how the input data is structured and how to configure the various model components (i.e. how many binding modes to use, whether or not the binding modes should interact, the kind of selection process that generated the data, etc.). Functions can have attributes that further define these settings, and the configuration file conforms to the following template: The remainder of the document contains detailed descriptions of all the configuration file functions and their attributes, the structure of the JSON-formatted output file, and the configuration files that were used for the fits presented in the main text.

Configuration File Functions
The functions can be grouped into five categories: • gamma (list of floats): Vector containing initial γ r values for every round r in rhoGamma models enrichmentModelConstraints Function that defines the constraints on and optimization process of enrichment model parameters • fitRho (boolean, default is false): For rhoGamma models, should ρ be optimized?
• roundSpecificRho (boolean, default is true): For rhoGamma models, can each ρ r have an independent value?
• roundSpecificGamma (boolean, default is true): For rhoGamma models, can each γ r have an independent values?
• trySaturation (boolean, default is false): For SELEX models, check if setting binding Saturation=true improves the model.

Binding modes
addBindingMode Function that adds a binding mode and assigns it a running index, starting at 0 • size (integer, required): The width of the binding mode • flankLength (integer, default is 0): Distance into the fixed flanking region that is scored by the binding mode • dinucleotideDistance (integer, default is 0): Maximum distance between the two letters of the dimer sequence features that are included in the ∆∆G model. 0 inactivates the dimer features, 1 includes only adjacent letters, such as CG.
• singleStrand (boolean, default is false): Only score and include the forward strand in Z bound .
addNS Function that adds a non-specific binding mode (shorthand function for adding a mode with size=0) bindingModeSeed Function that defines the binding mode parameter seeds • index (integer): Index of the binding mode to be seeded; if no index is provided, seeding is applied to all modes • mononucleotideIUPAC (string): Seeds the binding mode to recognize sequences consistent with an IUPAC string. At each position, matches get β a,φ = 0 and mismatches get β a,φ = −1.
• mononucleotideString (string): Seeds the binding mode to recognize sequences consistent with a string. At each position, matches get β a,φ = 1 and mismatches give β a,φ = 0. The period character (.) is a wildcard and matches any letter.
bindingModeConstraints Function that defines the constraints on and the optimization process of binding mode parameters • index, integer, required: Index of the binding mode that will be manipulated • symmetryString (string, default is null): String that defines a symmetry on the binding mode. Two formats are possible: -The first format specifies a symmetry by using letters and digits to identify equivalent positions in the binding mode. Upper and lower case letters are related through complement and digits are self-complementary. For example, the string ab1BA specifies a reverse-complement symmetric binding mode of size five. Here complementarity relates a ↔ A, b ↔ B, and 1 ↔ 1. The string ab1BAab1BA specifies a 10bp binding site with a tetrameric symmetry. The pipe sign (|) is a barrier for dinucleotide interactions. This divides the binding mode into regions and removes dinucleotide interactions that connect different regions.

4/17
-The second format specifies a sequence of blocks that together fill in the binding mode. Each block is assigned an ID number and two block with the same ID have identical sequence recognition. A block with a negative ID is the reverse complement of a blocks with same but positive ID. Each block can be constrained to be reverse complement symmetric. For example, the symmetry string: 1:6:True corresponds to a 6bp reverse-complement symmetric block, 1:3:False,1:3:False corresponds to two concatenated 3bp blocks in head-to-tail configuration, 1:3:False,-1:3:False corresponds to a two 3bp blocks in the head-tohead configuration. Recognition of dimer sequence features that span blocks are prohibited.
Note that the footprint of a binding mode cannot be modified if a symmetry is specified since the expanded binding mode would no longer have the size specified by the symmetry string.
• roundSpecificActivity (boolean, default is true): Can binding mode activities have independent values across rounds (count table columns)?
• experimentSpecificActivity (boolean, default is true): Can binding mode activities have independent values across experiments (count tables)?
• experimentSpecificPositionBias (boolean, default is true): Can position bias parameters have independent values across experiments? Must be true if experiments have different probe lengths.
• optimizeSize (boolean, default is false): Should the size of the binding mode be optimized? If true, the binding mode is (separately) expanded to the left and to the right, the model parameters are re-optimized, and the expanded binding mode is kept if the likelihood improved more than likelihoodThreshold.
• optimizeSizeHeuristic (boolean, default is false): Same as optimizeSize, but the binding mode is symmetrically expanded to the left and right and the flank length is incremented • optimizeFlankLength (boolean, default is false): Should the flank length be optimized? If true, the flank length is incremented, the model parameters are re-optimized, and the new model is kept if the likelihood improved more than likelihoodThreshold.
• optimizeMotifShift (boolean, default is false): Should shifted versions of the binding mode be tested? If true, the motif of the binding model is shifted to the left and right (separately), the model parameters are re-optimized, and the new model is kept if the likelihood improved more than likelihoodThreshold.
• optimizeMotifShiftHeuristic (boolean, default is false): Same as optimizeMotif Shift, but only a single shift is tested. This shift is found by first computing the information content for each position in the binding mode, then computing the 'center of mass' of the information content, and finally computing the shift such that the center of mass is at the center of the binding mode.
• maxSize (integer, default is -1): Maximum binding mode size;-1 indicates no limit • maxFlankLength (integer, default is -1): Maximum flank length; -1 indicates no limit • informationThreshold (float, default is 0.1): Threshold on the information content (computed for the first two and last two bases in the binding mode), determining if optimizeSize and optimizeSizeHeuristic should attempt to expand the binding mode to the left and right.
• positionBiasBinWidth (integer, default is 1): Configures the set of possible binding configurations in the probe sequence to be partitioned into bins with specified width and constrains the position-bias parameters ω a (x) (where x is a configuration) to be constant in each bin, thus reducing the number of independent parameters. By default, each bin contains a single configuration and no constraint is thus imposed.
• fittingStages (list of JSON objects, default is []): This setting instructs the optimizer to explore variations of the binding mode using a sequence of fitting stages. Each fitting stage can use a different set of variations and is defined by a JSON object that maps the included variations to true. The variations are: symmetry Function that defines the symmetry status of a binding mode

Output
ProBound outputs the model in the form of a JSON Object with keys modelSettings and coefficients.
Here, modelSettings specifies how the model was configured and coefficients contains all inferred parameter values. The latter object has the following keys: • countTable: List of JSON Objects with the parameters for the count table models. Each object has the form: h: List containing the values of h r ≡ ln η r , where the index r runs over rounds.
• enrichmentModel: List of JSON Objects with the parameters for the enrichment models. The only enrichment model with parameters is rhoGamma: rho: List containing the values of ρ r where the index r runs over rounds.
gamma: List containing the values of γ r where the index r runs over rounds.

6/17
• bindingModes: List of JSON Objects with the parameters for the binding modes. Each object has the form: activity: Two-level nested list containing the log-transformed binding mode activities ln α e,r , where the indices e runs over experiments (count tables) and r runs over SELEX rounds (columns in  the table).
mononucleotide: Single-level list containing the mononucleotide binding mode coefficients in ⃗ β a for binding mode a. This list can be thought of as a flattened PSAM: Letter c at position x in the PSAM has index L * x + c, where L is the length of the alphabet. For the standard alphabet this corresponds to: {β A,1 , β C,1 , β G,1 , β T,1 , β A,2 ...}.
dinucleotide: Two-level list containing the dinucleotide binding mode coefficients in ⃗ β a for binding mode a. The first index specifies the spacing between the interacting letters (0 is  NN, 1 is N.N, etc). The second index can be thought of as a flattened dinucleotide PSAM: A dinucleotide feature with letters c 1 and c 2 and with the first letter on position x has index L 2 x + Lc 1 + c 2 , where L is the length of the alphabet. For the standard alphabet this corresponds to {β AA,1 , β AC,1 , β AG,1 , β AT,1 , β CA,1 ...}.

ProBound configuration used in paper
ProBound was run with a variety of settings in order to learn the binding models shown in the figures. The corresponding configuration files are provided below. These settings utilize the two functions addTableDB and oututDB that only work in our internal computational environment, but both these functions can be substituted. For example, {"function": "addTableDB", "count_table_id": 2600 } loads a count table with internal count table ID 2600 using our database. This function call should be replaced with: {"function": "addTable", "countTableFile": "UbxIVa-Hth-Exd.30mer1.tsv.gz", "inputFileType": "tsv.gz", "variableRegionLength":30, "nColumns": 4, "leftFlank": GTTCAGAGTTCTACAGTCCGACGATC, "rightFlank": CCCGGGTCGTATGCCGTCTTCTGCTTG } The variable values for all count tables used below can be found in Extended Data Table S1 and S3. This table also contains the accession numbers for the published sequencing data used to generate the count tables (such as UbxIVa-Hth-Exd.30mer1.tsv.gz). The second internal function is 7/17 {"function": "outputDB", "fit_id": 6595 } This function sets the ProBound output files using our internal database. This function call should be replaced with {"function": "output", "outputPath": "/path/to/output", "baseName": "fit", "printTrajectory": true, "verbose": true } This function directs the output to the directory "/path/to/output" and names of the output files start with fit. Finally, some of the settings below seed the binding mode to have the sequence readout at the center. The seeding strings were based on earlier unseeded fits that are not shown. These unseeded fits explored different sizes, shifts, and flank lengths of the binding modes using optimizeFlankLength, optimizeMotifShiftHeuristic, and optimizeSizeHeuristic as illustrated by the first setting below.

TF binding models, single-experiment
In benchmarking ProBound, each training dataset was analyzed using three settings and the best binding model was then selected based on its ability to explain the training data (see Methods). The first setting utilized one non-specific binding mode (constant across sequences) and two PSAM binding modes. The size, frame shift and flank length of the PSAM binding modes were all optimized sequentially: Here metadata for each count table (variableRegionLength, nColumns, leftFlank, rightFlank, and, when available, data accession numbers) is available in Extended Data Table S1. The second binding setting was equivalent to the first except for two changes: the non-specific binding mode was replaced by a 1bp PSAM that can absorb some sequence bias, and only the first and lasts available SELEX round was used: Here rFirst and rLast should be replaced with the zero-based index of the first and last available SELEX round. The third setting was also identical to the first except it learned three PSAM binding modes: