AlphaLink: integrating crosslinks into AlphaFold2 via OpenFold

Crosslinking MS data have been used to guide candidate selection for AlphaFold-multimer in protein–protein interaction studies and validate models17,18. To fully leverage the potential of crosslinking MS data in protein structure prediction, we develop AlphaLink, a framework incorporating crosslinks directly into OpenFold19. OpenFold is a trainable reproduction of AlphaFold2. The creators of OpenFold verified that the implementation produces identical results. OpenFold primarily exploits co-evolutionary relationships. The main difficulty in merging multiple information sources is to find a suitable representation that facilitates integration and at the same time avoids information loss. OpenFold operates both in distance space (Evoformer) and in 3D space (Structure Module). Photo-AA crosslinking MS data provide distance restraints that naturally fit into the distance space of OpenFold, since they yield similar distances to co-evolutionary contacts by directly linking amino acids via diazirine chemistry. Co-evolutionary relationships and photo-AA crosslinks provide complementary and corroborating information. The sparsity of crosslinks can be compensated with co-evolutionary information. Accurate crosslinking data can act as an anchor in these cases. AlphaLink exploits this relationship by merging crosslinking MS and co-evolutionary data via the Evoformer, injecting crosslinks into the pair representation (z), yielding a consistent and unified constraint set (Fig. 1).

Fig. 1: Information flow in AlphaLink. a, Overview of the information flow in AlphaLink. Crosslinks (blue) are embedded and added onto the pair representation (green). Impact of crosslinks shown in red. b, Crosslinks influence the retrieval of co-evolutionary information. They are used as a bias in the MSATransformer. c, The pair representation is updated with information from the MSAs that have been biased with the crosslinks.

We introduce two representations to encode crosslinking information. The experimental data are represented as either soft labels or distance distributions (distograms). In the case of soft labels, each contact is weighted by the link-level false discovery rate (FDR) of the dataset (1-FDR) or, if present, the per-restraint FDR to indicate confidence in crosslink assignment. Distograms allow us to generalize to arbitrary distance restraints. A particular crosslinker (or distance restraint in general) is represented by a distance distribution. Contact-like restraints can be represented by uniformly distributed distograms for the given cutoff. We model uncertainty directly in the representation by adjusting the probability mass according to the FDR. The distogram is designed to match the distogram that is predicted by the Evoformer from the pair representation that consists of 64 bins. We use the same binning for the first 64 bins and extend the distogram further to 128 bins, spanning from 2.3125 Å to 42 Å.
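The two encodings can be illustrated with a minimal sketch. It assumes the 128 bins are indexed by 128 evenly spaced boundary values from 2.3125 Å to 42 Å (a spacing of 0.3125 Å, which follows from the numbers above). It also assumes that the FDR is modeled by placing a probability mass of 1-FDR uniformly below the cutoff and the remaining mass uniformly above it. The function names and this particular mass split are illustrative assumptions, not the AlphaLink implementation.

```python
# Bin layout taken from the text: 128 bins spanning 2.3125 Å to 42 Å.
N_BINS = 128
FIRST_EDGE = 2.3125
LAST_EDGE = 42.0
BIN_WIDTH = (LAST_EDGE - FIRST_EDGE) / (N_BINS - 1)  # 0.3125 Å


def bin_edges():
    """Return the 128 evenly spaced bin boundary values in Å."""
    return [FIRST_EDGE + k * BIN_WIDTH for k in range(N_BINS)]


def soft_label(fdr):
    """Soft-label encoding: a contact weighted by 1 - FDR."""
    return 1.0 - fdr


def uniform_restraint_distogram(cutoff, fdr):
    """Contact-like restraint as a distogram.

    ASSUMPTION: mass (1 - FDR) is spread uniformly over bins below the
    cutoff and the FDR mass uniformly over the remaining bins.
    """
    inside = [edge < cutoff for edge in bin_edges()]
    n_in = sum(inside)
    n_out = N_BINS - n_in
    if n_in == 0:
        raise ValueError("cutoff lies below the first bin")
    if n_out == 0:
        # Every bin is below the cutoff: all mass is spread uniformly.
        return [1.0 / N_BINS] * N_BINS
    p_in = (1.0 - fdr) / n_in
    p_out = fdr / n_out
    return [p_in if flag else p_out for flag in inside]
```

For example, `uniform_restraint_distogram(10.0, 0.05)` returns a distribution that sums to 1, with 95% of the mass in the bins below 10 Å.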

We embed the restraints and add them to the pair representation of OpenFold, which is later mapped into 3D space (Fig. 1a). The embedding is similar to the recycling embedding in AlphaFold2. The Evoformer jointly updates the MSA and the pair representation. The MSA transformer (Fig. 1b) retrieves co-evolutionary information and updates the MSA representation. The retrieval is biased with the pair representation that includes the experimental crosslinking information supplied by the user. The outer product mean (Fig. 1c) in turn updates the pair representation. This coupling maximizes synergy between MSA and experimental information and allows the network to perform noise rejection, that is, the rejection of misassigned experimental or co-evolutionary relationships or of contacts that do not support other strands of information leading to a consensus model.
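As a rough illustration of "embed and add", the sketch below projects per-pair restraint features with a linear map and adds the result to the pair representation. The linear form, the `weight` matrix and the function name are assumptions made for illustration; the text only states that the embedding is similar to the recycling embedding in AlphaFold2.

```python
def embed_restraints(pair, restraints, weight):
    """Add a linear embedding of restraint features to the pair representation.

    pair:       L x L x C nested lists (pair representation z)
    restraints: L x L x F nested lists (e.g. F = 1 for a soft label,
                F = 128 for a distogram)
    weight:     F x C nested lists (hypothetical learned projection)
    """
    n = len(pair)
    channels = len(weight[0])
    updated = []
    for i in range(n):
        row = []
        for j in range(n):
            features = restraints[i][j]
            # Linear projection of the restraint features into C channels.
            embedding = [
                sum(features[f] * weight[f][k] for f in range(len(features)))
                for k in range(channels)
            ]
            # Residual-style addition onto the existing pair entry.
            row.append([pair[i][j][k] + embedding[k] for k in range(channels)])
        updated.append(row)
    return updated
```

Pairs without a restraint carry all-zero features, so their entries in the pair representation are left unchanged.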

We initialized OpenFold with the original weights of AlphaFold2 and fine-tuned the network with the newly added crosslinking bias. We followed the refinement training regime outlined in the AlphaFold2 paper, except that we subsampled the number of effective sequences (N eff ) to simulate challenging targets. In light of the limited availability of experimental crosslink data for training, we simulated photo-crosslinking MS data (Methods) that included simulated experimental noise in the form of false residue–residue contacts at the given FDR.
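The idea of simulated crosslinks with false links at a given FDR can be sketched as follows. This is not the procedure from Methods: the 10 Å cutoff, uniform sampling over all residue pairs (with no restriction to crosslinkable residues) and the function name are assumptions chosen to keep the example small.

```python
import random


def simulate_crosslinks(dist, n_links, fdr, cutoff=10.0, seed=0):
    """Sample simulated crosslinks from a Cα–Cα distance matrix.

    dist:    square nested list of pairwise distances in Å
    n_links: total number of links to draw
    fdr:     fraction of links drawn from pairs at or beyond the cutoff
    Returns a list of (i, j, is_true_contact) tuples.
    """
    rng = random.Random(seed)
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    true_pool = [(i, j) for i, j in pairs if dist[i][j] < cutoff]
    false_pool = [(i, j) for i, j in pairs if dist[i][j] >= cutoff]

    # ASSUMPTION: the number of false links is the rounded FDR share.
    n_false = min(round(n_links * fdr), len(false_pool))
    n_true = min(n_links - n_false, len(true_pool))

    links = [(i, j, True) for i, j in rng.sample(true_pool, n_true)]
    links += [(i, j, False) for i, j in rng.sample(false_pool, n_false)]
    rng.shuffle(links)
    return links
```

With `n_links=10` and `fdr=0.2`, eight links come from true contacts and two from pairs beyond the cutoff.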

Integrating photo-AA crosslinks enables noise-tolerant prediction of challenging targets

We tested AlphaLink on 49 challenging CAMEO targets (N eff ≤ 25, no MSA subsampling, Supplementary Data 1) (Fig. 2a). AlphaLink outperforms AlphaFold2, substantially improving the performance on targets with more than 20 crosslinks. Integrating simulated photo-L data improves the TM score on average by 19.2 ± 16.3% (95% confidence interval) (Fig. 2a). Encoding the crosslinks as distograms instead performs virtually the same (Extended Data Fig. 1a).

Fig. 2: AlphaLink performance comparison against AlphaFold2. a, TM score comparison on 49 CAMEO targets with N eff ≤ 25. Error bars represent the 95% confidence intervals (N = 10). Points show the mean. TM score improves on average by 19.2%. b, Performance on 60 CASP14 and 45 CAMEO targets broken down by TM score (N eff = 10). AlphaLink improves on average by 15.2%. Number of targets in each range bin in brackets below. c, Performance on 60 CASP14 targets (N eff = 10) with different noise levels (FDR 0%, 5%, 10%, 20% and 50%). AlphaLink improves in the median for all noise levels. Performance shows robust noise rejection. Dotted line shows median performance of AlphaFold2. d, Predicted aligned error of AlphaFold2 (left) and AlphaLink (right) on T1064 with N eff = 10 (top) and predicted structures (bottom). Light regions signify high uncertainty. Sparse restraints decrease uncertainty across the whole protein. Satisfied crosslinks <10 Å Cα–Cα highlighted in blue, borderline crosslinks (10–15 Å Cα–Cα) in yellow, and violated crosslinks >15 Å Cα–Cα in red. Possible crosslinking sites (leucines) are shown as spheres. Regions with violated crosslinks in the AlphaFold2 prediction (left) increase in certainty (darker regions). TM score improves from 0.28 to 0.86. e, Performance on 60 CASP14 targets (N eff = 10) as a function of MSA size (N = 100, 10 MSAs and 10 crosslink sets). Dots represent the mean percentage of nonsatisfied crosslinks (>10 Å Cα–Cα) in the AlphaFold2 prediction. Improvement on average for all but full MSA size. Crosslink violation decreases and crosslink utility diminishes with increasing MSA size. Largest utility for N eff < 25. f, Performance without MSAs on 60 CASP14 and 45 CAMEO targets. AlphaLink predicts the correct fold (TM score >0.5) for 43/105 (13/105 for AlphaFold2). Error bars represent the 95% confidence interval (N = 10). Points show the mean. 
In all box plots, the line shows the median and the whiskers represent the 1.5× interquartile range.

We further curated a second benchmark dataset consisting of 60 CASP14 targets and 45 CAMEO targets (Supplementary Data 1). To simulate challenging targets and to control for the MSA influence, we subsampled the MSAs to N eff = 10 and ignored structural templates. Here AlphaLink improves the TM score on average by 15.2% (Extended Data Fig. 1b). For particularly challenging targets (N = 28), where AlphaFold2 fails to predict the correct fold (TM score ≤0.5), the TM score improves on average by 50.6% (Fig. 2b). AlphaLink predicts the correct fold (TM score >0.5) of 14 of these. We tested the noise rejection capabilities of AlphaLink on 60 CASP14 targets by adding false links to simulate multiple noise levels. The performance is roughly constant with 10%, 20% or 50% false links (Fig. 2c) and still outperforms AlphaFold2, demonstrating AlphaLink's robustness to different noise levels. Overall, the method achieves a crosslink satisfaction (<10 Å Cα–Cα) on average of 85 ± 1.2% (95% confidence interval) after three recycling iterations, and 88.3 ± 1.2% (95% confidence interval) of the simulated crosslinks with <10 Å Cα–Cα in the crystal structure are satisfied.

The sparse crosslink data act as anchor points that serve to pull the entire prediction towards the right solution (Fig. 2d). For CASP target T1064 (N eff = 10), four crosslinking restraints are sufficient to both drive the prediction to the native state (TM score improves from 0.28 to 0.86) and to decrease the predicted aligned error across the whole protein, including areas not covered by the crosslinking data. The crosslinking information has a wide-ranging impact due to its combination with the co-evolutionary and structural information embedded in the pair representation, which is used as a bias to retrieve contacts consistent with experimental data. Effectively, this improves the efficiency of using co-evolutionary information in AlphaFold2. Extended Data Fig. 1c shows the effect of using different distograms to encode a restraint between residues 11 and 103 in T1064. The Evoformer predicts a narrower distogram when using the expected distance distribution of photo-AA crosslinks as a prior, when compared with the uniform prior of an upper bound distance restraint. This representation slightly improves the prediction (TM score 0.68 to 0.7). The performance as a function of the number of crosslinks per residue is shown in Extended Data Fig. 1d. The performance generally increases with an increase in the number of crosslinks per residue. The main advantage of the distogram representation is enabling the user to inject distance restraints from different crosslinkers or even different experimental approaches into AlphaLink.

We test the performance of AlphaLink at different N eff levels to investigate the effect of crosslinks on targets with varying difficulty (Fig. 2e). The performance of both AlphaFold2 and AlphaLink deteriorates in the absence of sufficiently large MSAs (Fig. 2e). Crosslinks can compensate for smaller MSA sizes. In fact, photo-AA crosslinks alone without any MSA information allow us to predict the correct fold (TM score >0.5) of 43/105 benchmark targets, compared with 13/105 for AlphaFold2 without MSA information. The mean improvement in TM score increases to 75 ± 13.5% (95% confidence interval) over all targets (Fig. 2f). The benefit of crosslinks slowly disappears with an N eff > 50. This is at least partly due to the fact that most crosslinks are already satisfied when predicting with full MSAs (Fig. 2e). Rather than finding any solution that fits the crosslinks, our network appropriately weighs crosslinking MS information against the MSAs and uses it to guide the prediction to a more accurate solution. Note that as MSA size increases, the network will rely more on MSA information than on crosslinks; hence, we implement settings with different MSA subsamplings in the AlphaLink software package.

In summary, AlphaLink enables users to use sparse distance restraints to bias AlphaFold2 predictions, robustly handling noise, directly at the inference stage, due to their synergistic implementation in the network design.

Photo-L as an in situ structural probe

To generate a large-scale experimental photo-AA dataset required for testing such an application, we derived in situ structural restraints on the E. coli membrane fraction by crosslinking MS of cells grown on photo-L-containing medium. We optimized the growth protocol to maximize incorporation while maintaining a low level of cytotoxicity (750 μM photo-L in the medium, Extended Data Fig. 2a), illuminated the cells with ultraviolet (UV) light for crosslinking and then enriched the cell membrane of the crosslinked cells. The proteins were digested, and the resulting peptides were subjected to two-dimensional fractionation, combining strong cation exchange and size exclusion chromatography (Extended Data Fig. 2b). Mass spectrometric analysis then led to the identification of 615 residue pairs involving 112 proteins at 5% link-level FDR (Fig. 3a, Extended Data Fig. 2c and Supplementary Data 2 and 3). Several crosslinks are detected among β-barrel proteins and proteins in the intermembrane space, including porins and known membrane complexes (Fig. 3a and Extended Data Fig. 2a). When visualized on known protein structures, the experimental crosslinks provide a median distance of 11.1 ± 8.1 Å Cα–Cα (mean ± standard deviation) (Fig. 3b), indicating the contact-like nature of these crosslinks in line with their implementation in AlphaLink. This is further supported by the fact that we exclude crosslinks within the same tryptic peptide, and between consecutive peptides in our analysis.

Fig. 3: In situ photo-L crosslinking MS in E. coli. a, Distance restraints from in-cell photo-L crosslinking MS mapped onto cellular complexes. b, Distance distributions for photo-L crosslinks mapped onto known structures, taking only a single conformer per protein. Bissulfosuccinimidyl suberate (BS3) and disuccinimidyl sulfoxide (DSSO) distograms are obtained by mapping the crosslinks of Lenz et al. 33. The distograms are derived by accounting for homo-multimers (top) or mapping to only within-chain distances (bottom). c, Distance restraint analysis of outer membrane proteins crosslinked with photo-L. The dot represents the median, and the whiskers represent the 1.5× interquartile range.

Photo-L provides validation for the in situ conformation of multiprotein complexes such as the AcrAB-TolC multidrug efflux pump, ribosome and ATP synthase (Fig. 3a). The crosslinks are consistent with previously characterized conformations of the bacterial outer membrane barrel assembly machinery (Bam). However, a link between the P2 and P3 domains highlights the flexibility of these modules (Fig. 3c), which are known to undergo large structural rearrangements in outer membrane protein folding and insertion. A total of 153 crosslinks are detected for the highly abundant protein OmpA. OmpA is made up of a β-barrel connected via a 20-residue linker to a C-terminal domain. It is also known to oligomerize in vivo, and this interaction is thought to be mediated by the C-terminal domain. The crosslinks between the β-barrel, linker and C-terminal domain highlight the relative flexibility of these modules (Fig. 3c) and point to potential contacts made between multiple copies of the C-terminal domain. In several plug-containing β-barrel proteins, such as FhuA and BtuB, photo-L links the position of the central plug with the membrane barrel in a way that is consistent with previous structures (Fig. 3c), validating the arrangement of these two modules in the functional cycle of the proteins. These crosslinks highlight the potential of photo-L to provide in situ residue–residue contacts regardless of solvent accessibility, providing insight into function for critical domain contacts.

Structure prediction with in situ photo-L data

To test AlphaLink on experimental data, we predicted the proteins in the crosslinking MS dataset of the E. coli membrane fraction. We focused our evaluation on the 31 targets with high-resolution structures that had a median of five crosslinks (Fig. 4). Each target was predicted with ten randomly subsampled MSAs at N eff = 10, yielding 310 predictions (Supplementary Data 4). We subsampled the MSAs to counter overfitting, because the targets were probably part of the AlphaFold2 training set. Even with N eff = 10, 65% of the AlphaFold2 predictions exceed a TM score of 0.8. AlphaLink improves performance measured by TM score on average by 5.2 ± 1.9% (95% confidence interval) across all proteins relative to AlphaFold2.
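For readers who want to compute a comparable summary on their own predictions, the sketch below derives a mean percentage improvement in TM score with a 95% confidence half-width. It assumes per-prediction relative gains and a normal approximation (z = 1.96). How the intervals reported here were computed is not stated in this section, so this is one plausible reading, not the authors' procedure.

```python
import math
import statistics


def mean_percent_improvement(baseline_tm, new_tm, z=1.96):
    """Mean relative TM score gain (%) and its 95% confidence half-width.

    baseline_tm, new_tm: paired TM scores for the same predictions
    (e.g. AlphaFold2 and AlphaLink on identical MSA subsamples).
    ASSUMPTION: normal-approximation interval on per-prediction gains.
    """
    gains = [100.0 * (new - base) / base for base, new in zip(baseline_tm, new_tm)]
    mean_gain = statistics.fmean(gains)
    half_width = z * statistics.stdev(gains) / math.sqrt(len(gains))
    return mean_gain, half_width
```

With 31 targets and ten MSA subsamples each, both input lists would hold 310 paired scores.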

Fig. 4: Structure prediction with in-cell photo-L crosslinking MS data of the E. coli membrane fraction. a, Comparison of TM score with annotated number of links (marker sizes) and percentage of nonsatisfied (>10 Å) crosslinks (color gradient) in the AlphaFold2 prediction. Performance improvement is bigger for targets with a higher percentage of nonsatisfied crosslinks in the base prediction (darker circles). Each target is predicted ten times with different MSA subsamples at N eff = 10. AlphaLink outperforms AlphaFold2 on average. b, Comparison of TM score with annotated mean distance of nonsatisfied crosslinks in the base AlphaFold2 prediction (color gradient). Prediction quality improves with stronger crosslink violations (darker circles). c, We show the calibration of the pTM. On predictions that are at least 80% covered by the crystal structure, the correlation is 0.75. The true TM score is generally underestimated, meaning that the pTM score of AlphaLink is a conservative estimate. The shaded area corresponds to the 95% confidence interval. Line shows the linear fit. d, Prediction of the ATP synthase subunit AtpB by AlphaFold2 and AlphaLink using in-cell photo-L crosslinks at N eff = 10. e, Prediction of the outer membrane lipopolysaccharide assembly protein. f, Prediction of the ferrienterobactin receptor. In all three cases, the in-cell crosslinking data helps AlphaLink position different regions of the protein relative to each other, yielding a performance improvement over AlphaFold2. The crystal structure of the target protein is shown in gray, overlaid with the AlphaLink prediction.

On targets where AlphaFold2 does not provide accurate models (TM score <0.8), AlphaLink with experimental data improves the TM score on average by 15.9 ± 4.6% (95% confidence interval). The improvement increases to 47.8 ± 24.8% (95% confidence interval) for AlphaFold2 predictions below a TM score of 0.5. We predict the correct fold (TM score >0.5) for ten additional proteins. This shows that simulated crosslinking MS data successfully model the features of experimental photo-AA restraints. For the 204 AlphaFold2 predictions with a TM score of 0.8 or higher, the performance is unaffected. At high TM scores, side-chain conformations begin to play a role, and crosslinking MS data do not have the resolution necessary to improve side-chain predictions.

To better judge the utility of the crosslinks for a given target, we include the percentage of nonsatisfied crosslinks in the baseline AlphaFold2 prediction (Fig. 4a) and also consider the mean distance of the nonsatisfied crosslinks in the AlphaFold2 prediction (Fig. 4b). We set the cutoff for violated crosslinking restraints to 10 Å Cα–Cα in the crystal structure. Many targets are not completely covered by the crystal structure. Therefore, we can analyze only a subset of the crosslinks. Crosslinks that are already satisfied in the AlphaFold2 predictions do not contribute novel information. On average, there are 0.5 violated crosslinking restraints per prediction at a cutoff of 10 Å Cα–Cα. Indeed, the TM score improvement of AlphaLink generally increases wherever AlphaFold2 makes a prediction containing unsatisfied crosslinks. We further show that the predictions that improved the most have unsatisfied crosslinks with large distances in the baseline prediction (Fig. 4b). Here crosslinks add the most value, and for some predictions a single crosslink is enough to improve the quality considerably (TM score 0.39 to 0.86 for target AtpB). Extended Data Fig. 2d shows two examples where adding crosslinks negatively impacts the prediction quality. In the case of OmpF there are multiple overlength crosslinks (highlighted in red in the native structure) that might stem from crosslinking different subunits, since OmpF is a homo-multimer. For the ATP synthase α subunit there is one overlength crosslink that is probably a false positive. Here, although the link is rejected in the end, it still induces a domain movement that leads to a worse prediction.
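The two quantities used here to judge crosslink utility can be sketched as below: the percentage of nonsatisfied crosslinks in a baseline model, and their mean Cα–Cα distance, restricted to links whose residues are resolved in the crystal structure. The data layout (a dictionary of Cα coordinates keyed by residue number and a set of resolved residues) and the function name are assumptions for illustration.

```python
import math


def nonsatisfied_summary(ca_coords, links, resolved, cutoff=10.0):
    """Summarize crosslinks that a model fails to satisfy.

    ca_coords: dict residue number -> (x, y, z) Cα position in the model
    links:     iterable of (residue_i, residue_j) crosslinks
    resolved:  set of residue numbers covered by the crystal structure
    Returns (percent nonsatisfied, mean distance of nonsatisfied links in Å).
    """
    # Only links with both residues resolved can be evaluated.
    usable = [(i, j) for i, j in links if i in resolved and j in resolved]
    distances = [math.dist(ca_coords[i], ca_coords[j]) for i, j in usable]
    violated = [d for d in distances if d > cutoff]

    percent = 100.0 * len(violated) / len(usable) if usable else 0.0
    mean_distance = sum(violated) / len(violated) if violated else 0.0
    return percent, mean_distance
```

A high percentage combined with a large mean distance marks the predictions where, according to Fig. 4a,b, crosslinks add the most value.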

To investigate the correlation between predicted and true TM score for the predictions of the membrane fraction, we compute the fit on the predictions where the crystal structure covers at least 80% of the protein (Fig. 4c). The Pearson correlation coefficient is 0.75. We generally underestimate the true TM score. The correlation is in line with the baseline AlphaFold2 model (Extended Data Fig. 3), indicating that model confidence estimates of AlphaLink are comparable to AlphaFold2, allowing users to interpret predictions reliably.
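The calibration check described above can be reproduced in outline as follows. The record layout `(ptm, tm, coverage)` and the function names are assumptions. The Pearson coefficient is computed directly from its definition, and the mean signed difference indicates whether pTM under- or overestimates the true TM score.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def ptm_calibration(records, min_coverage=0.8):
    """Correlate pTM with true TM score on well-covered predictions.

    records: iterable of (ptm, tm, coverage), where coverage is the
             fraction of the protein covered by the crystal structure.
    Returns (pearson r, mean of pTM - TM); a negative mean difference
    means pTM underestimates the true TM score.
    """
    kept = [(p, t) for p, t, c in records if c >= min_coverage]
    ptm, tm = zip(*kept)
    mean_bias = sum(p - t for p, t in kept) / len(kept)
    return pearson(ptm, tm), mean_bias
```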

Extended Data Fig. 4 shows the predicted TM score (pTM) on a total of 96 targets, which include proteins where no structure is available. Each protein was predicted with one randomly subsampled MSA (N eff = 10). The pTM indicates possible improvements over AlphaFold2 on these structures as well.

Probing conformational dynamics in situ

To probe whether experimental distance restraints can act as anchors to drive predictions towards different energy minima in multistate proteins, we simulate a proof-of-concept experiment on the human cyclin-dependent protein kinase Cdk2, a drug target in cancer therapy20. Activation of Cdk2 in the S phase proceeds via a conformational change in the T-loop (residues 145–165) and the PSTAIRE helix (residues 45–55) triggered by binding of cyclin A21. There are several structures of Cdk2 in various states of activation22,23. If Cdk2 is predicted without structural templates with AlphaFold2 (N eff = 10), the T-loop is predicted in an intermediate conformation between the apo, auto-inhibited state and the cyclin A-bound conformation (Fig. 5a). Presumably, the intermediate conformation of this loop in the AlphaFold2 prediction is a consequence of co-evolutionary information driving it towards both the open and the closed state. When run with full MSA information, all AlphaFold2 predictions converge to the cyclin A-bound state (Extended Data Fig. 5a), failing to predict the inactive conformation.

Fig. 5: Photo-AA data guiding prediction of specific conformational states. a, Left: structures of the monomeric, inhibited conformation of Cdk2 (teal)34 and the cyclin A-activated conformation (salmon)35 overlaid with the AlphaFold2 prediction of Cdk2 performed at N eff = 10. Right: focus on the T-loop and PSTAIRE helix involved in protein activation, with the two photo-AA restraint sets fed to AlphaLink colored according to the corresponding protein state. b, Comparison of the AlphaFold2 prediction with the two predictions of AlphaLink made with restraint sets corresponding to the active or inactive conformation of Cdk2, showing that the photo-AA data drive the prediction to either the active or inhibited conformation. c, Middle: overlay of the AlphaLink prediction with the crystal structure for the inhibited state. Right: overlay of the AlphaLink prediction with the structure for the cyclin A-bound state, showing the entire conformation of the loop is correctly predicted despite only sparse restraints being present. d, Outcome of predicting with a combined set of restraints. At low N eff values, the crosslinks drive the prediction towards the cyclin A-bound state. As the MSA information increases, the prediction is steered more towards the inhibited state and closer to the AlphaFold2 prediction.

We simulate two photo-crosslinking MS experiments in which the protein was acquired in either its inhibited or in its cyclin A-bound states, generating two sets of sparse restraints for the T-loop (Supplementary Table 1). Such experiments may be carried out on the purified protein or in cells before protein purification. We then predict the Cdk2 structure using AlphaLink with these restraints, showing that the loop structure is driven towards the appropriate conformation (Fig. 5b). The crosslinks act as anchor points positioning the whole T-loop in the appropriate configuration for the cyclin A-bound state, with a Cα r.m.s.d. of 1.24 Å on residues 145–165 to PDB 2bpm (Fig. 5c). The same is true for the inactive state of the loop. In this case, however, lack of leucine and lysine residues around T160 in the structure leads to a lack of sufficient restraints to capture the fully closed loop conformation, leading to a slightly higher Cα r.m.s.d. to the target structure (3.19 Å to 1h01), while still outperforming the AlphaFold2 prediction (Cα r.m.s.d. 6.29 Å to 1h01). This higher r.m.s.d. is also consistent with the fact that the T-loop is not rigid in its inhibited, dephosphorylated state, as highlighted by multiple crystal structures and molecular dynamics simulations24. AlphaLink successfully folds the short helix within the T-loop (residues 147–153) in the inactive state, and unfolds it into an extended conformation when given restraints for the cyclin A-bound state. It also correctly predicts the position and rotation of the PSTAIRE helix, despite having only two restraints in this region in the inactive conformation dataset, and three in the active dataset. In the case of a mixture of restraints, the prediction converges on the cyclin A-bound state at N eff = 10 (Fig. 5d). This conformer is not produced by AlphaFold2. Increasing the MSA steers the prediction towards a middle ground that is more similar to the AlphaFold2 prediction. We interpret this as the algorithm performing noise rejection on a subset of crosslinks in the mixture and using the rest as anchor points to drive a prediction towards a particular solution.
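The loop-level comparison above relies on a Cα r.m.s.d. restricted to a residue range. A minimal sketch is given below. It assumes the model and reference coordinates have already been superposed (no fitting is performed here) and that Cα positions are stored in dictionaries keyed by residue number. Which atoms the superposition was computed on in the analysis above is not stated in this section.

```python
import math

# Residues 145–165 inclusive, the T-loop range given in the text.
T_LOOP = range(145, 166)


def ca_rmsd(model, reference, residues):
    """Cα r.m.s.d. in Å over the given residues.

    model, reference: dict residue number -> (x, y, z) Cα coordinates.
    ASSUMPTION: both structures are already in the same frame, i.e.
    superposed beforehand; this function only measures deviations.
    """
    squared = [math.dist(model[r], reference[r]) ** 2 for r in residues]
    return math.sqrt(sum(squared) / len(squared))
```

Calling `ca_rmsd(model, reference, T_LOOP)` then gives the deviation over the 21 T-loop residues only.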

To further show the influence of the MSA, we predict the conformation of the fold-switching protein KaiB (Extended Data Fig. 5b) with photo-L crosslinks simulated for the ground state, the fold-switched state or a mixture added on top of random sets of simulated photo-L crosslinks. At low N eff , AlphaLink predicts both conformers accurately when given unique sets of crosslinks, but as MSA evidence gets larger, the prediction converges to one state for both sets. This result reproduces the outcome of running AlphaFold on KaiB with different, clustered subsamples of the MSA25. Predictions with mixed crosslinks lead to different outcomes at different N eff values, as observed in the case of Cdk2, pointing to the fact that crosslinking is weighted against the MSA depending on the information content and size of both strands of information. In multiple simulated crosslinking datasets for the protein selecase (Extended Data Fig. 5b), even without MSAs, most predictions end up in the conformation observed in the monomeric state of the protein, although some predictions corresponding to the bound state are observed when given unique crosslinks in the absence of MSA information.

These results demonstrate that AlphaLink can be used to obtain high-quality predictions of particular conformations of proteins given sets of restraints obtained under different conditions, enabling direct monitoring of conformational states in solution and in situ.