Automated NMR resonance assignments and structure determination using a minimal set of 4D spectra

Automated methods for NMR structure determination of proteins are continuously becoming more robust. However, current methods addressing larger, more complex targets rely on analyzing 6–10 complementary spectra, suggesting the need for alternative approaches. Here, we describe 4D-CHAINS/autoNOE-Rosetta, a complete pipeline for NOE-driven structure determination of medium- to larger-sized proteins. The 4D-CHAINS algorithm analyzes two 4D spectra recorded using a single, fully protonated protein sample in an iterative ansatz where common NOEs between different spin systems supplement conventional through-bond connectivities to establish assignments of sidechain and backbone resonances at high levels of completeness and with a minimum error rate. The 4D-CHAINS assignments are then used to guide automated assignment of long-range NOEs and structure refinement in autoNOE-Rosetta. Our results on four targets ranging in size from 15.5 to 27.3 kDa illustrate that the structures of proteins can be determined accurately and in an unsupervised manner in a matter of days.

(a) For every TOCSY AAIG (red, center), the correlated 13C-1H frequencies (i-1) are matched to the correlated 13C-1H frequencies (i and others) of any other NOESY AAIG (black). The occupancy rate of matched frequencies differs among the NOESY AAIGs (blue arrows). The higher the occupancy rate, the more likely the sequential connectivity is to be correct. For visualization purposes, the carbon frequencies of folded peaks have been unfolded manually. (b) The connectivity information is used to generate a directed rooted tree from AAIG X100, here shown for a chain length of 3. The orange directed edges represent the correct path in the sequential assignment problem.

A N I V G G I E Y S I N N A S L C S V G F S V T R G A T K G F V T A G H C G T V N A T A R I G G A V V G T F A A R V F P G N D R A W V S L T S A Q T L L P R V A N G S S F V T V R G S T E A A V G A A V C R S G R T T G Y Q C G T I T A K N V T A N Y A E G A V R G L T Q G N A C M G R G D S G G S W I T S A G Q A Q G V M S G G N V Q S N G N N C G I P A S Q R S S L F E R L Q P I L S Q Y G L S L V T G
TOCSY & NOESY (99.2%)

M G Q V S A V S T V L I N A E P A A V L A A I S D Y Q T V R P K I L S S H Y S G Y Q V L E G G Q G A G T V A T W K L Q A T K S R V R D V K A T V D V A G H T V I E K D A N S S L V S N W T V A P A G T G S S V N L K T T W T G A G G V K G F F E K T F A P L G L R R I Q D E V L E N L K K H V E G
CBHBCAHA(CO)NH & NOESY (97.1%)

M G Q V S A V S T V L I N A E P A A V L A A I S D Y Q T V R P K I L S S H Y S G Y Q V L E G G Q G A G T V A T W K L Q A T K S R V R D V K A T V D V A G H T V I E K D A N S S L V S N W T V A P A G T G S S V N L K T T W T G A G G V K G F F E K T F A P L G L R R I Q D E V L E N L K K H V E G
TOCSY & NOESY (100.0%)

S E Q F T T K L N T L E D S Q E S I S S A S K W L L L Q Y R D A P K V A E M W K E Y M L P R S V N T R R K L L G L Y L M N H V V Q Q A K G Q K I I Q F Q D S F G K V A A E V L G R I N Q E F P R D L K K K L S R V V N I L K E R N I F S K Q V V N D I E E R S L A A A L E H
RTT: 134 residues; ms6282: 145 residues; aLP: 198 residues; nEIt: 248 residues.

Supplementary Figure 5. Assignment statistics for all protein targets using 4D-CHAINS or FLYA. Quality of assignments using 4D-CHAINS for TOCSY-NOESY, Cα/β-NOESY, and NOESY input data, or FLYA input data. Input data explanation for the algorithms: TOCSY-NOESY, perform NH-mapping and aliphatic assignments using 4D-HCNH TOCSY and 4D-HCNH NOESY; Cα/β-NOESY, perform NH-mapping and aliphatic assignments using 4D CBHBCAHA(CO)NH synthetic data and 4D-HCNH NOESY; NOESY, perform aliphatic assignments using fixed 1H, 15N HSQC assignments and 4D-HCNH NOESY; FLYA, perform NH-mapping and aliphatic assignments using 4D-HCNH TOCSY, 4D-HCNH NOESY, and 4D-HCCH NOESY.

Based on fixed 15N-1H frequencies, 4D-CHAINS has clustered 18 NOE peaks belonging to the L330 amide (black) and 13 NOE peaks belonging to the T331 amide (red). The task is to assign the 13C-1H aliphatic frequencies of L330. To do so, 4D-CHAINS first identifies the common NOEs that the L330 amide shares with the next amide in the protein sequence, that is, T331. The total number of common NOEs is 8, indicated by gray arrows. For each common NOE, 4D-CHAINS derives the probability of it belonging to a certain atom type of Leu from the 2D probability map, shown in the background. Each probability is modified by the intensity of the corresponding peak to give a score. The highest product of scores provides the assignments for the missing atom types. In this case all atom types are assigned correctly (labeled arrows). For visualization purposes, the carbon frequencies of folded peaks have been unfolded manually.

Figure 9. Schematic overview of the novel, fully automated approach to NMR assignments and structure determination of large protonated proteins. The 4D-HCNH TOCSY spectrum allows for spin system identification by coupling the preceding aliphatic 13C-1H resonances to the backbone amide.
The 4D-HCNH NOESY spectrum, on the other hand, reports through-space correlations for a backbone amide, including mainly the intraresidue ones. The novel algorithm 4D-CHAINS uses 2D probability density maps of chemical shifts to first identify the spin systems, and then match spin system information between the two spectra (TOCSY-NOESY). The advantage is that the correlated chemical shifts of 13 C-1 H moieties establish robust connectivities for resonance assignment in one shot (sequential and sidechain) and reduce ambiguities in downstream NOE-based structure calculation. 4D-CHAINS automated assignments are then passed to autoNOE-Rosetta, together with the peaklists of two 4D NOESY spectra (HCNH and HCCH), for unsupervised NMR structure determination of large protonated proteins.

Supplementary Figure 9
values of 10 lowest energy structures obtained from RASREC/CS-Rosetta and autoNOE-Rosetta structure calculations, compared to the closest PDB deposited reference structures. The reference structure used for RTT is a solution NMR structure with 100% sequence identity (PDB ID 2KM), for ms6282 a crystal structure with 30% sequence identity (PDB ID 4PSB), for aLP a crystal structure with 100% sequence identity (PDB ID 1P01) and for nEIt a solution NMR structure with 45% sequence identity (PDB ID 1EZB). The diagonal line indicates equal R.M.S.D. values of ensembles calculated using the two methods. Adjacent dashed lines indicate R.M.S.D. values of RASREC/CS-Rosetta and autoNOE-Rosetta ensembles within 2 Å of one another. Points below the diagonal correspond to lower R.M.S.D. values for ensembles calculated using autoNOE-Rosetta relative to RASREC/CS-Rosetta. The area of the circles in the plot is proportional to the size of the proteins (number of residues). The autoNOE-Rosetta ensembles were calculated using supervised resonance assignments and both HCNH+HCCH NOE peak lists.

using RDCs, in addition to chemical shifts and NOEs. The bars indicate the total numbers of assigned long-range HCNH (amide to aliphatic) and HCCH (aliphatic to aliphatic) restraints, including ambiguous restraints obtained for different stereo-specific groups. Images of structural ensembles were produced using Chimera (https://www.cgl.ucsf.edu/chimera).

Figure 12. Frequency of the buried residues in 10 lowest-energy structures with correct χ1 dihedral placements. All the panels represent frequencies of the residues in the 10 lowest-energy models whose dihedral placements (χ1) are within 30° (considered correct) of the corresponding rotamers in the X-ray structure. The buried residues were calculated using a 10 Å² solvent accessible surface area threshold and further filtered based on secondary structure (α-helix and β-sheet) to retain only the core, rigid residues. Of the core, rigid residues, only those whose side chain conformers were consistent with a crystallographic ensemble of 6 structures (PDB IDs 1P01, 1QRX, 1SSX, 1TAL, 2ALP, 2H5C and 2ULL) were used. The top panel was obtained for the structural ensemble calculated using supervised assignments with HCNH+HCCH NOEs. The middle and bottom panels were obtained for the structural ensembles calculated using 4D-CHAINS TOCSY-NOESY automated assignments with HCNH and HCNH+HCCH NOEs. Solvent accessible surface area was calculated using PyMOL software (https://www.pymol.org).

All the panels represent the total number of long-range HCCH (orange) and HCNH (grey) NOE restraints predicted by autoNOE-Rosetta and CYANA using supervised assignments (black) and automated 4D-CHAINS assignments derived from TOCSY-NOESY spectra (purple).

The autoNOE-Rosetta and CYANA ensembles were calculated using supervised (yellow) and automated (4D-CHAINS using TOCSY-NOESY spectra) (gray) resonance assignments and both HCNH+HCCH NOE peak lists.

Missing refers to aliphatic carbon types that could not be assigned from TOCSY or Cα/β synthetic data to yield 100% completeness of aliphatic chemical shifts for the four protein targets. For TOCSY-NOESY and Cα/β-NOESY, 4D-CHAINS correctly performed the mapping of backbone amide frequencies to the protein sequences, whereas for NOESY, the 1H, 15N HSQC frequencies were fixed. 4D-CHAINS performance is shown when using only 2D probability heat maps or a combined function that takes into account the corresponding relative peak intensities in obtaining atom type assignments from the 4D-HCNH NOESY spectrum. Combined error rate corresponds to the overall 4D-CHAINS assignment performance using a combination of TOCSY-NOESY or Cα/β-NOESY.
By default, 4D-CHAINS automated assignments are derived from two spectra (4D-HCNH TOCSY, 4D-HCNH NOESY). FLYA automated assignments were produced using three spectra as input (4D-HCNH TOCSY, 4D-HCNH NOESY, 4D-HCCH NOESY).

Tutorial
To run 4D-CHAINS use the 4Dchains.py script. You can do all operations you wish with this script as long as you provide the appropriate protocol file, e.g.
4Dchains.py -protocol protocol.txt

You can find example files for the nEIt protein (248 residues) under the "tutorials/nEIt/" folder. In that folder you will find the protocol file "nEIt_protocol.txt", which contains all the input parameters for the program, and the following input files:
nEIt.fasta is the sequence of the protein in fasta format.
nEIt_HSQC.list is the {N-H}-HSQC peak list in sparky format, which consists of 3 columns: <label> <N resonance> <HN resonance>
nEIt_TOCSY.list is the 4D-TOCSY peak list in sparky format, which consists of 5 columns: <?-?-?-?> <H resonance> <C resonance> <N resonance> <HN resonance>
nEIt_NOESY.list is the 4D-NOESY peak list in sparky format, which consists of 6 columns (the last column can be omitted, but this will decrease the accuracy of NOESY-based assignments): <?-?-?-?> <H resonance> <C resonance> <N resonance> <HN resonance> <peak intensity>
nEIt_TOCSY.list.curated is optional. It is the same as "nEIt_TOCSY.list", but contains the supervised peak assignments for proofreading of the 4D-CHAINS automatic assignments.
nEIt_NOESY.list.curated is optional. It is the same as "nEIt_NOESY.list", but contains the supervised peak assignments for proofreading of the 4D-CHAINS automatic assignments.

The protocol file consists of compulsory and optional directives. The compulsory directives must be defined in order for 4D-CHAINS to run, whereas the optional ones can be omitted. Briefly, the available protocol file "nEIt_protocol.txt" will perform sequential and sidechain assignments of the protein using a 4D-TOCSY and a 4D-NOESY peak list. First, the N-H resonances are mapped to the protein sequence. These N-H resonances are then used to assign the aliphatic frequencies of the 4D-TOCSY peak list. Finally, the assignments from 4D-TOCSY are transferred to the respective 4D-NOESY peaks, and any missing aliphatic assignments are derived from the NOESY peak list.
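The 4D peak-list layouts described above can be read with a few lines of Python. This is only a minimal sketch for inspecting input files, not part of 4D-CHAINS: the names Peak4D and parse_4d_peaks are hypothetical, and it assumes whitespace-separated columns in the order given in the tutorial, with the intensity column optional.

```python
from typing import List, NamedTuple, Optional


class Peak4D(NamedTuple):
    """One row of a 4D peak list (hypothetical container, for illustration)."""
    label: str
    h: float                    # aliphatic 1H resonance
    c: float                    # aliphatic 13C resonance
    n: float                    # amide 15N resonance
    hn: float                   # amide 1H resonance
    intensity: Optional[float]  # None when the 6th column is absent


def parse_4d_peaks(text: str) -> List[Peak4D]:
    """Parse rows of the form <label> <H> <C> <N> <HN> [<intensity>]."""
    peaks = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) not in (5, 6):
            continue  # blank line or unexpected layout
        try:
            values = [float(x) for x in fields[1:]]
        except ValueError:
            continue  # header or other non-numeric row
        intensity = values[4] if len(values) == 5 else None
        peaks.append(Peak4D(fields[0], values[0], values[1],
                            values[2], values[3], intensity))
    return peaks


example = """?-?-?-? 0.85 21.3 118.2 8.31 153200
?-?-?-? 4.12 58.9 118.2 8.31
"""
for peak in parse_4d_peaks(example):
    print(peak)
```

The chemical shift values in the example are made up. Rows without an intensity are kept with intensity set to None, which mirrors the tutorial's note that the last NOESY column may be omitted.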
A short description of the most important directives is given below (for a detailed description of all available directives refer to the manual in the github repository).
fasta is the fasta sequence file.
HSQC is the {N-H}-HSQC peak list file.
4DTOCSY is the 4D-TOCSY peak list file.
4DNOESY is the 4D-NOESY peak list file.
user_4DTOCSY_assignedall is the optional 4D-TOCSY peaklist file with the supervised assignments. If this file, along with "user_4DNOESY_assignedall", is provided, then 4D-CHAINS will proofread the automated TOCSY assignments and will add labels (<CORRECT>, <WRONG>).
user_4DNOESY_assignedall is the optional 4D-NOESY peak list file with the supervised assignments. If this file, along with "user_4DTOCSY_assignedall", is provided, then 4D-CHAINS will proofread the automated NOESY assignments and will add labels (<CORRECT>, <WRONG>).
The following compulsory directives define the number of cycles for NH-mapping. All of them must have a consistent number of values separated by ",". In each cycle, 4D-CHAINS performs iterative NH-mapping, each time using peptides of a different length. It starts the first iteration by building long peptides, performs the NH-mapping, and then uses the mapped Amino Acid Index Groups (AAIGs) as restraints for the next iteration, which is conducted using shorter peptides. Each cycle consists of the number of iterations necessary to bring the peptide length from first_length to last_length. As such, the values
first_length=6,6,6,6,6
last_length=4,4,4,4,3
instruct 4D-CHAINS to run 5 cycles: the first 4 cycles consist of 3 iterations each (one with 6-mer peptides, one with 5-mers and one with 4-mers), whereas the 5th cycle consists of 4 iterations (one with 6-mers, one with 5-mers, one with 4-mers and finally one with 3-mers).
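The relationship between first_length, last_length and the resulting iterations can be checked with a short script. This is an illustrative sketch only; expand_cycles is a hypothetical helper, not a function of 4D-CHAINS, and it assumes that the peptide length decreases by one residue per iteration, as in the example above.

```python
from typing import List


def expand_cycles(first_length: str, last_length: str) -> List[List[int]]:
    """Return, for each cycle, the peptide length used in each iteration."""
    firsts = [int(v) for v in first_length.split(",")]
    lasts = [int(v) for v in last_length.split(",")]
    if len(firsts) != len(lasts):
        raise ValueError("first_length and last_length need the same number of values")
    cycles = []
    for first, last in zip(firsts, lasts):
        if last > first:
            raise ValueError("last_length must not exceed first_length")
        # One iteration per peptide length, from long to short.
        cycles.append(list(range(first, last - 1, -1)))
    return cycles


for number, lengths in enumerate(expand_cycles("6,6,6,6,6", "4,4,4,4,3"), start=1):
    print(f"cycle {number}: {len(lengths)} iterations, peptide lengths {lengths}")
```

For the values above, this reports 3 iterations for each of the first four cycles and 4 iterations for the fifth.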
In each cycle you can control the values of the following parameters:
mcutoff is a floating point number between 0.0 and 1.0 (a fraction) that controls the occupancy rate of TOCSY-NOESY connectivities used in the buildup of chains.
zmcutoff is a floating point number (Z-score) that controls how many of the connected NOESY AAIGs satisfying the given mcutoff will be retained for chain formation.
zacutoff is a floating point number specifying the lower Z-score for an amino acid type prediction to be considered as valid.
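To make the roles of mcutoff and zmcutoff more concrete, the sketch below applies the two thresholds in sequence to a set of candidate connectivities. It is a simplified illustration under stated assumptions, not the 4D-CHAINS implementation: the function name filter_connectivities is hypothetical, the occupancy values are made up, and the Z-score is assumed to be computed over the candidates that already pass mcutoff, using the population standard deviation.

```python
from statistics import mean, pstdev
from typing import Dict


def filter_connectivities(occupancy: Dict[str, float],
                          mcutoff: float,
                          zmcutoff: float) -> Dict[str, float]:
    """Keep candidate NOESY AAIGs that pass both thresholds (assumed logic)."""
    # Step 1: discard candidates whose occupancy rate is below mcutoff.
    passing = {aaig: occ for aaig, occ in occupancy.items() if occ >= mcutoff}
    if len(passing) < 2:
        return passing  # a Z-score is not meaningful for 0 or 1 candidates
    mu = mean(passing.values())
    sigma = pstdev(passing.values())
    if sigma == 0:
        return passing  # all candidates are equally good
    # Step 2: keep candidates whose Z-score reaches zmcutoff.
    return {aaig: occ for aaig, occ in passing.items()
            if (occ - mu) / sigma >= zmcutoff}


# Hypothetical occupancy rates of NOESY AAIGs matched to one TOCSY AAIG.
candidates = {"X101": 1.0, "X57": 0.6, "X12": 0.4, "X88": 0.8}
print(filter_connectivities(candidates, mcutoff=0.5, zmcutoff=0.5))
```

A stricter zmcutoff retains fewer connected AAIGs per node, which keeps the chain-building tree small at the cost of possibly discarding the correct connectivity.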
To run the full protein assignment for nEIt, type:
4Dchains.py -protocol nEIt_protocol.txt
Of all the output files, the most important is "4DNOESY_assignedall.proofread.xeasy", which will be passed as the input "chemical shift file" to autoNOE-Rosetta for protein structure prediction, as described below.

FLYA resonance assignment
Automated resonance assignments with FLYA were performed using the FLYA.cya script shown below.

autoNOE-Rosetta
The following series of commands can be followed to set up and analyze Rosetta calculations exactly as was done for this benchmark.

Preparing files
Three initial files are required: a fasta file, chemical shift file (i.e. myShifts.xeasy), and NOE peak list(s).
From these, the user may convert to the required formats using tools provided in the CS-Rosetta Toolbox.
The .prot file will be used in the autoNOE protocol for NOE peak matching, while the .tab format is required by the TALOS-N program, included with the CS-Rosetta Toolbox, which is used to assess trimming of the termini as well as for fragment picking.

Trim flexible ends
talosn -in [myShifts.tab]
This output helps the user assess which residues are flexible and might be trimmed from the termini.
Should the user decide to trim any residues, the steps above may be repeated to update the .prot and .tab files. The fasta file should also be trimmed to reflect the adjusted sequence.

where -skip indicates how many lines come before the first NOE peak in the original file, -cols specifies which columns are relevant, and -names indicates the names of the aforementioned columns. For instance, here we designate columns 2, 3, 4, 5 as the chemical shifts pertaining to h, c, N, and H, and column 8 as the intensity of the peak. The flag -tol specifies the tolerance of each chemical shift column, respectively. Each peak list may be converted in this way, and multiple peak lists may be used with the autoNOE protocol.

Select 3-mer and 9-mer backbone fragments
Fragments are picked using the chemical shift information contained in the .tab file. Fragments used in this benchmark were picked using the optional flag -nohom, which excludes fragments derived from homologous structures. The output will contain a 3-mer and 9-mer fragment file, each ending in ".dat.gz".

Setting up directories and flag files
There are two commands used for setting up the directories and additional files needed for an autoNOE run: setup_target, and setup_autoNOE. The first command creates a directory with all the input files systematically organized, while the latter creates the run directory and automatically generates the flag files required for the Rosetta calculations and automatic assignment of NOE peaks.

Initializing NOE assignments
Before the autoNOE protocol is started, initial NOE assignments are created using the command "source initialize_assignments_phaseI.sh" from within the run sub-directory of each restraint-weight directory. This will create four peak assignment files: noe_auto_assign.cst.centroid, noe_auto_assign.cst.filter.centroid, noe_auto_assign.cst, and noe_auto_assign.cst.filter. To initialize the assignments of the restraint-weight directories, the following commands may be used:
cd cst_XX/run
source initialize_assignments_phaseI.sh

Start run
Calculations are started from within each run sub-directory. It is recommended to run calculations using at least the four default restraint weights to assess which works best for a given target. For instance, depending on the quality or confidence of the initial input, one may see better results with either harsh or more lenient weights. For this benchmark, we tested all default weights (plus weight 100 for nEIt), and found that higher restraint weights, namely 25, 50 and 100, worked best given the high quality of the automatically assigned chemical shifts. The command used to begin the Rosetta calculations depends on the parallelization environment used. An example using the OpenGrid Sun Grid Engine parallel environment is as follows:
qsub -pe orte 100 myJobScript.production
where 100 is the number of cores to be used in parallel for this run.

Processing and analyzing a run
Once all runs for a given target have completed (i.e. all restraint-weight runs have completed), a quick check for the best performing run can be performed using the following command from within autoNOE/myTarget/:
autoNOE_select_final_run
Further documentation of autoNOE_select_final_run may be found at http://csrosetta.chemistry.ucsc.edu.
A more detailed analysis can be performed using a variety of tools offered with the CS-Rosetta Toolbox.

Select the 10 lowest scoring structures
Below is an example of rescoring decoys from within the fullatom_pool sub-directory:
score_jd2.macosclangrelease -in:file:silent decoys.out -evaluation:rmsd_target myKnownStructure.pdb -out:file:silent rescored_decoys.out
To extract the ten lowest-scoring decoys based on the total Rosetta energy scores:
extract_decoys rescored_decoys.out -formula 'score-atom_pair_constraint-rdc' -N 10 -verbose 0 > rescored_low_10.out
To pack the ten lowest-energy structures into a PDB bundle:
Final static NOE assignments and statistics can be generated with the following two commands:
source final_assignments.sh ../fullatom_pool/rescored_low_10.out
noeout2txt -peaks final_assignment/NOE_final.out -split_level 0
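The selection performed by the -formula expression above can be illustrated in a few lines of Python. This is a sketch of the ranking idea only, not a replacement for extract_decoys: the function lowest_by_formula is hypothetical, the score values are made up, and the formula is assumed to mean the total score with the atom_pair_constraint and rdc terms subtracted.

```python
from typing import Dict, List


def lowest_by_formula(decoys: List[Dict], n: int = 10) -> List[Dict]:
    """Rank decoys by score - atom_pair_constraint - rdc and keep the n lowest."""
    def adjusted(decoy: Dict) -> float:
        # Missing terms are treated as zero (assumption for this sketch).
        return (decoy["score"]
                - decoy.get("atom_pair_constraint", 0.0)
                - decoy.get("rdc", 0.0))
    return sorted(decoys, key=adjusted)[:n]


# Made-up score table with one entry per decoy.
decoys = [
    {"description": "decoy_a", "score": -100.0, "atom_pair_constraint": 20.0, "rdc": 0.0},
    {"description": "decoy_b", "score": -110.0, "atom_pair_constraint": 5.0, "rdc": 0.0},
    {"description": "decoy_c", "score": -90.0, "atom_pair_constraint": 40.0, "rdc": 0.0},
]
for decoy in lowest_by_formula(decoys, n=2):
    print(decoy["description"])
```

Subtracting the restraint terms ranks decoys by their Rosetta energy alone, so that a structure is not favored or penalized in the final selection merely for how its restraints score.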

Converting to NEF for data deposition
After the structure calculations are complete and the results are ready to be deposited in a database, the NMR restraint data used for the structure calculation can be converted to NEF format and uploaded to the database using the following commands: