Pathological macromolecular crystallographic data affected by twinning, partial-disorder and exhibiting multiple lattices for testing of data processing and refinement tools

Twinning is a crystal growth anomaly that has posed a challenge in macromolecular crystallography (MX) since the earliest days. Many approaches have been used to treat twinned data in order to extract structural information. However, in most cases it is simpler to rescreen for new crystallization conditions that yield an untwinned crystal form or, if possible, to collect data from non-twinned parts of the crystal. Here, we report 11 structures of engineered variants of the E. coli enzyme N-acetyl-neuraminic acid lyase which, despite twinning and incommensurate modulation, have been successfully indexed, solved and deposited. These structures span a resolution range of 1.45–2.30 Å, which is unusually high for datasets presenting such lattice disorders in MX; these data therefore provide an excellent test set for improving and challenging MX data processing programs.

Twinning is a crystal growth anomaly or lattice disorder in which the crystal is composed of separate domains of differing orientations 1. Twinning has posed a challenge in macromolecular crystallography since the earliest days 2,3 and multiple computational approaches have been developed to treat twinned data and extract structural information. Several exhaustive reviews discuss twinning and the methods to address it in detail 1,4-7; nevertheless, for clarity, we give here a brief description of this phenomenon. Twinning is characterised by the twin law (a set of symmetry operators that relate the different orientations of the domains) and by the twin fractions, αᵢ, that characterise the relative volumes of the twin domains. There are several types of twinning: merohedral twinning (when the twin operators are a subset of the exact rotational symmetry of the lattice); pseudo-merohedral twinning (when the twin operators approximate the rotational symmetry of the lattice); and non-merohedral or epitaxial twinning (when the twin operators have the rotational symmetry of a sublattice in three or fewer dimensions). In this paper we present examples of pseudo-merohedral twinning.
When there are two twin domains and the twin operator is a 2-fold rotation, the twinning is called hemihedral. When the twin domains are sufficiently large, the waves diffracted from these domains do not interfere (or the interference is negligible, depending on the coherence radius of the beam and the twin domain sizes) and the observed intensities are simply the weighted sum of the intensities from each of the individual domains 8. If the twin fraction α approaches 0.5, the diffraction pattern acquires an additional symmetry, imposed by the twinning operator, which may lead to erroneous indexing in a higher symmetry space group. If the twin fraction equals 0.5, the crystal is perfectly twinned and the intensity measurements cannot be deconvoluted. If the twin fraction is <0.5, it is possible to deconvolute the data in order to recover the untwinned intensities 1. However, the errors in the deconvoluted intensities are amplified roughly as 1/(1 − 2α) and can become a very large fraction of the intensities as the twin fraction approaches 0.5. Twinning can thus hamper crystal structure determination at all stages, from indexing to data reduction, phase determination and refinement.
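The weighted-sum model and its inversion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by any of the programs cited here: the function names are hypothetical, the two measurements are assumed to carry equal and independent errors, and the intensities in the usage example are invented.

```python
import math

def twin_pair(i1, i2, alpha):
    """Observed intensities of two twin-related reflections in a hemihedral twin."""
    return (1 - alpha) * i1 + alpha * i2, alpha * i1 + (1 - alpha) * i2

def detwin_pair(j1, j2, alpha):
    """Invert the weighted sum to recover the untwinned intensities."""
    d = 1 - 2 * alpha
    if abs(d) < 1e-9:
        # Perfect twin: both observations are identical, so no inversion exists.
        raise ValueError("perfect twin: intensities cannot be deconvoluted")
    return ((1 - alpha) * j1 - alpha * j2) / d, ((1 - alpha) * j2 - alpha * j1) / d

def error_amplification(alpha):
    """Growth factor of sigma after detwinning (equal, independent sigmas assumed)."""
    return math.sqrt((1 - alpha) ** 2 + alpha ** 2) / abs(1 - 2 * alpha)

if __name__ == "__main__":
    j1, j2 = twin_pair(100.0, 20.0, 0.3)   # invented intensities
    print(detwin_pair(j1, j2, 0.3))
    for alpha in (0.1, 0.3, 0.45, 0.497):
        print(alpha, round(error_amplification(alpha), 1))
```

The amplification factor diverges as α approaches 0.5, which is why detwinning becomes impractical for nearly perfect twins and refinement against the twinned intensities is preferred.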
Since the intensities of twin-related reflections are correlated, twinning reduces the information content of the data. In the limiting case of perfect merohedral twinning that reduction is equivalent to a reduction in the resolution limit by a factor of 1.26 (the cube root of 2, because the number of independent observations is halved and the number of reflections scales with the inverse cube of the resolution limit). An additional complication is that the statistical properties of the data from twinned and untwinned crystals are different and therefore overall statistics describing model quality, such as R factor and R free, must be interpreted with extra care 9. In particular, the gap between the R factor and R free values, as well as their individual values, needs to be monitored during refinement. If refinement using the twin option leads to an increase of the gap between R factor and R free, this indicates a serious problem with the refinement protocol and data handling.
Another type of deviation from perfect periodicity in a crystal is crystal modulation, in which the content of the asymmetric unit is not perfectly replicated by the lattice operations and which can occur with a period commensurate or incommensurate with the lattice periodicity. As a result of crystal modulation, primary Bragg reflections are flanked by off-lattice satellite reflections 10. The direction and magnitude of the offset of such satellite reflections are described by an additional vector q, which needs to be added to the reciprocal space vector H to define a 4-dimensional reciprocal space vector. Although incommensurate crystals have rarely been reported in macromolecular crystallography 11,12, the EVAL software suite can index and process such data 10,13, and in silico simulations of modulated structures have been performed 14.
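As an illustration of the H + mq description above, the sketch below lists the satellite positions around a main reflection and applies a crude check of whether q is close to a simple fraction of the reciprocal lattice. The function names, the denominator limit, the tolerance and the q vector in the usage example are all hypothetical; they are not taken from the datasets described here.

```python
from fractions import Fraction

def satellite_positions(hkl, q, max_order=1):
    """Fractional reciprocal-space positions H + m*q for m = -max_order..max_order, m != 0."""
    return {
        m: tuple(h + m * qi for h, qi in zip(hkl, q))
        for m in range(-max_order, max_order + 1)
        if m != 0
    }

def looks_commensurate(q, max_denominator=8, tol=1e-3):
    """True if every component of q lies within tol of a fraction with a small denominator."""
    return all(
        abs(float(Fraction(qi).limit_denominator(max_denominator)) - qi) <= tol
        for qi in q
    )

if __name__ == "__main__":
    q = (0.0, 0.0, 0.23)                       # hypothetical modulation vector
    print(satellite_positions((1, 2, 3), q))   # first-order satellites of reflection (1, 2, 3)
    print(looks_commensurate(q))
```

In practice the distinction between commensurate and incommensurate modulation depends on the precision with which q is determined, so the tolerance used here is only a placeholder.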
In this report we present 11 diffraction data sets, in multiple space groups, from the E. coli enzyme N-acetyl-neuraminic acid lyase (NAL), which present twin lattices and incommensurate modulation. NAL is a tetramer in solution that crystallises in low salt conditions 15 to give four different crystal forms, three in space group P2₁ and one in P2₁2₁2₁ 15-17. Interestingly, the three crystal forms in space group P2₁ were not related to each other; two of them were twinned and shared the same twinning operator, which made the monoclinic cells pseudo-orthorhombic.
Two of the crystal forms are reported here for the first time, and some were pseudo-merohedrally twinned with the additional complication of incommensurate modulation. Although they could all be solved by molecular replacement, they could not be refined satisfactorily using standard protocols. However, with improvements in REFMAC5, one of the software packages for macromolecular structure refinement available from the CCP4 suite 7.0 18, to which the presented test cases contributed directly, we were able to refine models satisfactorily against all 11 datasets. Because of the varied diffraction data pathologies (pseudo-merohedral twinning with α up to 0.497 as well as crystal modulation), we believe these data form a useful test set for the development of macromolecular crystallographic data processing and structure refinement software, and we have therefore made them available to the community through the public repository Zenodo (public links in Data Records).
The diffraction patterns occasionally showed spot splitting in all four crystal forms of NAL and it was not possible to predict the successful indexing and scaling outcome based on the observed diffraction quality alone (Fig. 1). The main and satellite reflections are clearly distinct and the main lattice could be indexed separately, while satellite reflections were ignored by MOSFLM 19 (Fig. 1).
Closer inspection of the diffraction pattern of four of the seven datasets in crystal form I with the DIALS viewer 20 revealed two lattices but also some extra reflections, which did not belong to either lattice (Fig. S2). MOSFLM successfully indexed the main lattice in all cases (Table 1), but we decided to investigate further whether these extra reflections in the diffraction pattern could be caused by crystal modulation, as they appeared to occur in a periodic manner.
All the datasets in crystal form I were therefore indexed with Dirax 21 to determine whether incommensurate modulation was present. This was indeed the case for four of the seven datasets, three of which were deposited in the PDB (2WNN, 2WNQ, 2W05), whilst one, called Y137A, was not, owing to unsatisfactory statistics. In those cases, reflections could be indexed and assigned either to the main lattice or to the satellite reflections with order m = −1 or 1 (see 2WNN as an example in Fig. 2). No evidence of splitting of the main lattice was found, implying that the pseudo-merohedrally twinned lattices overlap almost exactly. The data were processed with Eval 10 and scaled with SADABS 22 in 2/m point group symmetry. The resulting statistics are shown in Table 2.
All the modulated structures also appeared to be partially twinned (Tables 1 and 2). We speculate that the lack of modulation in 2WNZ and 4BWL is due to the larger unit cell axis a, which is large enough not to be incommensurate. With the P2₁ indexing choice, POINTLESS initially assigned the space group C222₁, but reflections belonging to one of the 2-fold axes were much stronger than the others (data not shown), which is consistent with pseudo-merohedral twinning in P2₁, and indeed with this choice the structures could be easily solved.
However, in all the crystal forms, space group attribution was difficult or sometimes impossible and the choice of the point group was made based on the R meas values 23 . Weak molecular replacement solutions could also be obtained in multiple space groups. As a general rule, whenever only a single lattice with no incommensurate modulation was present, indexing, data reduction and molecular replacement were possible, but the (non-twin) refinement stalled at R factor and R free values of 30-35% for all datasets (resolution range 1.45-2.3 Å, <I/σ(I)> cut off = 2.0) where we would expect R factor values near or below 20% for well-behaved refinements.
Twinning analysis. The H and L twinning tests, as implemented in TRUNCATE 18, were used as diagnostic tools for twinning. In our experience the L-test prediction was more consistent with the estimates of the twin fraction performed internally in REFMAC. This is probably because, although both tests are affected by experimental errors and lack discriminating power if one of the NCS axes is parallel to the twin axis, the H-test additionally requires the data to be merged in the correct point group and, even then, in the presence of NCS it may appear to indicate partial twinning for data from a single crystal. The L-test is free from these two issues. For these reasons only the L-test is reported for the presented datasets (Fig. 3).
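For readers unfamiliar with the L-test, the following sketch shows the statistic on simulated data. It is not the TRUNCATE implementation: the intensities are drawn from an idealised acentric Wilson (exponential) distribution, the paired reflections are assumed to be independent and unrelated by the twin law, and the function names are hypothetical. Under these assumptions the mean of |L| is expected to be close to 0.5 for untwinned data and close to 0.375 for a perfect twin.

```python
import random

def mean_abs_l(intensity_pairs):
    """<|L|> with L = (I1 - I2) / (I1 + I2) over pairs of neighbouring reflections."""
    values = [abs(i1 - i2) / (i1 + i2) for i1, i2 in intensity_pairs if i1 + i2 > 0]
    return sum(values) / len(values)

def simulate_pairs(n, alpha, rng):
    """Simulate n reflection pairs, each reflection twinned with fraction alpha."""
    pairs = []
    for _ in range(n):
        # a1, b1 are the true intensities; a2, b2 are their (independent) twin mates.
        a1, a2, b1, b2 = (rng.expovariate(1.0) for _ in range(4))
        pairs.append(((1 - alpha) * a1 + alpha * a2,
                      (1 - alpha) * b1 + alpha * b2))
    return pairs

if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (0.0, 0.25, 0.5):
        print(alpha, round(mean_abs_l(simulate_pairs(50000, alpha, rng)), 3))
```

Real data deviate from these idealised values because of measurement errors, anisotropy and pseudo-translation, so the simulation is only a guide to how the statistic responds to the twin fraction.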
Micro-seeding techniques were employed in an attempt to avoid twinning by growing larger single crystals 24 . However, twinning persisted, suggesting that it was likely to be a nucleation phenomenon, which was perpetuated when twinned seed crystals were used as nuclei. Diffraction data were collected at 100 K following flash cooling of crystals in cryo-protectants, which could have been a source of lattice disorder. Data collection at room temperature from multiple crystals, however, also showed both split diffraction and significant twinning (data not shown), indicating that the disorder pre-existed in the crystals. Ligand soaking experiments were similarly excluded as a cause of the twinning.
Refinement with the program REFMAC (versions 5.6 and 5.7) identified the twinning operator (−h, −k, h + l) in all the cases in which twinning was detected. Twin refinement resulted in improved models with R factor and R free values of ~18-20% (Table S2; data collection and final refinement statistics are summarised in Table 1). This improvement of the R factor quality indices was accompanied by local improvements of the electron density maps, which became better defined and showed increased connectivity (Fig. 4). The best refined model for each crystal form was validated using ZANUDA 18, which confirmed the space group assignment in all cases by transforming the individual space group into the lower symmetry space groups, followed by refinement of the corresponding models using REFMAC and selection of the model with the highest symmetry from the ones with the best refinement statistics.
Table 1. Improvement of refinement statistics upon applying the twin option in REFMAC. *As defined in Nespolo et al. 35. **THB refers to the competitive inhibitor (2R,3R)-2,3,4-trihydroxy-N,N-dipropylbutanamide, as reported in Campeotto et al. 16. ***Refinement statistics were not of sufficient quality for model and data deposition, although data analysis was beneficial for the discussion presented here and the raw images were deposited in the public Zenodo database.
Crystal packing analysis. Zanuda was used to expand the final refined models into space group P1 in order to compare packing in the different crystal forms. Inspection of the packing using the molecular graphics program COOT 25 highlighted that not only the inter-monomer contacts within the NAL tetramers but also the inter-tetramer contacts in the crystal lattice differed between crystal forms (Fig. 5). We speculate that the likelihood of NAL crystallising in any one of the four forms is determined by small differences in the interfaces between tetramers during nucleation and the early stages of crystal growth. This process is kinetically and thermodynamically difficult to control, and attempts to select for a specific crystal form were hindered by the fact that all four forms were obtained in the same crystallisation drops and therefore from identical crystallisation conditions. Surface accessible areas and free energies of interaction were calculated using PISA (Table 2) 26. These did not show any significant differences in the strength of intra-tetramer interactions between the different crystal forms, consistent with our observation that all four crystal forms appeared in the same crystallisation drops.
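Returning to the twin operator (−h, −k, h + l) reported in the twin-refinement results, the short sketch below writes it as an integer matrix acting on Miller indices and checks the two properties expected of a 2-fold twin operation: applying it twice restores the original indices, and its determinant is +1. The matrix layout and function names are illustrative assumptions, not the REFMAC representation.

```python
# Rows give h', k', l' as linear combinations of (h, k, l):
# h' = -h, k' = -k, l' = h + l
TWIN_OP = ((-1, 0, 0),
           (0, -1, 0),
           (1, 0, 1))

def apply_op(op, hkl):
    """Apply an index transformation matrix to a Miller index triple."""
    return tuple(sum(r * x for r, x in zip(row, hkl)) for row in op)

def det3(m):
    """Determinant of a 3x3 matrix given as nested tuples."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

if __name__ == "__main__":
    hkl = (1, 2, 3)
    mate = apply_op(TWIN_OP, hkl)
    print(hkl, "->", mate, "->", apply_op(TWIN_OP, mate))
    print("determinant:", det3(TWIN_OP))
```

Index triples of the form (0, 0, l) are left unchanged by this matrix, which suggests that the twin axis lies along c*.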
Future developments. The presented datasets were the result of extensive screening at the data collection stage and of extensive processing at the data reduction and refinement stages, with a very low success rate (Fig. S3). The development of twin refinement in REFMAC, which at the time was only implemented in the experimental version of the program, allowed the determination of several apo and ligand-bound structures of NAL and the proposal of the first detailed mechanism of the enzyme reaction 15,16. Although twin refinement is currently included in REFMAC, the presented datasets are still a challenging test for current indexing and scaling programs, including iMosflm, LABELIT 27 and XDS 28, and they therefore offer an excellent opportunity for the development of this software. Several improvements in the way MX software deals with pathological data are still very desirable. These include robust diagnostics and warning messages, automated space group assignment in at least the obvious cases of twinning and, importantly, robust integration of partially overlapping reflections and communication of all the necessary data and metadata to a refinement program. Crystal modulation was detected only after structure deposition and, although this had no effect on data processing in the presented cases, its diagnosis should be implemented to avoid reflection overlaps, which in severe cases can seriously hamper indexing, data reduction and ultimately phasing and satisfactory refinement.

Methods
Data collection and structure solution. We have previously reported several structures of wild type NAL and engineered variants, and NAL crystals were obtained as previously described 15,16. NAL crystals are plate-shaped and tend to grow in clusters; micro-seeding experiments were therefore required to obtain single large crystals. Crystal cryo-protection was achieved by serial transfer of the crystals through mother liquor containing 20% and then 25% v/v PEG 400, with a 2 minute soak time at each step. Eleven datasets were collected from single crystals at Diamond Light Source (beamlines I02, I03 and I04) at 100 K, with a 1 s exposure and an oscillation of 0.5° per image, using a Q315 ADSC CCD detector. Data were processed using iMOSFLM and scaled and merged using SCALA 29.
In the case of the datasets of crystal form I, diffraction patterns were inspected with DIALS 20 for the presence of satellite reflections, indexed with Dirax 22 and processed with EVAL15 10. Scaling was performed with SADABS in 2/m point group symmetry. The results are shown in Table 2. For structure refinement only the main lattice reflections from MOSFLM were used, ignoring the weak satellite reflections.
In each dataset five percent of the reflections were excluded from the refinement and constituted the R free set. A new R free set was generated randomly for each new crystal form and then transferred to all datasets belonging to the same crystal form.
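The free-set bookkeeping described above can be sketched as follows. This is an illustrative outline only, not the tool used for the deposited data: the function names, the seeds and the policy of flagging reflections that are absent from the reference dataset at random are assumptions.

```python
import random

def assign_free_flags(hkls, fraction=0.05, seed=0):
    """Randomly flag a fixed fraction of the unique reflections as the free set."""
    unique = sorted(set(hkls))
    rng = random.Random(seed)
    free = set(rng.sample(unique, round(fraction * len(unique))))
    return {hkl: hkl in free for hkl in unique}

def transfer_free_flags(reference_flags, hkls, fraction=0.05, seed=0):
    """Reuse flags from a reference dataset of the same crystal form.

    Reflections missing from the reference are flagged at random (an assumption).
    """
    rng = random.Random(seed)
    return {
        hkl: reference_flags[hkl] if hkl in reference_flags else rng.random() < fraction
        for hkl in hkls
    }

if __name__ == "__main__":
    first = [(h, k, l) for h in range(10) for k in range(10) for l in range(10)]
    flags = assign_free_flags(first, seed=42)
    second = first[:500] + [(20, 0, 0), (21, 0, 0)]   # a dataset with a different extent
    moved = transfer_free_flags(flags, second)
    print(sum(flags.values()), "free reflections in the first dataset")
    print(sum(moved.values()), "free reflections in the second dataset")
```

Keeping the same flags across datasets of one crystal form prevents reflections that were used in refinement against one dataset from entering the free set of another.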
The first crystal structure obtained for each crystal form was solved by molecular replacement using PHASER 30 and 1NAL as a starting model 31, while refinement against other datasets of the same crystal form started with 20 cycles of rigid body refinement (resolution range 10.0-6.0 Å) followed by 10 cycles of preliminary restrained refinement (whole resolution range) in REFMAC5.
Refinement and crystal packing analysis. Refinement was performed using REFMAC 5.6 or 5.7 (i.e. the latest version at the time of deposition or final refinement for each structure), with and without twin refinement, both for electron density calculations and for evaluation of statistics. Refinement was performed with the same settings for all reported structures, i.e. 20 cycles per run (using the whole resolution range of the data), a weight matrix of 0.1 32 and riding hydrogen atoms.
For all structures involved, regardless of whether the unit cell parameters allowed for twinning by merohedry or not, the refinement protocol was identical and included twin refinement in the final refinement rounds 32. If no twinning operations are present, the twin refinement option means that REFMAC uses an approximation to the likelihood target rather than its exact version. Such usage therefore only makes sense for comparison of refinement results for twinned and untwinned crystals. R factor and R free values were compared before and after twin refinement. The values of the obliquity angle, which is a measure of pseudo-symmetry, were monitored and manual inspection of the diffraction patterns was performed with ADXV 33.
Obliquity is a measure of the overlap of the lattices of the individuals forming a twin, and Friedel provided a formal mathematical description of it in the early days of crystallography 34. Briefly, the closer the obliquity angle is to zero, the more likely is the presence of merohedral twinning 35, as the two twin lattices tend to overlap. Values of obliquity close to zero are, however, only a possible indicator that twinning may be present and not a fixed rule, as some of the presented datasets highlight. For instance, in the case of crystal forms I and III, the obliquity angle is small enough to allow twinning in some cases, whilst in crystal form II it is too large for twinning to occur (Table 1).
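A minimal sketch of an obliquity calculation is given below, taking the obliquity as the angle between a direct-lattice direction [uvw] (the candidate twin axis) and the normal to a lattice plane (hkl). Several points are assumptions rather than statements from this work: the choice of direction [102] and plane (001) is our reading of the invariant direction and plane of the index transformation (−h, −k, h + l), the cell parameters in the usage example are hypothetical and not taken from Table 1, and the function names are illustrative.

```python
import math

def cell_vectors(a, b, c, alpha, beta, gamma):
    """Cartesian basis vectors of a unit cell (lengths in Å, angles in degrees)."""
    al, be, ga = (math.radians(x) for x in (alpha, beta, gamma))
    va = (a, 0.0, 0.0)
    vb = (b * math.cos(ga), b * math.sin(ga), 0.0)
    cx = c * math.cos(be)
    cy = c * (math.cos(al) - math.cos(be) * math.cos(ga)) / math.sin(ga)
    cz = math.sqrt(max(c * c - cx * cx - cy * cy, 0.0))
    return va, vb, (cx, cy, cz)

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def combine(coeffs, vectors):
    """Linear combination of three Cartesian vectors."""
    return tuple(sum(k * vec[i] for k, vec in zip(coeffs, vectors)) for i in range(3))

def obliquity(cell, uvw, hkl):
    """Angle in degrees between direct direction [uvw] and the normal to plane (hkl)."""
    va, vb, vc = cell_vectors(*cell)
    # Reciprocal axes up to a common 1/V factor, which does not affect the angle.
    recip = (cross(vb, vc), cross(vc, va), cross(va, vb))
    direct = combine(uvw, (va, vb, vc))
    normal = combine(hkl, recip)
    cos_w = abs(dot(direct, normal)) / math.sqrt(dot(direct, direct) * dot(normal, normal))
    return math.degrees(math.acos(min(1.0, cos_w)))

if __name__ == "__main__":
    # Hypothetical monoclinic cells; the first satisfies cos(beta) = -a / (2c) exactly.
    exact_beta = math.degrees(math.acos(-0.25))
    for beta in (exact_beta, 105.0, 90.0):
        cell = (50.0, 70.0, 100.0, 90.0, beta, 90.0)
        print(round(beta, 2), round(obliquity(cell, (1, 0, 2), (0, 0, 1)), 3))
```

Under these assumptions the obliquity vanishes when cos β = −a/(2c), which is the condition for the monoclinic cell to be exactly describable as a C-centred orthorhombic cell with axes a, b and a + 2c.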
Manual model building was performed in COOT. Zanuda was used to expand the model of each crystal form into P1, and the expanded models were refined against the data processed in P1 in order to confirm the correctness of the space group assignment in each case.
In order to assess how the four crystal forms of NAL were related to each other, Csymmatch from CCP4 was used to bring all the P1-expanded structures to the same origin and NCONTACT 18 was used to calculate inter-tetramer contacts. The input file from NCONTACT was used in PYMOL to visualise the contact surface between monomers. Surface accessibility areas and crystal contact energy were calculated using PISA 26 (Table S3).

Data Records
The datasets (raw diffraction images) discussed in this manuscript have been deposited in the publicly available database Zenodo at https://doi.org/10.5281/zenodo.54568 and https://doi.org/10.5281/zenodo.1240503. Structural models and processed structure factor data deposited in the PDB are available under the accession codes given in Table 1, with the exception of dataset Y137A, as the R factor indices were not satisfactory for PDB deposition.