An automated data processing and analysis pipeline for transmembrane proteins in detergent solutions.

The application of small angle X-ray scattering (SAXS) to the structural characterization of transmembrane proteins (MPs) in detergent solutions has become a routine procedure at synchrotron BioSAXS beamlines around the world. SAXS provides overall parameters and low resolution shapes of solubilized MPs, but is also meaningfully employed in hybrid modeling procedures that combine scattering data with information provided by high-resolution techniques (eg. macromolecular crystallography, nuclear magnetic resonance and cryo-electron microscopy). Structural modeling of MPs from SAXS data is non-trivial, and the necessary computational procedures require further formalization and facilitation. We propose an automated pipeline integrated with the laboratory-information management system ISPyB, aimed at preliminary SAXS analysis and the first-step reconstruction of MPs in detergent solutions, in order to streamline high-throughput studies, especially at synchrotron beamlines. The pipeline queries an ISPyB database for available a priori information via dedicated services, estimates model-free SAXS parameters and generates preliminary models utilizing either ab initio, high-resolution-based, or mixed/hybrid methods. The results of the automated analysis can be inspected online using the standard ISPyB interface and the estimated modeling parameters may be utilized for further in-depth modeling beyond the pipeline. Examples of the pipeline results for the modelling of the tetrameric alpha-helical membrane channel Aquaporin0 and mechanosensitive channel T2, solubilized by n-Dodecyl β-D-maltoside are presented. We demonstrate how increasing the amount of a priori information improves model resolution and enables deeper insights into the molecular structure of protein-detergent complexes.

www.nature.com/scientificreports www.nature.com/scientificreports/ available in the ATSAS package 34 and, in addition, calls the program MEMPROT 28 . The primary data analysis is executed directly upon completion of a SEC-SAXS data acquisition and proceeds in the following workflow ( Fig. 1): (1) Radial integration of 2D SAXS images by RADAVER 33 (typically 1000-3000 frames are collected for a standard SEC-SAXS run set with one second exposure per image/frame). (2) Determination of the buffer and sample ranges from the elution profile using CHROMIXS 35 , an important analysis component of the pipeline specifically designed for SEC-SAXS analysis. CHROMIXS automatically identifies both buffer and sample regions in the elution profile, and performs frame averaging and subtraction to produce the SAXS curve of the purified component for subsequent analysis (Fig. 2). It is worth noting that CHROMIXS is capable of identifying multiple elution peaks, providing the users with a set of distinct SAXS curves from individual fractions, if present. Here we consider the simple case, where a single peak corresponding to PDC is identified automatically by CHROMIXS. (3) Assessment of the overall quality of the SEC-SAXS data and useability of the sample. Reliable SEC-SAXS data from a PDC suitable for further analysis should fullfil several requirements: i) the main peak in the CHROMIXS elution profile should correspond to a monodisperse fraction only, yielding similar scattering profiles across the entire sample region of the chromatogram; ii) the final buffer subtracted 1D SAXS profile from the PDC must satisfy the standard Guinier criteria (eg. 0.5 < sR g < 1.3) 36,37 . These important points have been integrated into a single figure-of-merit parameter K: Here c v aver is the coefficient of variation ( = σ µ c v , where µ is the mean value and σ is the standard deviation) of the R g value derived from the curve generated by CHROMIXS (averaged and buffer subtracted) employing Guinier approximation; <χ 2 > denotes the mean square weighted deviation of the frames across the chromotographic peak with respect to the peak frame with maximum integral intensity: where n is the number of points in the frame and m is the number of frames in the peak. I p (s) and σ p (s) denote the intensity and standard deviation of the frame with maximal integral intensity, and I l (s) and σ l (s) are intensities and standard deviations of all the other frames across the candidate monodisperse region; c l is the scaling coefficient; = π θ λ s 4 sin (where θ is the scattering angle and λ is the wavelength) is the scattering vector. For the statistically identical scattering profiles, the <χ 2 > value tends to unity, whereas the c v aver value for the good-quality averaged SAXS profiles tends to zero. The K value integrates the two contributions into a single criterion to characterize the overall quality of a SEC-SAXS dataset. In practice, K values < 10% indicate reliable data quality; K ≥ 30% indicates that the data is of low quality and should be treated with caution. AMPP generates a corresponding notification and/or warning message based on this analysis to aid the user in data interpretation. In the case of the elution of multiple peaks, for example those corresponding to the PDC and also that of empty detergent micelles, a K value is determined for each SAXS profile generated. www.nature.com/scientificreports www.nature.com/scientificreports/ (4) Calculation of the overall parameters (radius of gyration R g , maximum size D max , excluded volume V Porod , molecular weight MW) and the generation of relevant plots (Guinier plot: ln I(s) vs s 2 ; Kratky plot: I(s)•s 2 vs s; and the real-space distance distribution function: P(r) vs r) for all subtracted curves. An example of these plots for the mechanosensitive T2 channel data is shown in the generated summary table (Fig. 3). These parameters are employed for the evaluation of the search volume to conduct the ab initio reconstruction using MONSA or for the hybrid MEMPROT fitting procedures. (5) Querying and collecting the available a priori information from the ISPyB database via dedicated Simple Object Access Protocol (SOAP) web services. Depending on the results, one of the three modeling paths is executed for the subsequent modeling ( Fig. 1).
Modeling Path 1: High resolution model of the protein and chemical formulae of the detergent heads and tails are available. Initial estimates of the length of the detergent tail and head components are done using the Tanford formula for the detergent tail length: l c = 1.5 + 1.265n c , where n c is the total number of carbon atoms. The starting parameters for the evaluation of the corona geometry using MEMPROT 28 are determined as follows: For a spherical (ellipticity, e = 1) detergent torus with the height (a) and cross-sectional axes (minor = b/e, major = b•e) equal to the length of a detergent tail (Tanford tail ), the geometrical parameters are set as: a = b = t = Tanford tail 28 (Fig. 4), where t is the thickness of the hydrophilic detergent head group. Although the hydrophobic tail of a detergent is typically slightly longer than its hydrophilic head, this assumption appears to be a reasonable first approximation for the further refinement.
AMPP employs the MEMPROT minimization procedure, refining the starting parameters within a limited range, estimated empirically based on several tests conducted on different MPs. The a, b and t parameters are searched in the ±1/3(Tanford tail ) range, while ellipticity and rotation parameters are kept fixed in order to speed up the computations. The hydrophobic radius of the model is taken as a + t = 2•Tanford tail , which is in a good agreement with the statistical mean value of transmembrane thickness D Transmemb ≈ 30 ± 3 Å, obtained from the OPM database 38 (https://opm.phar.umich.edu/types/1). The maximum value of D Transmemb is checked during the modelling such that it does not exceed 45 Å.
Modeling Path 2: Electron densities of a protein and chemical formulae of detergent heads/ tails are known, but the high resolution model of the protein is not available. The electron densities can be either provided directly via the ISPyB interface or calculated from the protein primary sequence (in FASTA format) or from the chemical formula using the built-in look-up tables 39,40 . Knowledge of the detergent chemical formula is mandatory as it provides additional information to constrain the geometry of the MP model. When the high resolution models are not available, ab initio low resolution bead modeling is performed www.nature.com/scientificreports www.nature.com/scientificreports/ by MONSA 27 , allowing for three phases with distinct contrasts corresponding to the protein (phase 1), detergent tails (phase 2) and detergent heads (phase 3). For each phase the contrast is calculated and the search space confined to a cylinder and two coaxial spherical hemi-tori (Fig. 5A), thus avoiding contacts between the hydrophobic regions and solvent in the final model. A spherical region is fixed for the protein phase in the vicinity of the origin of coordinates, while the rest of the cylindrical region is free to become either protein or solvent. Penalties for the model discontinuity and looseness are applied during the simulated annealing procedure as described in the original paper 27 . Three additional boundary regions between the phases (Fig. 5A) have a thickness of a few Angstrom allowing MONSA to select the optimum phase assignment during the fitting. The boundary beads (eg. between phases 1 and 2) are allowed to be assigned to either phase 1 or phase 2 (Fig. 5A). The model is sought within a spherical search volume of diameter d = D max , where the protein can be surrounded by two-stacked tori  www.nature.com/scientificreports www.nature.com/scientificreports/ (detergent tails and heads) with the lengths evaluated from the Tanford formula (Fig. 5B). The transmembrane diameter is calculated as: where Length (head) is considered to be the same as the Tanford (tail) , given that for the most common detergent molecules it is typically smaller or equal to that of the tail. The expected values for toroidal volumes and R g values of detergent phases are calculated on the fly from geometrical considerations 41 and serve as additional constrains while modelling. The total computed volume and R g of the particle is then compared to the experimental V Porod and R g , obtained from step 2. If available, information from complementary methods (e.g. phase volumes or phase R g ) can be used to improve the reliability of the final model. Using a spherical 40 beads-per-diameter MONSA grid, together with the above mentioned geometrical restrictions, low resolution models were obtained in a reasonable processing time about 10-15 minutes on a desktop PC (Intel Xeon CPU 3.7 GHz 8 Cores, 12.5 GB RAM). This algorithm is generally applicable to the reconstruction of both transmembrane and membrane-associated proteins. In the case of membrane-associated proteins one can simply shift the fixed "protein" phase to the edge/ surface of the detergent disc and extend the protein/solvent phase cylinder outside the entire detergent disc region.
Modeling Path 3: A priori information from ISPyB is not sufficient to generate the required constraints. In this case neither MONSA nor MEMPROT modeling is employed. The pipeline works in the standard SEC-SAXS mode and the low-resolution model is built using a single-phase approximation (phase corresponding to either PDC or solvent). The pipeline reconstructs low-resolution models from the low-s region (typically s max < 1.0 nm −1 ) of the experimental SAXS data ab initio, minimizing the contribution from the inhomogeneous internal structrure and describing only the overall particle shape. The value of s max for the initial reconstruction is automatically estimated from the first local minimum of the experimental SAXS curve. A standard ab initio modeling procedure is then executed using DAMMIF (as described e.g. in 33,34 ): DAMMIF generates 10 independent models and the normalized spatial discrepancy (NSD) is computed to find the model with the lowest variance in NSD with respect to the other members of the ensemble. This model is then used for both alignment/superposition of the ensemble and for the computation of an average model volume; finally, the pipeline employs this averaged model as input for a final refinement in DAMMIN. Therefore, the output of the pipeline in this regime is a single DAMMIN model that fits the data up to an automatically determined s max value.
We would like to stress that this approach has its limitations and the model represents a low-resolution approximation of the molecule's overall shape and possible high-resolution details on the surface should not be taken as www.nature.com/scientificreports www.nature.com/scientificreports/ experimental evidence. With these caveats, such ab initio modelling still offers a good low resolution estimate of the overall size and shape of the PDCs of interest in the absence of additional information.
(6) As soon as the overall SAXS parameters and MP models have been computed, the pipeline builds a final summary file in XML format, gathering together all the results in a compact and human/machine-readable form (Fig. 3). For the models generated, if the quality of the achieved fit is poor (χ 2 > 2.0), the pipeline flags this with a warning. This provides the user with a measure of the reliability of the on-line model(s) generated.
Upon completion the pipeline compresses the calculated 1D SEC-SAXS profiles into a single HDF5 file and reports them back together with the final model into the ISPyB database. The data from the buffer and sample regions, subtracted curves and overall SAXS parameters for each detected peak in the elution profile are displayed in a convenient tabular form (Fig. 3). It is worth noting that the summary table generated is applicable to all SEC-SAXS runs conducted at the P12 beamline and is not restricted to the measurement and analysis of MPs. For example, it is often the case that the sample exists in solution as a mixture of different states (eg. multiple oligomeric states of a protein or complex), which are separated chromatographically. In this case, AMPP will determine the shapes of individual components in the mixture using the modeling path 3. When chromatographic peaks corresponding to unloaded/empty detergent micelles are analyzed, their overall hollow appearance will be reconstructed by the DAMMIF single-phase modeling procedure, whereas the multi-phase distance-restricted MONSA model will fail to fit the data (see Supplementary Fig. S1). The pipeline results can therefore provide a hint to users who mistakenly measure empty/unloaded micelles at the beamline.
The results generated can be inspected online using the ISPyB web-server 26 . The ISPyB database and web services were extended to accommodate the information on chemical formulae and electron densities of detergents, as well as FASTA amino acid sequences and electron densities of the MPs and buffers. The SEC-SAXS elution profile and the final model can be visualized in the existing ISPyB GUI, as the pipeline output files are compatible with the required format.

Results and Discussion
To test the modelling components of the AMPP, we performed an automatic structural characterization of a tetrameric α-helical membrane channel Aquaporin0 and mechanosensitive channel T2 in DDM solution. The AMPP was utilized in all three pipeline regimes: i) with the high resolution model available ii) with detergent composition and the protein primary structure (FASTA sequence) known iii) plain ab initio modelling (no a priori information). The estimated K-factor (Eq. 1) for the T2 sample was 8.7% <10% indicating a good data quality. The subtracted published SAXS curve was taken for Aquaporin0 and thus no K-factor was determined. The models generated from these three modelling trajectories are displayed in Fig. 6 (as the protein was known to be tetrameric, P4 symmetry was utilized in MONSA and DAMMIF).
Expectedly, the automatically generated Aquaporin0 models display more adequate structural details with increasing amount of a priori information. The single phase based model generated ab initio by DAMMIF (Fig. 6C) provides only the overall shape roughly approximating that of the protein-detergent complex. This reflects the limitation of the ab initio shape determination procedure, which must assume constant density inside the particle. The more information-rich multi-phase MONSA and MEMPROT models (Fig. 6B,A) display a good level of similarity, with the hybrid model generated using the latter program being more detailed and arguably having a higher resolution. For the multi-phase MONSA modeling path the starting cylindrical search volume used for protein-solvent phases was considerably larger than the final model and contains a spherical core with the fixed protein phase (green area shown in Fig. 5B). However, after the minimization procedure the volume occupied by the protein phase becomes only slightly larger than the central fixed spherical volume, and the MONSA models do display protrusions resembling the four α-helical "legs" expected for Aquaporin0. These models are similar to those generated using the hybrid modeling path with MEMPROT, where the protein structure is assumed to be known.
The automatically generated models using the three modeling paths of the pipeline provide good fits to the experimental data at low angles, thus the overall particle shapes coincide and are well described by the data. A numeric evaluation of the proximity of the obtained models was performed using the measure of normalized spatial discrepancy (NSD) 42 , and the NSD comparisons indicate that the models are similar at low resolution ( Table 1). The deviations observed at higher angles (Fig. 7) reflect contributions from the detergents including the complex contrast situation, which is not fully captured by the simplified modeling procedures used. In particular, one of the simplifications we implemented for the MEMPROT modeling was to assume a horizontal detergent torus with a circular geometry, and a fixed ellipticity e = 1. Overall, the models generated by the pipeline do adequately describe the gross appearance of PDCs but caution is of course required regarding the internal structure. Further analysis is recommended in order to obtain more accurate and detailed models, and the initial parameters suggested by the pipeline can serve as a good starting point.
Similar results were obtained from the mechanosensitive channel T2 experimental data (Figs. 7b and 8). There is a high degree of similarity between the three types of models generated (Table 1), and interestingly, MONSA does reconstruct models with a void in the protein phase (green beads in Fig. 8). This void is maintained also when an increased penalty for the bead discontinuity during minimization is introduced. Upon inspection, the high resolution model of a homologous structure, the MscS ion channel of T. tengcongensis (PDB: 3T9N) also displays a void in the extracellular domain of the structure. Thus the multiphase ab initio reconstruction reliably generates features consistent with those expected for the T2 ion channel. The MONSA model of the T2 channel www.nature.com/scientificreports www.nature.com/scientificreports/ protein fits the data reasonably well up to s = 0.25 Å −1 . The on-the-fly generated MEMPROT model provides a satisfactory fit up to s = 0.1 Å −1 , corresponding to a real space resolution 2π/s ~ 60 Å.
It is also worth noting that atomistic models are often incomplete, missing eg. density for flexible loops which could not be resolved at high resolution. In general, to reliably reconstruct a PDC it is recommended to utilize both methods (full ab initio with MONSA, and hybrid with MEMPROT) and compare the results. Thus, the ab initio modelling is an important component of the automated MP modelling procedure, facilitating a direct comparison with the hybrid modelling and giving an insight into the possible structural organization of the PDC in solution.
In summary, an automated SAXS-based data analysis pipeline for multi-contrast systems has been developed and implemented at the P12 beamline of the EMBL (Petra-III, Hamburg). The pipeline facilitates the determination of the overall parameters and automated construction of tentative low-resolution models of solubilized PDCs. Given that the interpretation of the SAXS data from membrane proteins is by far non-trivial task, these results and models are expected to be further refined interactively. However, as demonstrated in the provided examples, AMPP offers at least a good starting point for the subsequent analysis. The relevant data and workflows are fully integrated into the ISPyB data curation system for SAXS. Future development of the pipeline is planned Figure 6. Examples of automatically reconstructed Aquaporin0 PDC models by the AMPP pipeline utilizing MEMPROT (hybrid modeling with known protein model (PDB:2B6P) and unknown detergent belt) (A), MONSA (multi-phase ab initio, the model was built from precalculated searching volume; fourfold symmetry around z-axis is imposed) (B) and DAMMIF (C) (single phase ab initio also with P4 symmetry applied). The first row shows the side view cross-section of the PDC components that is represented with a section inside the detergent corona; protein beads are green for ab initio models and colored according to atom type for the hybrid model, detergent tails and heads are blue and red, respectively. The second and third rows represent the top and the side view of the models. The figure is generated using the PyMol Molecular Graphics System, version 1.7.4. Schrödinger, LLC (www.pymol.org).

Sample
Modeling Software NSD www.nature.com/scientificreports www.nature.com/scientificreports/  . Automatically reconstructed T2 models by the AMPP pipeline utilizing MEMPROT (hybrid modeling: 3T9N pdb model with found shape of detergent corona) (A), MONSA (multi-phase ab initio with P7 symmetry) (B) and DAMMIF (single phase ab initio, P7 symmetry) (C). The first row shows the side view cross-section of the PDC components that is represented with a section inside the detergent corona; protein beads are green for ab initio models and colored according to atom type for the hybrid model, detergent tails and heads are blue and red, respectively. The second and third rows represent the top and the side view of the models. For T2, the K value was calculated from the reduced SEC-SAXS chromatographic data as follows: K = 100% • C v aver •<χ 2 > = 100• (0.2/5.65) • (2.46) = 8.7 < 10% (Eq. 1). K was not determined for Aquaporin0 in the present case, as the published data does not include the chromatographic SEC-SAXS data frames. The figure is generated using the PyMol Molecular Graphics System, version 1.7.4. Schrödinger, LLC (www.pymol.org).