Background & Summary

Serial femtosecond crystallography (SFX) utilises the ultrafast and ultrabright pulses of an X-ray free electron laser (XFEL) to overcome some of the challenges faced in conventional X-ray crystallography for biological structure determination1. Firstly, the ultrabright pulses provide the ability to measure sufficient X-ray diffraction from micrometer and sub-micrometer sized protein crystals2. Secondly, the brightness combined with the ultrafast X-ray pulse duration enables the collection of essentially radiation damage free3 diffraction data at room temperature2. The SFX method further enables structure determination in time-resolved systems where femtosecond time resolution is needed, such as in pump-probe4,5,6, irreversible or mixing experiments7,8. Hence SFX has significant potential as a tool for determining the structure of these challenging classes of biological molecules9.

The European XFEL (EuXFEL)10 is the first high repetition rate XFEL and uses a unique burst mode pulse structure to deliver up to 27000 electron bunches per second which are shared between the different self-amplified spontaneous emission (SASE) undulators11. The SPB/SFX instrument12 is located behind the SASE1 undulator and is capable of recording 3520 X-ray pulses per second with the MHz-capable, Adaptive Gain Integrating Pixel Detector (AGIPD)13. Bursts of X-ray pulses arrive at the instrument in trains of up to 352 pulses, with an intratrain repetition rate of up to 4.5 MHz and an intertrain rate of 10 Hz (enabling diffraction to be recorded at megahertz repetition rates).
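The pulse rates quoted above follow directly from the train structure. The short arithmetic sketch below reproduces them; the variable names are illustrative and are not taken from any EuXFEL software.

```python
# Train structure as quoted in the text above.
pulses_per_train = 352      # maximum number of pulses recorded per train
train_rate_hz = 10          # intertrain repetition rate
intratrain_rate_hz = 4.5e6  # maximum intratrain repetition rate

# Recordable pulses per second: pulses per train times trains per second.
pulses_per_second = pulses_per_train * train_rate_hz

# Spacing between consecutive pulses inside a train, in nanoseconds.
pulse_spacing_ns = 1e9 / intratrain_rate_hz

print(pulses_per_second)           # 3520
print(round(pulse_spacing_ns, 1))  # 222.2
```

At the maximum intratrain rate, consecutive pulses are therefore separated by roughly 222 ns, which is the time available to replace the exposed sample.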

The experimental challenges of increased repetition rate lie particularly in sample delivery and data analysis. SFX relies on illuminating a fresh crystal with each X-ray pulse and hence places a high demand on rapid and consistent sample delivery, typically in a liquid jet14. There is also an open question around the effects of XFEL-induced shockwaves on crystals delivered in a liquid jet15,16. Generating 3520 diffraction images per second (~16 GB s−1) also places significant demand on data analysis. Each measured image needs calibration and classification, followed by the extraction of crystallographic information, which requires a complex workflow. In SFX experiments, typically less than 10% of frames contain crystal diffraction, hence fast and accurate classification is critical for optimising sample preparation, sample delivery and efficient instrument operation.

This paper describes the deposition of an EuXFEL SFX dataset containing 19 million images17, recorded in approximately 1.5 hours by AGIPD, for structure determination of hen egg-white lysozyme (HEWL). HEWL has a well-known structure, is very easy to crystallise and has been used as a model system in many investigations, including at XFELs18. The deposition contains 9 different runs recorded using 4 different jet speeds. Each run has enough data to yield a structure in agreement with the known HEWL structure for all jet speeds. The deposition contains both the raw and calibrated AGIPD data as well as the detector calibration constants used to calibrate the raw data. These data are suitable for algorithm development and testing for detector calibration, image classification and structure determination for use in future SFX experiments.


Sample preparation and delivery

Microcrystals of HEWL of size approximately 2 × 2 × 2 μm were grown using an established protocol18 and transferred to a storage solution of 10% NaCl, 0.1 M sodium acetate buffer with pH 4.0. A 25% (v/v) suspension was prepared and filtered through stainless steel frits with pore sizes of 20 and 10 μm before sample injection.

The filtered solution containing crystals was injected into the XFEL beam by gas dynamic virtual nozzles (GDVN) with helium as the focusing gas. The capillaries connecting the sample and gas reservoirs to the GDVN were each 2 m long and had inner and outer diameters of 100 and 360 μm respectively. The GDVN was 3D printed using a customised computer-aided design based on Design 6 by Knoška et al.19. The nozzle had a liquid orifice diameter of 75 μm, a gas orifice diameter of 60 μm and a distance between the liquid and gas orifices of 75 μm. The production of the GDVN is described in detail by Knoška et al.19.

Datasets were recorded for 4 different jet velocities. The sample delivery parameters are described in Table 1.

Table 1 Description of sample delivery conditions and corresponding run number.

Experimental parameters

This experiment was performed at the SPB/SFX instrument12 at the European XFEL in March 2020. Microcrystals of HEWL in random orientations were illuminated by 9.3 keV X-ray pulses focused to a full-width-at-half-maximum of approximately 3.2 μm (horizontal) × 6.2 μm (vertical) at the interaction point. The AGIPD was located 129 mm downstream of the interaction point and recorded 300 X-ray pulses per train with an intratrain repetition rate of 1.1 MHz. The average pulse energy upstream of the focusing optics was 1.6 mJ. The pulse-resolved X-ray energy is also included in the data deposition.
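For orientation, the photon energy and detector distance given above can be converted into an X-ray wavelength and a Bragg resolution at a chosen detector radius. The following is a minimal sketch using the standard relations λ = hc/E and d = λ / (2 sin θ); the function names and the 100 mm radius are hypothetical illustration values, not parameters from the experiment.

```python
import math

HC_KEV_ANGSTROM = 12.398419843  # Planck constant times speed of light, in keV * Angstrom


def wavelength_angstrom(photon_energy_kev):
    """Convert photon energy (keV) to wavelength (Angstrom)."""
    return HC_KEV_ANGSTROM / photon_energy_kev


def resolution_angstrom(radius_mm, distance_mm, photon_energy_kev):
    """Bragg resolution d = lambda / (2 sin(theta)) at a given detector radius.

    The scattering angle 2*theta is taken from a flat detector placed
    perpendicular to the beam at the given distance.
    """
    two_theta = math.atan2(radius_mm, distance_mm)
    return wavelength_angstrom(photon_energy_kev) / (2.0 * math.sin(two_theta / 2.0))


lam = wavelength_angstrom(9.3)
print(round(lam, 3))  # 1.333

# Hypothetical radius of 100 mm on a detector 129 mm downstream.
d_edge = resolution_angstrom(100.0, 129.0, 9.3)
print(d_edge)
```

A 9.3 keV photon thus has a wavelength of about 1.33 Å. The resolution actually reached depends on the true detector geometry, which should be taken from the deposited data rather than from this sketch.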

An off-axis microscope (Andor Zyla sCMOS with 10× objective) with an effective pixel size of 1.3 μm recorded the interaction between the X-ray beam and the liquid jet at 10 Hz, and these images are included in the data deposition (see Data Records section). The liquid jet was illuminated by the 800 nm SASE1 femtosecond pump-probe laser20. The illumination laser was operated at 10 Hz with each pulse arriving at the interaction point 110 ns after the first X-ray pulse in each train. An example image is shown in Fig. 1. The jet velocity was determined by measuring the distance the exploded part of the jet travelled in a known time. Depending on the jet speed, this time was either the interval between subsequent X-ray pulses in a train or was set by shifting the illuminating laser delay by a known amount21. These measurements were taken between runs and are not part of the data set.
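The velocity estimate described above amounts to dividing a measured displacement by a known time. A minimal sketch follows, assuming the displacement is read off the microscope image in pixels; the function name and the 30-pixel displacement are hypothetical, while the 1.3 μm pixel size and 1.1 MHz repetition rate come from the text.

```python
def jet_velocity_m_per_s(displacement_px, pixel_size_um, elapsed_time_s):
    """Jet velocity from the displacement of the exploded jet region.

    displacement_px : distance travelled in the image, in pixels
    pixel_size_um   : effective pixel size of the microscope, in micrometres
    elapsed_time_s  : known time over which the displacement occurred, in seconds
    """
    distance_m = displacement_px * pixel_size_um * 1e-6
    return distance_m / elapsed_time_s


pixel_size_um = 1.3            # effective pixel size quoted in the text
pulse_spacing_s = 1.0 / 1.1e6  # time between X-ray pulses at 1.1 MHz

# Hypothetical measurement: the gap edge moves 30 pixels between two pulses.
velocity = jet_velocity_m_per_s(30, pixel_size_um, pulse_spacing_s)
print(round(velocity, 1))  # 42.9
```

For slower jets the displacement between consecutive pulses becomes small compared with the pixel size, which is presumably why a longer, known laser delay shift was used in those cases.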

Fig. 1

Example of single crystal diffraction data measured by AGIPD (left). Off-axis microscope image used for monitoring the overlap of the liquid jet and X-ray beam (right). The microscope image was acquired with a single 800 nm wavelength, 65 fs duration laser pulse from the SASE1 pump-probe laser system, 110 ns after the first X-ray pulse in the train.

Detector calibration

The AGIPD consists of 16 modules of x = 128 × y = 512 pixels each. The detector has three gain stages to cover the high dynamic range from one to several thousand photons per pixel. Each pixel has 352 analog memory cells (mc), which can store up to 352 images. The intensity measured in each AGIPD pixel and memory cell is described by two analog values: the analog signal and the gain stage information13. To calibrate this raw signal, the relevant set of calibration constants is required. The calibration constants are derived using dedicated data sets. The set of constants required for calibrating the raw data is also included in the data deposition.

The list of calibration constants for each of the 16 AGIPD modules is provided in Table 2. The gain = 3 dimension indexes the high, medium and low gain stages. The SlopesFF array contains the relative high gain slope and intercept as its first and second entries respectively, and is generated from separate single photon flat field intensity measurements used to identify the single photon peak position. The SlopesPC array contains the l = 11 coefficients derived from fits of the following functions to data collected with the internal calibration source (the so-called pulsed capacitor data), which is used to scan the high and medium intensity regions. First, the linear region of the high gain stage is fit with the linear function:

$$y={c}_{0}x+{c}_{1},$$

where \(c_l\), for \(l\in \{0,\ldots ,10\}\), denotes the entry with index l in the SlopesPC constants. The high gain to medium gain transition and the medium gain region are then fit with:

$$y={c}_{7}\,\exp \left(\frac{x+{c}_{5}}{{c}_{6}}\right)+{c}_{3}x+{c}_{4}.$$
Table 2 Calibration constants and corresponding file addresses and data dimensions used for calibrating the raw data from each of the 16 AGIPD modules at SPB/SFX.

The remaining parameters contain the residuals of the fit to the data. Parameter c2 describes the absolute relative deviation from linearity for the high gain region, c8 describes the absolute relative deviation from the linear part of the function in the medium gain region and c9 describes the threshold value for high and medium gain separation. The last parameter, c10, is unused in the current calibration implementation.
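The two fit functions can be written down compactly in code. The sketch below is illustrative only: it assumes the high gain linear fit has the form y = c0·x + c1 (slope c0, intercept c1), by analogy with the linear term c3·x + c4 of the medium gain function, and it uses made-up coefficient values. The function names are hypothetical and do not correspond to the EuXFEL calibration software.

```python
import numpy as np


def high_gain_linear(x, c):
    """Linear model for the high gain region (assumed form y = c0*x + c1)."""
    return c[0] * x + c[1]


def medium_gain_transition(x, c):
    """Exponential plus linear model for the high-to-medium gain transition
    and the medium gain region: y = c7*exp((x + c5)/c6) + c3*x + c4."""
    return c[7] * np.exp((x + c[5]) / c[6]) + c[3] * x + c[4]


# Made-up coefficients laid out like one SlopesPC entry (11 values, c0..c10).
c = np.zeros(11)
c[0], c[1] = 2.0, 1.0              # high gain slope and intercept (assumed order)
c[3], c[4] = 0.5, 3.0              # medium gain slope and intercept
c[5], c[6], c[7] = 0.0, -1.0, 1.0  # exponential transition term

x = np.array([0.0, 1.0, 2.0])
y_high = high_gain_linear(x, c)
y_medium = medium_gain_transition(x, c)
print(y_high)
print(y_medium)
```

With a negative c6 the exponential term decays as x grows, so the medium gain model approaches its linear part c3·x + c4 away from the transition.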

The ThresholdsDark array contains, for n = 0…4 respectively, the gain state threshold between high gain and medium gain, the threshold between medium gain and low gain, and the gain values for the high, medium and low gain stages. These are applied on a per pixel, per memory cell basis.

The calibration process consists of the following steps:

  1. Gain stage identification

    To identify the gain stage for each pixel and memory cell, so-called gain thresholding has to be performed. For this, the analogue gain signal of each pixel and memory cell is evaluated against the two threshold values from ThresholdsDark.

  2. Offset correction

    In this step, the appropriate gain stage offset from the Offset array is subtracted from the raw data.

    It was observed that the intensities of some pixels in offset-corrected images (using the constants derived from dark data) take negative values, and that the effect becomes stronger at higher intensities. To partially mitigate the issue, an opaque mask (“stripes”) was used to occlude a small area of each detector module. Using the information from this “shadowed” area, an additional offset adjustment should be performed on a per-image basis. The “baselineshift” offset value is calculated for each module separately.

  3. Gain correction

    Depending on the gain stage, memory cell, and x and y position, the result of the previous step is multiplied by a gain correction value.

    In addition, for pixels identified to be in the medium gain stage, an additional offset is added (i.e. the intercept from the linear fit for medium gain, which can be found in the SlopesPC array).
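The three steps above can be sketched for a single memory cell with NumPy. This is a simplified illustration, not the EuXFEL calibration pipeline: the function name, the array layouts and the toy values are all hypothetical, the gain signal is assumed to increase from high to medium to low gain, and the per-image baseline shift adjustment is omitted.

```python
import numpy as np


def calibrate_frame(analog, gain_signal, thresholds, offset, rel_gain,
                    mg_intercept=None):
    """Hypothetical single-memory-cell sketch of the calibration steps.

    analog, gain_signal : (ny, nx) raw analog signal and gain information
    thresholds          : (2, ny, nx) high/medium and medium/low thresholds
    offset, rel_gain    : (3, ny, nx) per gain stage offset and gain factor
    mg_intercept        : optional (ny, nx) offset added to medium gain pixels
    """
    # Step 1: gain stage identification (0 = high, 1 = medium, 2 = low).
    # Assumes the gain signal rises when a pixel switches to a lower gain.
    stage = ((gain_signal > thresholds[0]).astype(np.intp)
             + (gain_signal > thresholds[1]).astype(np.intp))

    # Step 2: subtract the offset belonging to the identified gain stage.
    stage_offset = np.take_along_axis(offset, stage[None], axis=0)[0]
    corrected = analog.astype(np.float64) - stage_offset

    # Step 3: multiply by the gain correction of the identified stage.
    stage_gain = np.take_along_axis(rel_gain, stage[None], axis=0)[0]
    corrected = corrected * stage_gain

    # Additional offset for medium gain pixels only.
    if mg_intercept is not None:
        corrected = corrected + np.where(stage == 1, mg_intercept, 0.0)
    return corrected, stage


# Toy 2 x 2 example with made-up constants.
ones = np.ones((3, 2, 2))
offset = np.array([10.0, 20.0, 30.0])[:, None, None] * ones
rel_gain = np.array([1.0, 2.0, 4.0])[:, None, None] * ones
thresholds = np.array([30.0, 70.0])[:, None, None] * np.ones((2, 2, 2))
analog = np.array([[100, 200], [300, 400]], dtype=np.uint16)
gain_signal = np.array([[10, 50], [90, 10]], dtype=np.uint16)

frame, stage = calibrate_frame(analog, gain_signal, thresholds, offset,
                               rel_gain, mg_intercept=5.0)
print(stage)
print(frame)
```

In this toy example the top-right pixel lands in medium gain and the bottom-left pixel in low gain, so each pixel receives the offset and gain factor of its own stage. The real constants are additionally indexed by memory cell, as listed in Table 2.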

Further information on calibration of AGIPD data and the generation of calibration constants can be found in the EuXFEL Report by J. Sztuk-Dambietz22.

Structure refinement

Each recorded run was processed independently using version 0.9.1 of the CrystFEL software suite23. Each frame was processed using peakfinder8 for peak identification and the peaks found were subsequently indexed using MOSFLM. Conservative values were used for the Bragg peak finding in this case. It has recently been shown that with improved hit-finding parameters and algorithms the number of frames where crystal diffraction is detected is greatly increased24. The integrated intensities were merged and processed using XSCALE from the XDS package25. The resulting reflection files were then passed to phenix.phaser using the PHENIX package GUI26. Molecular replacement methods were used to borrow phases from a modified lysozyme model (PDB:1IEE) in which side-chains with multiple conformations were simplified to the conformation with the highest occupancy. FreeR flags were added to 5% of the data via phenix, prior to any model refinement steps. Default model refinement steps, such as simulated annealing, rigid body, reciprocal space and real space refinement, were performed to acceptable data quality. The resulting unit cell parameters are shown in Tables 3–5.

Table 3 Individual data quality statistics and figures of merit for Run 79, Run 80 (50.8 m/s) and Run 95, Run 96 (44.0 m/s).
Table 4 Individual data quality statistics and figures of merit for Run 84, Run 85 (37.4 m/s) and Run 98, Run 99 (31.2 m/s).
Table 5 Data quality statistics and figures of merit for runs combined by jet velocities.

Data Records

The data deposited in the Coherent X-ray Imaging Data Bank (CXIDB)27 contain approximately 19 million images in HDF5 format. The data set is divided into runs, each of which contains about 10 minutes of data collection. The runs are further split across multiple HDF5 files. Raw data are located in the raw directory inside each run. Data are grouped in files according to detector and timestamp. Each AGIPD module is stored in a different file, while other 10 Hz data are stored across other files. For example, the first 500 trains of data from AGIPD module number 0 are stored in the file RAW-R0083-AGIPD00-S00000.h5. The calibrated data are then stored in CORR-R0083-AGIPD00-S00000.h5. The first 5000 trains of data in run 83 from the off-axis 10 Hz microscope are stored in the data aggregator 3 file RAW-R0083-DA03-S00000.h5. Files named CORR-RXXXX-AGIPD1MCTRLXX-SXXXXX.h5 contain detector-specific configurations which are for beamline debugging purposes and are not relevant to these data. Further information and a description of the data can be found in the online European XFEL data analysis documentation28. The data can be found in ref. 17. A description of relevant data and process variables is given in Tables 6, 7.
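The file names quoted above follow a regular pattern, which can be generated programmatically when iterating over modules and file sequences. The helper below is a hypothetical sketch inferred only from the example names in the text; it assumes that the run, module and sequence fields are zero-padded to 4, 2 and 5 digits respectively.

```python
def agipd_file_name(run, module, sequence, calibrated=False):
    """Build an AGIPD file name following the pattern of the examples above.

    run        : run number, e.g. 83
    module     : AGIPD module number, 0 to 15
    sequence   : index of the file within the run, starting at 0
    calibrated : use the CORR prefix instead of RAW
    """
    if not 0 <= module <= 15:
        raise ValueError("AGIPD has 16 modules, numbered 0 to 15")
    prefix = "CORR" if calibrated else "RAW"
    return f"{prefix}-R{run:04d}-AGIPD{module:02d}-S{sequence:05d}.h5"


print(agipd_file_name(83, 0, 0))                   # RAW-R0083-AGIPD00-S00000.h5
print(agipd_file_name(83, 0, 0, calibrated=True))  # CORR-R0083-AGIPD00-S00000.h5

# Names of the first raw file of every module in run 83.
first_files = [agipd_file_name(83, module, 0) for module in range(16)]
```

The dataset addresses inside each file are those listed in Tables 6 and 7 and are not reproduced here.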

Table 6 Relevant data sources and corresponding addresses within the deposited raw HDF5 data files.
Table 7 Relevant data sources and corresponding addresses within the deposited calibrated HDF5 data files.

Technical Validation

The calibrated diffraction data were analysed using the CrystFEL software suite23. The resulting unit cell showed excellent agreement with the well-known HEWL unit cell. The unit cell parameters for each run are given in Tables 3–5. The unit cell parameters for run 97 are not shown but are almost identical to those found in run 96.